An Empirical Approach to Predict Runaway Software Projects Using
Bayesian Classification

Osamu Mizuno    Tohru Kikuno

Graduate School of Information Science and Technology
Osaka University
1-5 Yamadaoka, Suita, Osaka, Japan


o-mizuno@ist.osaka-u.ac.jp      kikuno@ist.osaka-u.ac.jp

ABSTRACT

This paper proposes an automated approach to identify and predict
the runaway status of software development projects at an early stage
of the development process. So far, we have proposed a
questionnaire-based approach that uses logistic regression analysis
to identify risk factors empirically in software projects. Although
the risk factors were successfully identified from the questionnaire
in the previous approach, the prediction of runaway projects still
had several problems.

In order to solve these problems and to construct an automated
prediction system, we adopt Bayesian classification methods for the
prediction: projects are classified as runaway or successful by
applying a Bayesian classifier to the responses to the questionnaire.
The results of experiments using actual project data show that the
proposed approach can predict 87.5% of projects correctly.

Keywords: Software project prediction, Data-mining methods, Project
management

1. INTRODUCTION

Software development projects are required to produce highly reliable
systems within a short period and at low cost. These contradictory
requirements make development projects risky. In fact, many studies
have pointed out an increase in runaway projects in the software
development field [6,9].

Problems in software development must be detected and avoided as
early as possible. Thus, detecting signs of problems at an early
stage of the software project is important. If the detection of a
problem is delayed, it becomes more difficult to fix the problem,
since the effort of coping with a problem increases exponentially [1].

Much research has been carried out on the detection of problem signs
in software development projects [7, 16]. A case study of success
prediction has been carried out in [11]. Concern for risk management
is increasing as a means of detecting such problem signs early in
software projects. The Software Risk Evaluation method (SRE) is a
risk-management technique for software development projects [15].
In the SRE, the project's risks are identified using a taxonomy table
of software risks. The risk taxonomy table is very useful for
systematically identifying the risks of a project. However, since
the taxonomy table contains many risk attributes, extracting a risk
takes time. Therefore, SRE recommends tailoring the taxonomy table
for each project. The risks of a software development project are
influenced by environments such as the domain, the business style,
the culture of the organization, and human characteristics. In
projects with similar environments, an approach that prevents the
recurrence of problems by analyzing past problems is usually taken.
Such an approach is easily understood by the project members, since
it is based on problems that actually occurred.

We have proposed an approach to characterize runaway software
projects using logistic regression analysis [10, 13]. In this
approach, the risk factors were identified successfully and the
experimental prediction showed relatively good results. However, the
approach relies on the manual selection of attribute variables.
Thus, the prediction was not done automatically.

In order to construct an automated system to predict runaway software
projects, we adopt Bayesian classification for the prediction
system. Naive Bayesian classification is known to be optimal among
supervised learning methods when the values of the attributes of an
example are independent given the class of the example. Since there
are many extended versions of the Bayesian classifier, it is
necessary to identify the most accurate one for practical use. In
this paper, we compare the accuracy of different versions of the
Bayesian classifier on actual project data provided by a certain
company. By adopting the most accurate Bayesian classifier that we
found, the prediction system achieves an accuracy of 87.5% on
average for the applied data.


2. AUTOMATED PREDICTION OF SOFTWARE PROJECTS

Figure 1: Overview of our approach: Preparation phase

Figure 2: Overview of our approach: Prediction phase

2.1 Overview

Figure 1 shows an overview of the preparation phase of our approach,
in which a prediction model of runaway projects is prepared. In
Step 1, we design a questionnaire to be distributed to project
managers and leaders in order to collect the assessment data. The
questionnaire is constructed based on our previous work and various
software risk research. The Software Engineering Process Group
(SEPG) distributes the questionnaire to project members and asks
them to fill it out. Next, in Step 2, after the project members have
finished filling out the questionnaire, the SEPG collects the
responses and stores them in the project database. At the same time,
in Step 3, the SEPG defines the success variable and determines its
value for each development project from available project data.

In the prediction phase, as shown in Figure 2, we predict runaway
projects in the following steps. First, in Step 2, the project
members fill out the questionnaire, and the SEPG collects the
responses. Next, in Step 4, we apply a data-mining technique (in
this study, a Bayesian classifier) to the project database and
construct a prediction model to identify runaway projects. Finally,
we apply the prediction model to the collected responses and obtain
the prediction result, that is, whether a project is runaway or
successful.

The proposed approach is now being implemented in our laboratory.
The implemented system will automate the collection of questionnaire
responses, the construction of the prediction model, and the
feedback of predicted results.

The questionnaire consists of 22 questions that ask about possible
signs of problems in software development. Each question is answered
on an ordinal scale [5]. The details of the questionnaire design are
described in Section 3.1.

In our previous research [10], we constructed a logistic model with
five risk factors as its parameters. In this paper, we replace that
modeling technique with a more generic one, the Bayesian classifier,
since it is a more robust approach when applied to more empirical
data from various industries.


2.2 Naive Bayesian Classifier

Naive Bayesian classification is known to be optimal among supervised
learning methods when the values of the attributes of an example are
independent given the class of the example. Although this assumption
is almost always violated in practice, recent research has shown
that naive Bayesian learning is nevertheless effective in practice
[3]. We briefly introduce several fundamental concepts.

2.2.1 Bayesian learning

Let $Q_1, Q_2, \ldots, Q_n$ be the parameters with discrete values
used to predict a discrete class $C$. For example, $Q_i$ denotes a
question in the questionnaire and $C$ denotes the final status of a
project: runaway ($r$) or successful ($s$). Suppose that the values
$q_1, q_2, \ldots, q_n$ are given to these parameters; the optimal
prediction is then the class value $C = r$ such that
$P(C = r \mid Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n)$
is maximum. By Bayes' theorem, this probability is expressed as
follows:

$$\frac{P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n \mid C = r) \times P(C = r)}{P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n)}.$$

The probability $P(C = r)$ can be easily estimated from training
data. Furthermore, $P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n)$
is irrelevant to the class variable $C$. Therefore, learning is
reduced to the problem of estimating
$P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n \mid C = r)$
from training data. Using Bayes' theorem again, this conditional
probability can be written as follows:

$$P(Q_1 = q_1 \mid Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n, C = r) \times P(Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n \mid C = r).$$

The second factor of the above formula is recursively formulated as
follows:

$$P(Q_2 = q_2 \mid Q_3 = q_3 \wedge \cdots \wedge Q_n = q_n, C = r) \times P(Q_3 = q_3 \wedge \cdots \wedge Q_n = q_n \mid C = r).$$

Here, we assume that the $Q_i$ ($1 \leq i \leq n$) are independent of
each other. In other words, we assume

$$P(Q_1 = q_1 \mid Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n, C = r) = P(Q_1 = q_1 \mid C = r)$$

and so on. Then,
$P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n \mid C = r)$
equals

$$P(Q_1 = q_1 \mid C = r) \times P(Q_2 = q_2 \mid C = r) \times \cdots \times P(Q_n = q_n \mid C = r).$$

$P(Q_i \mid C = r)$ ($1 \leq i \leq n$) can be estimated from training
data. The process of Bayesian learning is thus performed.
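The estimation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the study: the function name `learn_naive_bayes`, the toy responses, and the use of plain relative frequencies (without any smoothing) are all hypothetical choices made here.

```python
from collections import Counter, defaultdict


def learn_naive_bayes(responses, labels):
    """Estimate P(C = c) and P(Qi = q | C = c) by relative frequency."""
    n_projects = len(labels)
    class_counts = Counter(labels)

    # prior[c] approximates P(C = c)
    prior = {c: count / n_projects for c, count in class_counts.items()}

    # value_counts[(i, c)][q] = number of class-c projects answering q to question i
    value_counts = defaultdict(Counter)
    for answers, c in zip(responses, labels):
        for i, q in enumerate(answers):
            value_counts[(i, c)][q] += 1

    # conditional[(i, c)][q] approximates P(Qi = q | C = c)
    conditional = {
        (i, c): {q: n / class_counts[c] for q, n in counts.items()}
        for (i, c), counts in value_counts.items()
    }
    return prior, conditional


# Hypothetical toy data: two questions, answers coded 0-3, classes r and s.
responses = [(3, 2), (3, 0), (0, 0), (1, 0)]
labels = ["r", "r", "s", "s"]
prior, conditional = learn_naive_bayes(responses, labels)
```

In this sketch an answer value never observed for a class simply has no entry, which corresponds to an estimated probability of zero; whether the original experiments applied any smoothing is not stated in the text.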


2.3 Bayesian classification

Using the result of learning, we can classify and predict the value
of $C$ if the values of $q_i$ are given.

$$P(C = r \mid Q_1 = q_1 \wedge \cdots \wedge Q_n = q_n) = \left( \prod_{i=1}^{n} P(Q_i = q_i \mid C = r) \right) P(C = r) / z$$

where $z$ is a normalizing constant. We can thus calculate the
conditional probability for any given values of the parameters $q_i$
using this equation. From the calculated probability, we classify
the data into either class, runaway or successful. Here, we
determine that a project is in class runaway if
$P(C = r \mid Q_1 = q_1 \wedge \cdots \wedge Q_n = q_n) \geq 0.5$.
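As a rough illustration of the classification rule above, the following Python sketch multiplies the class prior by the per-question conditional probabilities, normalizes over both classes, and applies the 0.5 threshold. The probability tables, the function names, and the handling of unseen answers are hypothetical and chosen only to keep the example small.

```python
def posterior_runaway(answers, prior, conditional):
    """Return P(C = r | answers) under the naive independence assumption."""
    unnormalized = {}
    for c in prior:
        score = prior[c]
        for i, q in enumerate(answers):
            # An answer never seen for this class gets probability 0 in this sketch.
            score *= conditional[(i, c)].get(q, 0.0)
        unnormalized[c] = score
    z = sum(unnormalized.values())  # normalizing constant
    if z == 0.0:
        raise ValueError("answers have zero probability under every class")
    return unnormalized["r"] / z


def classify(answers, prior, conditional, threshold=0.5):
    """Label a project runaway ('r') if its posterior reaches the threshold."""
    p = posterior_runaway(answers, prior, conditional)
    return "r" if p >= threshold else "s"


# Hypothetical probability tables for two questions.
prior = {"r": 0.5, "s": 0.5}
conditional = {
    (0, "r"): {3: 0.8, 0: 0.2},
    (0, "s"): {3: 0.2, 0: 0.8},
    (1, "r"): {2: 0.5, 0: 0.5},
    (1, "s"): {2: 0.25, 0: 0.75},
}
p = posterior_runaway((3, 2), prior, conditional)
```

For the answers (3, 2), the unnormalized scores are 0.5 x 0.8 x 0.5 = 0.2 for runaway and 0.5 x 0.2 x 0.25 = 0.025 for successful, so the posterior is 0.2 / 0.225, roughly 0.89, and the project is labeled runaway.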


3. METRICS FOR PROJECT PREDICTION

3.1 Design of the Questionnaire

In this study, we have investigated various works [2,4,8,12] regarding
risk management and the experience of a certain company. Based on
the results of this investigation, we have summarized all key risk
factors and classified them into the following five viewpoints: (1)
Requirements, (2) Estimations, (3) Planning capability, (4) Team
organization, and (5) Project management activities. An overview of
the questionnaire is given below.

3.1.1 Requirements The Requirements viewpoint includes factors which
are related to the understanding and commitment of the requirements
among the project members. The factors for the requirements viewpoint
are distinguished as follows:

(1.1) Ambiguous requirements

(1.2) Insufficient explanation of the requirements

(1.3) Misunderstanding of the requirements

(1.4) Lack of commitment regarding requirements between the
customer and the project members

(1.5) Frequent requirement changes

3.1.2 Estimations The Estimations viewpoint includes factors related
to the estimation itself, the technical methods for carrying out
the estimation, and the commitment between project members and
customers. The factors for the estimation viewpoint are distinguished
as follows:

(2.1) Insufficient awareness of the importance of the estimation

(2.2) Insufficient skills or knowledge of the estimation method

(2.3) Insufficient estimation for the implicit requirements

(2.4) Insufficient estimation for the technical issues

(2.5) Lack of stakeholders' commitment for estimation

3.1.3 Planning The Planning viewpoint includes factors related to
the planning or scheduling activity and the commitment for the
project plan among project members. The factors for the planning
viewpoint are distinguished as follows:

(3.1) Lack of management review for the project plan

(3.2) Lack of assignment of responsibility

(3.3) Lack of breakdown of the work products

(3.4) Unspecified project review milestones

(3.5) Insufficient planning of project monitoring and controlling

(3.6) Lack of project members' commitment for the project plan

3.1.4 Team organization The Team organization viewpoint includes
factors related to the staffing of the projects and to the
fundamental skills, experience, and morale of project members. The
factors for the team organization viewpoint are as follows:

(4.1) Lack of skills and experience

(4.2) Insufficient allocation of resources

(4.3) Low morale

3.1.5 Project management activities The Project management activities
viewpoint includes factors related to the project management
activities. The factors which distinguish the project management
activities viewpoint are as follows:

(5.1) Project manager's lack of resource management throughout a
project

(5.2) Inadequate project monitoring and controlling

(5.3) Lack of data needed to keep objective track of a project


3.2 Result of Questionnaire

The questionnaire was delivered to 40 development projects in a
certain company. The responses were collected one month later. These
projects were actual industrial developments of embedded systems,
carried out from 1996 to 1998. For more details of the project data
used in our study, please refer to [10, 13].

Table 1: Projects used for our experiment

The SEPG distributed the questionnaires to the project managers or
the project leaders of the 40 target projects, and explained the
details of the questionnaire and the purpose of the trial. The
responses to the questionnaire were collected by the SEPG after one
month. Table 1 shows the collected responses. In Table 1, the answers
"Strongly agree", "Agree", "Neither agree nor disagree", and
"Disagree" are shown as the characters 3, 2, 1, and 0, respectively.
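The coding of answers used in Table 1 can be expressed as a simple lookup. The sketch below is illustrative only; the names `ANSWER_CODES` and `encode_response` are hypothetical, and the sample answers are not taken from the project data.

```python
# Mapping from questionnaire answers to the codes shown in Table 1.
ANSWER_CODES = {
    "Strongly agree": 3,
    "Agree": 2,
    "Neither agree nor disagree": 1,
    "Disagree": 0,
}


def encode_response(answers):
    """Convert one project's textual answers into integer codes."""
    return [ANSWER_CODES[a] for a in answers]


# Hypothetical answers to three questions.
encoded = encode_response(["Agree", "Disagree", "Strongly agree"])
```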

Since all of these projects had completed their development, the
SEPG had already identified the runaway projects according to the
decision process mentioned in Section 2. As a result, 13 projects
out of 40 were classified as runaway projects. The column Result in
Table 1 shows the actual result of this classification.


4. EXPERIMENTAL PREDICTION

In this experiment, we apply the proposed runaway prediction approach
to the questionnaire response data obtained from actual projects in
a certain company. By applying our approach, we identify the most
accurate Bayesian classification approach for the prediction of
runaway projects.

4.1 Empirical Data for Experiment

The software development projects used in this study were for
embedded software. As mentioned in Section 3.2, we collected 40
questionnaire responses from the 40 projects shown in Table 1.


4.2 Bayes-based Data-Mining Methods

Several data-mining methods based on Bayesian classification have
been proposed so far in order to deal with various assumptions and
environments. They are implemented in major data-mining tools. As
a data-mining tool, we adopted Weka [14] in this experiment. We
adopted the following methods, which are implemented in Weka and
are applicable to the data we used:

1. Naive Bayesian classifier: Naive-Bayes

2. Naive Bayesian classifier with assumption of normal distribution:
Naive-Bayes(ND)

3. Naive Bayesian classifier with kernel estimation: Naive-Bayes(KD)

4. Bayesian Network: Bayes-Net

5. Complement Naive Bayes: Complement-Naive-Bayes

6. Multinomial Naive Bayes: Naive-Bayes-Multinomial


Table 2: Result of Jackknife validation for each Bayes-based
classification method

Naive-Bayes is the simplest Bayesian classifier and can use only
nominal variables. Naive-Bayes(ND) and Naive-Bayes(KD) are extended
versions of Naive-Bayes that deal with continuous variables. Since
Bayes-Net adopts a network structure to represent dependencies
between attribute variables, it can represent more complex
relationships. Complement-Naive-Bayes and Naive-Bayes-Multinomial
are also extended versions of Naive-Bayes, designed to deal with
uneven distributions in datasets.

Since we had no prior knowledge of the accuracy of these methods,
we performed the prediction procedure using all of them.
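To make the idea behind Naive-Bayes(ND) concrete, the following Python sketch models each attribute with a per-class normal distribution and compares log scores. It is not Weka's implementation; the function names, the variance floor `min_variance`, and the toy data are assumptions introduced purely for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels, min_variance=1e-3):
    """Return {class: (prior, [(mean, variance), ...])} estimated from the data."""
    by_class = defaultdict(list)
    for row, c in zip(rows, labels):
        by_class[c].append(row)

    model = {}
    for c, members in by_class.items():
        stats = []
        for column in zip(*members):  # one column per attribute
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            # Floor the variance so a constant attribute does not divide by zero.
            stats.append((mean, max(var, min_variance)))
        model[c] = (len(members) / len(rows), stats)
    return model


def log_score(row, prior, stats):
    """Log of prior times the product of normal densities for each attribute."""
    total = math.log(prior)
    for x, (mean, var) in zip(row, stats):
        total += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
    return total


def predict(model, row):
    """Pick the class with the highest log score."""
    return max(model, key=lambda c: log_score(row, *model[c]))


# Hypothetical toy data: two attributes coded 0-3.
rows = [(3, 2), (2, 3), (0, 1), (1, 0)]
labels = ["r", "r", "s", "s"]
model = fit_gaussian_nb(rows, labels)
```

Working with log scores rather than raw products avoids numerical underflow when many attributes are multiplied together, which matters more as the number of questions grows.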


4.3 Jackknife Validation

In order to evaluate the accuracy of the prediction results, we
perform cross validation on the collected data. Evaluation through
k-fold cross validation is one of the most common methods in the
machine learning community. The data set is split into k equally
sized subsets, and in the i-th iteration (i = 1, ..., k) the i-th
subset is used for testing the classifier that has been learned from
all the remaining subsets. Notice that each instance of data is
classified exactly once. In particular, k-fold cross validation is
called Jackknife validation when k is equal to the number of
instances in the data set.

Let us show an example of Jackknife validation using the data in
Table 1. First, we choose one project for testing, and the remaining
39 projects are used for learning. The selected project is then
classified by the learned prediction model. The same procedure is
performed for each project in Table 1. Finally, we obtain 40 testing
results to evaluate the overall accuracy of the prediction.
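The Jackknife procedure can be sketched generically in Python as follows. The function `jackknife_accuracy` and the stand-in majority-class learner are hypothetical; any pair of training and prediction functions with the same shape could be substituted, and the toy data are not the project data of Table 1.

```python
from collections import Counter


def jackknife_accuracy(samples, labels, train, predict):
    """Leave each instance out once, train on the rest, and test on it."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Stand-in learner (hypothetical): always predicts the majority training class.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]


def predict_majority(model, x):
    return model


samples = [(3,), (3,), (0,), (0,), (1,)]
labels = ["r", "s", "s", "s", "s"]
acc = jackknife_accuracy(samples, labels, train_majority, predict_majority)
```

With these five toy instances, the single runaway project is misclassified when it is held out and the four successful ones are classified correctly, giving an accuracy of 4/5.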


4.4 Result of Prediction

We have applied Jackknife validation to 6 Bayes-based data-mining
methods and obtained the accuracy for each method. Table 2 shows
the result of experiments.

In Table 2, we can see that Naive-Bayes(ND) has the highest accuracy
of all (0.875); it predicts 35 out of 40 projects correctly.
Naive-Bayes(KD) has the second highest accuracy (0.850). It is
remarkable that we can predict the status of projects with 87.5%
accuracy on average.

Moreover, most of the Bayes-based data-mining methods used in this
experiment show high accuracy. This result implies that our decision
to select a Bayesian classifier was relatively appropriate.
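For reference, the reported accuracy follows directly from the counts given in the text. The second line is not reported in the paper; it is a simple comparison derived here from the 13 runaway projects out of 40 stated in Section 3.2, showing what a trivial rule that labels every project successful would score on the same data.

$$\text{accuracy}_{\text{Naive-Bayes(ND)}} = \frac{35}{40} = 0.875$$

$$\text{accuracy}_{\text{always successful}} = \frac{40 - 13}{40} = \frac{27}{40} = 0.675$$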


5. DISCUSSION

At this point, we have relatively high accuracy in predicting runaway
software projects using the Bayesian classifier. However, we have
identified several problems as follows:

1. We cannot offer any suggestions for improvements.

Our proposed approach can only offer a probability of being runaway
for each project. What developers really want to know is what should
be done to avoid runaway status. In order to identify the most
influential factors that make a project runaway, we will introduce
methods to extract such factors. For example, we expect that
logistic regression analysis or decision tree analysis may be useful
for risk identification. Combining the prediction of projects and
the identification of risk factors is one of the most interesting
topics for future research.


2. The amount of applied data is too small.

Since our research is in an experimental stage, we cannot collect
much more data immediately. In particular, the preparation phase
(see Figure 1) requires considerable effort and time. We are now
contacting several industries and preparing to apply our approach.
Conducting many more experiments in other environments is part of
our future work.


6. CONCLUSION

In this paper, we proposed an automated approach to predict runaway
software projects using the Bayesian classification technique. We
confirmed that the Bayesian classification model is applicable to
the questionnaire response data and that the prediction can be done
with high accuracy. Since the data used in this study are the
information regarding the final status of projects and the responses
to the questionnaire about software risks, these data can be
collected easily in actual development fields.

As future work, it is necessary to run experiments on other datasets
in order to obtain more generic results. It is also important to
extend our approach to provide suggestions for avoiding runaway
projects.

Acknowledgment

The authors would like to thank Dr. Yasunari Takagi of OMRON
Corporation, who gave us empirical data on their software development
and invaluable advice on this research.

This research was partially supported by the Ministry of Education,
Science, Sports, and Culture in Japan, Grant-in-Aid for Scientific
Research(C), Grant No. 15500022 (2003-2005).


7. REFERENCES

[1] B. W. Boehm. Industrial software metrics top 10 list. IEEE
Software, 4(5):84-85, 1987.

[2] E. H. Conrow and P. S. Shishido. Implementing risk management
on software intensive projects. IEEE Software, 14(3):83-89, 1997.

[3] P. Domingos and M. J. Pazzani. On the optimality of the simple
Bayesian classifier under zero-one loss. Machine Learning,
29(2-3):103-130, 1997.

[4] R. Fairley and P. Rook. Risk management for software development.
In Software Engineering, pages 387-400. IEEE CS Press, 1997.

[5] N. E. Fenton and S. L. Pfleeger. Software Metrics: A Rigorous &
Practical Approach. PWS Publishing, 1997.

[6] R. L. Glass. How not to prepare for a consulting assignment and
other ugly consultancy truths. Communications of the ACM,
41(12):11-13, 1998.

[7] J. Jiang and G. Klein. Software development risks to project
effectiveness. Journal of Systems and Software, 52:3-10, 2000.

[8] D. W. Karolak. Software Engineering Risk Management. IEEE CS
Press, CA, 1996.

[9] S. McConnell. Rapid Development. Microsoft Press, 1996.

[10] O. Mizuno, T. Kikuno, Y. Takagi, and K. Sakamoto. Characterization
of risky projects based on project managers' evaluation. In Proc.
of 22nd International Conference on Software Engineering, pages
387-395, 2000.

[11] J. D. Procaccino, J. M. Verner, S. P. Overmyer, and M. E. Darter.
Case study: factors for early prediction of software development
success. Information and Software Technology, 44:53-62, 2002.

[12] F. J. Sisti and S. Joseph. Software risk evaluation method
version 1.0. Technical Report CMU/SEI-94-TR-19, Software Engineering
Institute, 1994.

[13] Y. Takagi, O. Mizuno, and T. Kikuno. An empirical approach to
characterizing risky software projects based on logistic regression
analysis. Empirical Software Engineering, (to appear).

[14] Weka Machine Learning Project. Weka 3: Data mining software
in Java. http://www.cs.waikato.ac.nz/~ml/weka/.

[15] R. C. Williams, G. J. Pandelios, and S. G. Behrens. Software
risk evaluation (SRE) method description (version 2.0). Technical
Report CMU/SEI-99-TR-029, Software Engineering Institute, 1999.

[16] C. Wohlin and A. A. Andrews. Prioritizing and assessing software
project success factors and project characteristics using subjective
data. Empirical Software Engineering, 8:285-303, 2003.