An Empirical Approach to Predict Runaway Software Projects Using Bayesian Classification

Osamu Mizuno and Tohru Kikuno
Graduate School of Information Science and Technology, Osaka University
1-5 Yamadaoka, Suita, Osaka, Japan
o-mizuno@ist.osaka-u.ac.jp, kikuno@ist.osaka-u.ac.jp

ABSTRACT

This paper proposes an automated approach to identify and predict the runaway status of software development projects at an early stage of the development process. We previously proposed a questionnaire-based approach that uses logistic regression analysis to identify risk factors empirically in software projects. Although that approach successfully identified the risk factors from the questionnaire, its prediction of runaway projects still had several problems. To solve these problems and to construct an automated prediction system, we adopt Bayesian classification: projects are classified as "runaway" or "successful" by applying a Bayesian classifier to the responses to the questionnaire. Experiments using actual project data show that the proposed approach predicts 87.5% of projects correctly.

Keywords: Software project prediction, Data-mining methods, Project management

1. INTRODUCTION

Software development projects are required to produce highly reliable systems within a short period and at low cost. These contradictory requirements make development projects risky. In fact, many studies have pointed out an increase in runaway projects in the software development field [6, 9]. Problems in software development must be detected and avoided as early as possible, so detecting signs of problems at an early stage of a software project is important. If the detection of a problem is delayed, the problem becomes more difficult to fix, since the effort of coping with it increases exponentially [1].
Much research has been carried out on the detection of problem signs in software development projects [7, 16]. A case study of success prediction is reported in [11]. Interest in risk management is growing as a means of detecting such problem signs early in software projects. The Software Risk Evaluation method (SRE) is a risk-management technique for software development projects [15]. In the SRE, a project's risks are identified using a taxonomy table of software risks. The risk taxonomy table is very useful for systematically identifying the risks of a project. However, since the taxonomy table contains many risk attributes, extracting risks with it takes time. Therefore, the SRE recommends tailoring the taxonomy table for each project.

Risks of a software development project are influenced by environments such as the domain, the business style, the culture of the organization, and human characteristics. In projects with similar environments, a common approach is to prevent the recurrence of problems by analyzing past problems. Such an approach is easily understood by the project members, since it is based on problems that actually occurred.

We have proposed an approach to characterize runaway software projects using logistic regression analysis [10, 13]. In this approach, the risk factors are identified successfully and the experimental prediction shows relatively good results. However, the approach relies on the manual selection of attribute variables, so the prediction was not done automatically.

In order to construct an automated system to predict runaway software projects, we adopt Bayesian classification in the prediction system. The naive Bayesian classifier is known to be optimal among supervised learning methods when the values of the attributes of an example are independent given the class of the example.
Since there are many extended versions of the Bayesian classifier, it is necessary to identify the most accurate one for practical use. In this paper, we compare the accuracy of different versions of the Bayesian classifier on actual project data provided by a certain company. With the most accurate Bayesian classifier that we found, the prediction system achieves an accuracy of 87.5% on average for the applied data.

2. AUTOMATED PREDICTION OF SOFTWARE PROJECTS

Figure 1: Overview of our approach: Preparation phase

Figure 2: Overview of our approach: Prediction phase

2.1 Overview

Figure 1 shows an overview of the preparation phase of our approach, in which the data for a prediction model of runaway projects are prepared. In Step 1, we designed a questionnaire to be distributed to project managers and leaders in order to collect the assessment data. The questionnaire was constructed based on our previous work and various software risk research. The SEPG (Software Engineering Process Group) distributed the questionnaire to project members and asked them to fill it out. Next, in Step 2, after the project members finished filling out the questionnaire, the SEPG collected the responses and stored them in the project database. At the same time, in Step 3, the SEPG defined the success variable and determined its value for each development project from available project data.

In the prediction phase, shown in Figure 2, we predict runaway projects in the following steps. First, in Step 2, the project members fill out the questionnaire, and the SEPG collects the responses. Next, in Step 4, we apply a data-mining technique (in this study, a Bayesian classifier) to the project database and construct a prediction model to identify runaway projects. Finally, we apply the prediction model to the collected responses to the questionnaire and obtain a prediction result, that is, whether a project is runaway or successful. The proposed approach is now being implemented in our laboratory.
The implemented system will automate the collection of questionnaires, the construction of the prediction model, and the feedback of predicted results. The questionnaire consists of 22 questions that ask about possible signs of problems in software development. Each question is answered on an ordinal scale [5]. The details of the questionnaire design are described in Section 3.1. In our previous research [10], we constructed a logistic model with five risk factors as parameters. In this paper, we replace that modeling technique with a more generic one, the Bayesian classifier, since it is a more robust approach for applying empirical data from various industries.

2.2 Naive Bayesian Classifier

The naive Bayesian classifier is known to be optimal among supervised learning methods when the values of the attributes of an example are independent given the class of the example. Although this assumption is almost always violated in practice, recent research has shown that naive Bayesian learning is nevertheless effective in practice [3]. We briefly introduce several fundamental concepts.

2.2.1 Bayesian learning

Let $Q_1, Q_2, \ldots, Q_n$ be parameters with discrete values used to predict a discrete class $C$. For example, $Q_i$ denotes a question of the questionnaire and $C$ denotes the final status of a project: runaway ($r$) or successful ($s$). Suppose that the values $q_1, q_2, \ldots, q_n$ are given to these parameters. The optimal prediction is then the class value $C = r$ such that $P(C = r \mid Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n)$ is maximum. By Bayes' theorem, this probability is expressed as follows:

$$\frac{P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n \mid C = r) \cdot P(C = r)}{P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n)}$$

The probability $P(C = r)$ can be easily estimated from training data. Furthermore, $P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n)$ does not depend on the class variable $C$. Therefore, learning is reduced to the problem of estimating $P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n \mid C = r)$ from training data.
Using Bayes' theorem again, the conditional probability $P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n \mid C = r)$ can be written as follows:

$$P(Q_1 = q_1 \mid Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n, C = r) \cdot P(Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n \mid C = r)$$

The second factor of the above formula is recursively expanded as follows:

$$P(Q_2 = q_2 \mid Q_3 = q_3 \wedge \cdots \wedge Q_n = q_n, C = r) \cdot P(Q_3 = q_3 \wedge \cdots \wedge Q_n = q_n \mid C = r)$$

Here, we assume that the $Q_i$ ($1 \le i \le n$) are independent of each other given the class. In other words, we assume

$$P(Q_1 = q_1 \mid Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n, C = r) = P(Q_1 = q_1 \mid C = r)$$

and so on. Then, $P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n \mid C = r)$ equals

$$P(Q_1 = q_1 \mid C = r) \cdot P(Q_2 = q_2 \mid C = r) \cdots P(Q_n = q_n \mid C = r).$$

Each $P(Q_i = q_i \mid C = r)$ ($1 \le i \le n$) can be estimated from training data. The process of Bayesian learning is thus performed.

2.3 Bayesian classification

Using the result of learning, we can classify and predict the value of $C$ when the values $q_i$ are given:

$$P(C = r \mid Q_1 = q_1 \wedge \cdots \wedge Q_n = q_n) = \left( \prod_{i=1}^{n} P(Q_i = q_i \mid C = r) \right) \cdot P(C = r) / z$$

where $z$ is a normalizing constant. We can thus calculate the conditional probability for any given values $q_i$ of the parameters using this equation. From the calculated probability, we classify the data into one of the two classes, runaway or successful. Here, we decide that a project is in class "runaway" if $P(C = r \mid Q_1 = q_1 \wedge \cdots \wedge Q_n = q_n) \ge 0.5$.

3. METRICS FOR PROJECT PREDICTION

3.1 Design of the Questionnaire

In this study, we investigated various works [2, 4, 8, 12] on risk management as well as the experience of a certain company. Based on the results of this investigation, we summarized all key risk factors and classified them into the following five viewpoints: (1) Requirements, (2) Estimations, (3) Team organization, (4) Planning capability, and (5) Project management activities. An overview of the questionnaire follows.

3.1.1 Requirements

The Requirements viewpoint includes factors related to the understanding of and commitment to the requirements among the project members.
The factors for the requirements viewpoint are distinguished as follows:

(1.1) Ambiguous requirements
(1.2) Insufficient explanation of the requirements
(1.3) Misunderstanding of the requirements
(1.4) Lack of commitment regarding requirements between the customer and the project members
(1.5) Frequent requirement changes

3.1.2 Estimations

The Estimations viewpoint includes factors related to the estimation itself, the technical methods for carrying out the estimation, and the commitment between project members and customers. The factors for the estimation viewpoint are distinguished as follows:

(2.1) Insufficient awareness of the importance of the estimation
(2.2) Insufficient skills or knowledge of the estimation method
(2.3) Insufficient estimation for the implicit requirements
(2.4) Insufficient estimation for the technical issues
(2.5) Lack of stakeholders' commitment for estimation

3.1.3 Planning

The Planning viewpoint includes factors related to the planning or scheduling activity and the commitment for the project plan among project members. The factors for the planning viewpoint are distinguished as follows:

(3.1) Lack of management review for the project plan
(3.2) Lack of assignment of responsibility
(3.3) Lack of breakdown of the work products
(3.4) Unspecified project review milestones
(3.5) Insufficient planning of project monitoring and controlling
(3.6) Lack of project members' commitment for the project plan

3.1.4 Team organization

The Team organization viewpoint includes factors related to the staffing of the projects and to the fundamental skills, experience, and morale of project members. The factors for the team organization viewpoint are as follows:

(4.1) Lack of skills and experience
(4.2) Insufficient allocation of resources
(4.3) Low morale

3.1.5 Project management activities

The Project management activities viewpoint includes factors related to the project management activities.
The factors for the project management activities viewpoint are as follows:

(5.1) Project manager's lack of resource management throughout a project
(5.2) Inadequate project monitoring and controlling
(5.3) Lack of data needed to keep objective track of a project

3.2 Result of Questionnaire

The questionnaire was delivered to 40 development projects in a certain company. The responses were collected one month later. These projects were actual developments of embedded systems in industry, carried out from 1996 to 1998. For more details of the project data used in our study, please refer to [10, 13].

The SEPG distributed the questionnaires to the project managers or project leaders of the 40 target projects and explained the details of the questionnaire and the purpose of the trial. The responses to the questionnaire were collected by the SEPG after one month. Table 1 shows the collected responses.

Table 1: Projects used for our experiment

In Table 1, the answers "Strongly agree", "Agree", "Neither agree nor disagree", and "Disagree" are shown as the characters "3", "2", "1", and "0", respectively. Since all of these projects had completed their development, the SEPG had already identified the runaway projects according to the decision process mentioned in Section 2. As a result, 13 of the 40 projects were classified as runaway projects. The column "Result" in Table 1 shows the actual result of this classification.

4. EXPERIMENTAL PREDICTION

In this experiment, we apply the proposed runaway prediction approach to the questionnaire response data obtained from actual projects in a certain company. By applying our approach, we identify the most accurate Bayesian classification method for the prediction of runaway projects.

4.1 Empirical Data for Experiment

The software development projects used in this study were for embedded software.
As mentioned in Section 3.2, we collected questionnaire responses from the 40 projects shown in Table 1.

4.2 Bayes-based Data-Mining Methods

Several data-mining methods based on Bayesian classification have been proposed so far in order to deal with various assumptions and environments. They are implemented in major data-mining tools. As the data-mining tool for this experiment, we adopted Weka [14]. We used the following methods, which are implemented in Weka and are applicable to the data we used:

1. Naive Bayesian classifier: Naive-Bayes
2. Naive Bayesian classifier with assumption of normal distribution: Naive-Bayes(ND)
3. Naive Bayesian classifier with kernel estimation: Naive-Bayes(KD)
4. Bayesian Network: Bayes-Net
5. Complement Naive Bayes: Complement-Naive-Bayes
6. Multinomial Naive Bayes: Naive-Bayes-Multinomial

Table 2: Result of Jackknife validation for each Bayes-based classification method

Simply speaking, "Naive-Bayes" is the simplest Bayesian classifier and can use only nominal variables. "Naive-Bayes(ND)" and "Naive-Bayes(KD)" are extended versions of Naive-Bayes that deal with continuous variables. Since "Bayes-Net" adopts a network structure to represent dependencies between attribute variables, it can represent more complex relationships. "Complement-Naive-Bayes" and "Naive-Bayes-Multinomial" are also extended versions of Naive-Bayes, designed to deal with unevenly distributed datasets. Since we had no prior knowledge of the accuracy of these methods, we performed the prediction procedure using all of them.

4.3 Jackknife Validation

In order to evaluate the accuracy of the prediction results, we perform cross validation on the collected data. Evaluation through k-fold cross validation is one of the most common methods in the machine learning community.
The data set is split into $k$ equally sized subsets, and in the $i$-th iteration ($i = 1, \ldots, k$) the $i$-th subset is used for testing the classifier that has been learned from all the remaining subsets. Notice that each instance of data is classified exactly once. In particular, k-fold cross validation is called Jackknife validation when $k$ is equal to the number of instances in the data set.

Let us show an example of Jackknife validation using the data in Table 1. First, we choose one project for testing, and the remaining 39 projects are used for learning. The selected project is then classified by the learned prediction model. The same procedure is performed for each project in Table 1. Finally, we obtain 40 testing results to evaluate the overall accuracy of the prediction.

4.4 Result of Prediction

We applied Jackknife validation to the 6 Bayes-based data-mining methods and obtained the accuracy of each method. Table 2 shows the results of the experiments. In Table 2, we can see that "Naive-Bayes(ND)" has the highest accuracy of all (0.875); it predicts 35 out of 40 projects correctly. "Naive-Bayes(KD)" has the second highest accuracy (0.850). It is remarkable that we can predict the status of projects with 87.5% accuracy on average. In addition, most of the Bayes-based data-mining methods used in this experiment show high accuracy. This result implies that our decision to select the Bayesian classifier was relatively appropriate.

5. DISCUSSION

At this point, we have achieved relatively high accuracy in predicting runaway software projects using a Bayesian classifier. However, we have identified several problems as follows:

1. We cannot offer any suggestions for improvements. Our proposed approach can only offer a probability of being runaway for each project. What developers really want to know is what should be done to avoid runaway status. In order to identify the most influential factors that make a project runaway, we will introduce methods to extract such factors.
For example, we expect that logistic regression analysis or decision tree analysis may be useful for risk identification. Combining the prediction of projects with the identification of risk factors is one of the most interesting directions for future research.

2. The number of applied data is too small. Since our research is at an experimental stage, we cannot collect much more data immediately. In particular, the preparation phase (see Figure 1) requires considerable effort and time. We are now contacting several companies and preparing to apply our approach. Conducting more experiments in other environments is part of our future work.

6. CONCLUSION

In this paper, we proposed an automated approach to predict runaway software projects using the Bayesian classification technique. We confirmed that the Bayesian classification model is applicable to the questionnaire response data and that the prediction can be done with high accuracy. Since the data used in this study consist of the final status of projects and the responses to the questionnaire about software risks, such data can be collected easily in actual development fields.

As future research, it is necessary to run experiments on other datasets in order to obtain more general results. It is also important to extend our approach to provide suggestions for avoiding runaway projects.

Acknowledgment

The authors would like to thank Dr. Yasunari Takagi of OMRON Corporation, who gave us empirical data on their software development and invaluable advice on this research. This research was partially supported by the Ministry of Education, Science, Sports, and Culture in Japan, Grant-in-Aid for Scientific Research (C), Grant No. 15500022 (2003-2005).

7. REFERENCES

[1] B. W. Boehm. Industrial software metrics top 10 list. IEEE Software, 4(5):84-85, 1987.
[2] E. H. Conrow and P. S. Shishido. Implementing risk management on software intensive projects. IEEE Software, 14(3):83-89, 1997.
[3] P. Domingos and M. J. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103-130, 1997.
[4] R. Fairley and P. Rook. Risk management for software development. In Software Engineering, pages 387-400. IEEE CS Press, 1997.
[5] N. E. Fenton and S. L. Pfleeger. Software Metrics: A Rigorous & Practical Approach. PWS Publishing, 1997.
[6] R. L. Glass. How not to prepare for a consulting assignment and other ugly consultancy truths. Communications of the ACM, 41(12):11-13, 1998.
[7] J. Jiang and G. Klein. Software development risks to project effectiveness. Journal of Systems and Software, 52:3-10, 2000.
[8] D. W. Karolak. Software Engineering Risk Management. IEEE CS Press, CA, 1996.
[9] S. McConnell. Rapid Development. Microsoft Press, 1996.
[10] O. Mizuno, T. Kikuno, Y. Takagi, and K. Sakamoto. Characterization of risky projects based on project managers' evaluation. In Proc. of 22nd International Conference on Software Engineering, pages 387-395, 2000.
[11] J. D. Procaccino, J. M. Verner, S. P. Overmyer, and M. E. Darter. Case study: factors for early prediction of software development success. Information and Software Technology, 44:53-62, 2002.
[12] F. J. Sisti and S. Joseph. Software risk evaluation method version 1.0. Technical Report CMU/SEI-94-TR-19, Software Engineering Institute, 1994.
[13] Y. Takagi, O. Mizuno, and T. Kikuno. An empirical approach to characterizing risky software projects based on logistic regression analysis. Empirical Software Engineering, (to appear).
[14] Weka Machine Learning Project. Weka 3: Data mining software in Java. http://www.cs.waikato.ac.nz/ml/weka/.
[15] R. C. Williams, G. J. Pandelios, and S. G. Behrens. Software risk evaluation (SRE) method description (version 2.0). Technical Report CMU/SEI-99-TR-029, Software Engineering Institute, 1999.
[16] C. Wohlin and A. A. Andrews. Prioritizing and assessing software project success factors and project characteristics using subjective data. Empirical Software Engineering, 8:285-303, 2003.
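
As an illustration of the classification rule in Section 2.3 and the Jackknife validation in Section 4.3, the following Python sketch trains a naive Bayesian classifier on ordinal questionnaire answers and evaluates it by leaving one project out at a time. It is only a minimal sketch and not the Weka implementation used in the experiments. The function names, the Laplace smoothing constant `alpha` (added so that unseen answers do not yield zero probabilities), and the toy responses in `ROWS` and `LABELS` are assumptions introduced for illustration; they are not the data of Table 1.

```python
from collections import Counter

LEVELS = (0, 1, 2, 3)  # answer codes, as used for the questionnaire responses

# Hypothetical toy data: 8 projects, 3 questions each (NOT the data of Table 1).
ROWS = [[3, 3, 2], [3, 2, 3], [2, 3, 3],
        [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 0]]
LABELS = ["r", "r", "r", "s", "s", "s", "s", "s"]  # r = runaway, s = successful


def train(rows, labels, alpha=1.0):
    """Estimate P(C) and P(Q_i = q | C) from training data.

    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    n = len(rows[0])
    class_counts = Counter(labels)
    # value_counts[c][i][q]: number of class-c projects answering q to question i
    value_counts = {c: [Counter() for _ in range(n)] for c in class_counts}
    for row, c in zip(rows, labels):
        for i, q in enumerate(row):
            value_counts[c][i][q] += 1
    prior = {c: class_counts[c] / len(labels) for c in class_counts}

    def likelihood(c, i, q):
        return (value_counts[c][i][q] + alpha) / (class_counts[c] + alpha * len(LEVELS))

    return prior, likelihood


def posterior_runaway(model, row):
    """Return P(C = r | Q_1 = q_1, ..., Q_n = q_n) under the naive assumption."""
    prior, likelihood = model
    score = {}
    for c in prior:
        p = prior[c]
        for i, q in enumerate(row):
            p *= likelihood(c, i, q)  # product of P(Q_i = q_i | C = c)
        score[c] = p
    z = sum(score.values())  # normalizing constant
    return score.get("r", 0.0) / z


def classify(model, row):
    """Classify as runaway when the posterior is at least 0.5."""
    return "r" if posterior_runaway(model, row) >= 0.5 else "s"


def jackknife_accuracy(rows, labels):
    """Leave one project out, learn from the rest, and test on the one left out."""
    correct = 0
    for k in range(len(rows)):
        model = train(rows[:k] + rows[k + 1:], labels[:k] + labels[k + 1:])
        correct += classify(model, rows[k]) == labels[k]
    return correct / len(rows)


if __name__ == "__main__":
    model = train(ROWS, LABELS)
    print("P(runaway | 3,3,3) =", posterior_runaway(model, [3, 3, 3]))
    print("Jackknife accuracy =", jackknife_accuracy(ROWS, LABELS))
```

With 40 projects, `jackknife_accuracy` would run 40 train-and-test rounds, each learning from 39 projects, which corresponds to the procedure described for Table 1. Treating the answers as nominal values corresponds most closely to the plain "Naive-Bayes" variant; the normal-distribution and kernel variants would replace `likelihood` with a density estimate.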