An Empirical Approach to Predict Runaway Software Projects Using Bayesian Classification

Osamu Mizuno
Graduate School of Information Science and Technology, Osaka University
1-5 Yamadaoka, Suita, Osaka, Japan
o-mizuno@ist.osaka-u.ac.jp

Tohru Kikuno
Graduate School of Information Science and Technology, Osaka University
1-5 Yamadaoka, Suita, Osaka, Japan
kikuno@ist.osaka-u.ac.jp

ABSTRACT

This paper proposes an automated approach to identify and predict the runaway status of software development projects at an early stage of the development process. We previously proposed a questionnaire-based approach that uses logistic regression analysis to identify risk factors in software projects empirically. Although the risk factors were successfully identified from the questionnaire in that approach, the prediction of runaway projects still had several problems. To solve these problems and to construct an automated prediction system, we adopt Bayesian classification methods: projects are classified as runaway or successful by applying a Bayesian classifier to the responses to the questionnaire. Experiments using actual project data show that the proposed approach predicts 87.5% of the projects correctly.

Keywords: Software project prediction, Data-mining methods, Project management

1. INTRODUCTION

Software development projects are required to produce highly reliable systems within a short period and at low cost. These contradictory requirements make development projects risky. In fact, many studies have pointed out an increase in runaway projects in the software development field [6, 9].

Problems in software development must be detected and avoided as early as possible. Thus, detecting signs of problems at an early stage of a software project is important. If the detection of a problem is delayed, the problem becomes more difficult to fix, since the effort of coping with it increases exponentially [1].
Much research has been carried out on detecting signs of problems in software development projects [7, 16]. A case study of success prediction is reported in [11]. Interest in risk management is growing as a means of detecting such signs early in software projects.

The Software Risk Evaluation method (SRE) is a risk-management technique for software development projects [15]. In SRE, the risks of a project are identified using a taxonomy table of software risks. The risk taxonomy table is very useful for identifying the risks of a project systematically. However, since the table contains many risk attributes, extracting a risk takes time. Therefore, SRE recommends tailoring the taxonomy table for each project.

The risks of a software development project are influenced by its environment, such as the domain, the business style, the culture of the organization, and human characteristics. For projects in similar environments, the usual approach is to prevent the recurrence of problems by analyzing past problems. Such an approach is easily understood by the project members, since it is based on problems that actually occurred.

We have proposed an approach to characterize runaway software projects using logistic regression analysis [10, 13]. In that approach, the risk factors were identified successfully and the experimental prediction showed relatively good results. However, the approach relies on the manual selection of attribute variables, so the prediction was not automatic.

In order to construct an automated system to predict runaway software projects, we adopt Bayesian classification in the prediction system. Naive Bayesian classification is known to be optimal among supervised learning methods when the attribute values of an example are independent given the class of the example.
Since there are many extended versions of the Bayesian classifier, it is necessary to identify the most accurate one for practical use. In this paper, we compare the accuracy of different versions of the Bayesian classifier on actual project data provided by a certain company. With the most accurate Bayesian classifier that we found, the prediction system achieves an accuracy of 87.5% on the applied data.

2. AUTOMATED PREDICTION OF SOFTWARE PROJECTS

Figure 1: Overview of our approach: Preparation phase

Figure 2: Overview of our approach: Prediction phase

2.1 Overview

Figure 1 shows an overview of the preparation phase of our approach, in which a prediction model of runaway projects is built. In Step 1, we designed a questionnaire to be distributed to project managers and leaders in order to collect the assessment data. The questionnaire is based on our previous work and on various studies of software risk. The software engineering process group (SEPG) distributed the questionnaire to the project members and asked them to fill it out. Next, in Step 2, after the project members finished filling out the questionnaire, the SEPG collected the responses and stored them in the project database. At the same time, in Step 3, the SEPG determined the success variable and its value for each development project from the available project data.

In the prediction phase, shown in Figure 2, we predict runaway projects in the following steps. First, in Step 2, the project members fill out the questionnaire as in the preparation phase, and the SEPG collects the responses. Next, in Step 4, we apply a data-mining technique (in this study, a Bayesian classifier) to the project database and construct a prediction model to identify runaway projects. Finally, we apply the prediction model to the collected responses and obtain the prediction result, that is, whether a project is runaway or successful.

The proposed approach is now being implemented in our laboratory.
The implemented system will automate the collection of the questionnaire responses, the construction of the prediction model, and the feedback of the predicted results.

The questionnaire consists of 22 questions that ask about possible signs of problems in software development. Each question is answered on an ordinal scale [5]. The design of the questionnaire is described in Section 3.1.

In our previous research [10], we constructed a logistic model with five risk factors as its parameters. In this paper, we replace the modeling technique with a more generic one, the Bayesian classifier, since it is a more robust approach for applying to empirical data from various industries.

2.2 Naive Bayesian Classifier

Naive Bayesian classification is known to be optimal among supervised learning methods when the attribute values of an example are independent given the class of the example. Although this assumption is almost always violated in practice, recent research has shown that naive Bayesian learning remains effective in practice [3]. We briefly introduce several fundamental concepts.

2.2.1 Bayesian learning

Let $Q_1, Q_2, \ldots, Q_n$ be parameters with discrete values that are used to predict a discrete class $C$. For example, $Q_i$ denotes the response to the $i$-th question of the questionnaire and $C$ denotes the final status of a project: runaway ($r$) or successful ($s$). Suppose that the values $q_1, q_2, \ldots, q_n$ are given for these parameters. The optimal prediction is the class value $C = r$ for which

\[ P(C = r \mid Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n) \]

is maximum. By Bayes' theorem, this probability is expressed as follows:

\[ \frac{P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n \mid C = r)\, P(C = r)}{P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n)} \]

The probability $P(C = r)$ can be easily estimated from training data. Furthermore, $P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n)$ does not depend on the class variable $C$. Therefore, learning is reduced to the problem of estimating $P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n \mid C = r)$ from training data.
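As a small illustration of the two observations above, the following Python sketch estimates the prior $P(C = c)$ by relative frequency and recovers the class-independent denominator by summing the numerators over the classes. The function names and the likelihood values are hypothetical and serve only to exercise the formula; the label counts (13 runaway, 27 successful) are those reported in Section 3.2.

```python
from collections import Counter


def class_priors(labels):
    """Estimate P(C = c) as the relative frequency of each class label."""
    counts = Counter(labels)
    return {c: n / len(labels) for c, n in counts.items()}


def posterior(likelihoods, priors):
    """Apply Bayes' theorem for every class.

    likelihoods[c] stands for P(Q1=q1 and ... and Qn=qn | C=c).
    The denominator P(Q1=q1 and ... and Qn=qn) does not depend on the class,
    so it is obtained by summing the numerators over all classes.
    """
    numerators = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(numerators.values())
    return {c: v / evidence for c, v in numerators.items()}


# Class labels matching the counts in Section 3.2: 13 runaway (r), 27 successful (s).
labels = ["r"] * 13 + ["s"] * 27
priors = class_priors(labels)

# Hypothetical likelihood values, chosen only to exercise the formula.
example_likelihoods = {"r": 0.02, "s": 0.005}
post = posterior(example_likelihoods, priors)
best_class = max(post, key=post.get)  # class with the maximum posterior
```

The likelihood values passed in here are placeholders; how the class-conditional likelihood is estimated from data is the subject of the remainder of this section.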
Using Bayes' theorem again (in its product-rule form), the conditional probability $P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n \mid C = r)$ can be written as follows:

\[ P(Q_1 = q_1 \mid Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n, C = r)\, P(Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n \mid C = r) \]

The second factor of the above formula is expanded recursively in the same way:

\[ P(Q_2 = q_2 \mid Q_3 = q_3 \wedge \cdots \wedge Q_n = q_n, C = r)\, P(Q_3 = q_3 \wedge \cdots \wedge Q_n = q_n \mid C = r) \]

Here, we assume that the $Q_i$ ($1 \le i \le n$) are independent of each other given the class. In other words, we assume

\[ P(Q_1 = q_1 \mid Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n, C = r) = P(Q_1 = q_1 \mid C = r) \]

and so on. Then, $P(Q_1 = q_1 \wedge Q_2 = q_2 \wedge \cdots \wedge Q_n = q_n \mid C = r)$ equals

\[ P(Q_1 = q_1 \mid C = r)\, P(Q_2 = q_2 \mid C = r) \cdots P(Q_n = q_n \mid C = r). \]

Each $P(Q_i = q_i \mid C = r)$ ($1 \le i \le n$) can be estimated from training data. This completes the process of Bayesian learning.

2.3 Bayesian classification

Using the result of learning, we can classify and predict the value of $C$ when the values $q_i$ are given:

\[ P(C = r \mid Q_1 = q_1 \wedge \cdots \wedge Q_n = q_n) = \left( P(C = r) \prod_{i=1}^{n} P(Q_i = q_i \mid C = r) \right) / z \]

where $z$ is a normalizing constant. With this equation, we can calculate the conditional probability for any given parameter values $q_i$. From the calculated probability, we classify the data into one of the two classes, runaway or successful. Here, we decide that a project is in the class runaway if $P(C = r \mid Q_1 = q_1 \wedge \cdots \wedge Q_n = q_n) \ge 0.5$.

3. METRICS FOR PROJECT PREDICTION

3.1 Design of the Questionnaire

In this study, we investigated various works on risk management [2, 4, 8, 12] as well as the experience of a certain company. Based on the results of this investigation, we summarized all key risk factors and classified them into the following five viewpoints: (1) Requirements, (2) Estimations, (3) Planning capability, (4) Team organization, and (5) Project management activities. An overview of the questionnaire follows.

3.1.1 Requirements

The Requirements viewpoint includes factors related to the understanding of, and the commitment to, the requirements among the project members.
The factors for the Requirements viewpoint are as follows:

(1.1) Ambiguous requirements
(1.2) Insufficient explanation of the requirements
(1.3) Misunderstanding of the requirements
(1.4) Lack of commitment regarding requirements between the customer and the project members
(1.5) Frequent requirement changes

3.1.2 Estimations

The Estimations viewpoint includes factors related to the estimation itself, the technical methods for carrying out the estimation, and the commitment between project members and customers. The factors for the Estimations viewpoint are as follows:

(2.1) Insufficient awareness of the importance of the estimation
(2.2) Insufficient skills or knowledge of the estimation method
(2.3) Insufficient estimation for the implicit requirements
(2.4) Insufficient estimation for the technical issues
(2.5) Lack of stakeholders' commitment for estimation

3.1.3 Planning

The Planning viewpoint includes factors related to the planning or scheduling activity and the commitment to the project plan among project members. The factors for the Planning viewpoint are as follows:

(3.1) Lack of management review for the project plan
(3.2) Lack of assignment of responsibility
(3.3) Lack of breakdown of the work products
(3.4) Unspecified project review milestones
(3.5) Insufficient planning of project monitoring and controlling
(3.6) Lack of project members' commitment for the project plan

3.1.4 Team organization

The Team organization viewpoint includes factors related to the staffing of the projects and to the fundamental skills, experience, and morale of the project members. The factors for the Team organization viewpoint are as follows:

(4.1) Lack of skills and experience
(4.2) Insufficient allocation of resources
(4.3) Low morale

3.1.5 Project management activities

The Project management activities viewpoint includes factors related to the project management activities.
The factors for the Project management activities viewpoint are as follows:

(5.1) Project manager's lack of resource management throughout a project
(5.2) Inadequate project monitoring and controlling
(5.3) Lack of data needed to keep objective track of a project

3.2 Result of Questionnaire

The questionnaire was delivered to 40 development projects in a certain company. These projects are actual industrial developments of embedded systems, performed from 1996 to 1998. For more detail on the project data used in our study, please refer to [10, 13].

The SEPG distributed the questionnaire to the project managers or project leaders of the 40 target projects and explained the details of the questionnaire and the purpose of the trial. The SEPG collected the responses one month later. Table 1 shows the collected responses. In Table 1, the answers "Strongly agree", "Agree", "Neither agree nor disagree", and "Disagree" are shown as the characters 3, 2, 1, and 0, respectively.

Table 1: Projects used for our experiment

Since all of these projects had completed their development, the SEPG had already identified the runaway projects according to the decision process mentioned in Section 2. As a result, 13 of the 40 projects were classified as runaway projects. The column "Result" in Table 1 shows this actual classification.

4. EXPERIMENTAL PREDICTION

In this experiment, we apply the proposed runaway prediction approach to the questionnaire responses obtained from actual projects in a certain company. In doing so, we identify the most accurate Bayesian classification approach for the prediction of runaway projects.

4.1 Empirical Data for Experiment

The software development projects used in this study were for embedded software. As mentioned in Section 3.2, we collected 40 questionnaire responses from the 40 projects shown in Table 1.
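To make the classification rule of Section 2.3 concrete for responses coded 0 to 3 as in Table 1, the following Python sketch trains and applies a naive Bayesian classifier on nominal answers. It is a minimal illustration, not the implementation used in the experiment: the toy rows are hypothetical (they are not taken from Table 1), the add-alpha smoothing is an assumption introduced here to avoid zero probabilities, and the threshold of 0.5 is treated as inclusive.

```python
from collections import Counter

# Coding of answers as described for Table 1.
ANSWER_CODE = {
    "Strongly agree": 3,
    "Agree": 2,
    "Neither agree nor disagree": 1,
    "Disagree": 0,
}
LEVELS = sorted(ANSWER_CODE.values())


def train(rows, labels, alpha=1.0):
    """Estimate P(C) and P(Q_i = q | C) from coded responses.

    alpha is an add-alpha (Laplace) smoothing constant. It is an assumption of
    this sketch, used so that unseen answers do not yield zero probabilities.
    """
    class_sizes = Counter(labels)
    n_questions = len(rows[0])
    counts = {c: [Counter() for _ in range(n_questions)] for c in class_sizes}
    for row, c in zip(rows, labels):
        for i, q in enumerate(row):
            counts[c][i][q] += 1
    priors = {c: n / len(labels) for c, n in class_sizes.items()}

    def cond(c, i, q):
        # Smoothed estimate of P(Q_i = q | C = c).
        return (counts[c][i][q] + alpha) / (class_sizes[c] + alpha * len(LEVELS))

    return priors, cond


def posterior(model, row):
    """Return P(C = c | responses) for every class c."""
    priors, cond = model
    scores = {}
    for c, prior in priors.items():
        p = prior
        for i, q in enumerate(row):
            p *= cond(c, i, q)
        scores[c] = p
    z = sum(scores.values())  # normalizing constant
    return {c: s / z for c, s in scores.items()}


def classify(model, row, threshold=0.5):
    """Label a project runaway ('r') if P(C = r | responses) >= threshold."""
    return "r" if posterior(model, row)["r"] >= threshold else "s"


# Hypothetical toy data: three questions, six finished projects.
toy_rows = [(3, 3, 2), (2, 3, 3), (3, 2, 2), (0, 1, 0), (1, 0, 0), (0, 0, 1)]
toy_labels = ["r", "r", "r", "s", "s", "s"]
model = train(toy_rows, toy_labels)

new_project = tuple(
    ANSWER_CODE[a] for a in ("Strongly agree", "Agree", "Strongly agree")
)
print(classify(model, new_project))
```

With real data, each row would hold the 22 coded answers of one project, and the labels would come from the "Result" column of Table 1.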
4.2 Bayes-based Data-Mining Methods

Several data-mining methods based on Bayesian classification have been proposed to deal with various assumptions and environments. They are implemented in major data-mining tools. In this experiment, we adopted Weka [14] as the data-mining tool. We used the following methods, which are implemented in Weka and are applicable to our data:

1. Naive Bayesian classifier: Naive-Bayes
2. Naive Bayesian classifier with assumption of normal distribution: Naive-Bayes(ND)
3. Naive Bayesian classifier with kernel estimation: Naive-Bayes(KD)
4. Bayesian Network: Bayes-Net
5. Complement Naive Bayes: Complement-Naive-Bayes
6. Multinomial Naive Bayes: Naive-Bayes-Multinomial

Table 2: Result of Jackknife validation for each Bayes-based classification method

Simply put, Naive-Bayes is the simplest Bayesian classifier and can use only nominal variables. Naive-Bayes(ND) and Naive-Bayes(KD) are extended versions of Naive-Bayes that deal with continuous variables. Since Bayes-Net adopts a network structure to represent dependencies between attribute variables, it can represent more complex relationships. Complement-Naive-Bayes and Naive-Bayes-Multinomial are also extended versions of Naive-Bayes, designed to deal with uneven distributions in data sets. Since we had no prior knowledge of the accuracy of these methods, we performed the prediction procedure with all of them.

4.3 Jackknife Validation

In order to evaluate the accuracy of the prediction results, we perform cross validation on the collected data. Evaluation through k-fold cross validation is one of the most common methods in the machine learning community. The data set is split into k equally sized subsets, and in the i-th iteration ($i = 1, \ldots, k$) the i-th subset is used for testing the classifier that has been learned from all the remaining subsets. Notice that each instance of data is classified exactly once.
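The splitting procedure described above can be sketched in Python as follows. This is a minimal sketch rather than the Weka procedure used in the experiment: `k_fold_splits` deals the instances round-robin into folds (an arbitrary choice made here), and `fit_predict` is a hypothetical callback standing in for any of the classifiers of Section 4.2.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k-fold cross validation.

    Instances are dealt round-robin into k folds, so fold sizes differ by at
    most one and every index appears in exactly one test fold.
    """
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cross_val_accuracy(rows, labels, k, fit_predict):
    """Fraction of instances classified correctly over all k folds.

    fit_predict(train_rows, train_labels, test_rows) is a hypothetical callback
    that learns a classifier and returns one predicted label per test row.
    """
    correct = 0
    for train, test in k_fold_splits(len(rows), k):
        predicted = fit_predict(
            [rows[j] for j in train],
            [labels[j] for j in train],
            [rows[j] for j in test],
        )
        correct += sum(p == labels[j] for p, j in zip(predicted, test))
    return correct / len(rows)
```

Calling `k_fold_splits(40, 40)` yields 40 splits, each testing a single project against a model learned from the other 39.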
In particular, k-fold cross validation is called Jackknife validation when k is equal to the number of instances in the data set.

Let us show an example of Jackknife validation using the data in Table 1. First, we choose one project for testing, so that the remaining 39 projects are used for learning. The selected project is then classified by the learned prediction model. The same procedure is performed for each project in Table 1. Finally, we obtain 40 test results, from which we evaluate the overall accuracy of the prediction.

4.4 Result of Prediction

We applied Jackknife validation to the 6 Bayes-based data-mining methods and obtained the accuracy of each method. Table 2 shows the result of the experiments. In Table 2, we can see that Naive-Bayes(ND) has the highest accuracy of all (0.875); it predicts 35 of the 40 projects correctly. Naive-Bayes(KD) has the second highest accuracy (0.850).

It is remarkable that we can predict the status of projects with 87.5% accuracy. Moreover, most of the Bayes-based data-mining methods used in this experiment show high accuracy. This result implies that our decision to select the Bayesian classifier was relatively appropriate.

5. DISCUSSION

At this point, we have achieved relatively high accuracy in predicting runaway software projects using the Bayesian classifier. However, we have identified several problems, as follows:

1. We cannot offer any suggestions for improvements. Our proposed approach can only offer a probability of being runaway for each project. What developers really want to know is what should be done to avoid runaway status. In order to identify the most influential factors that make a project runaway, we will introduce methods to extract such factors. For example, we expect that logistic regression analysis or decision tree analysis may be useful for risk identification. Combining the prediction of projects with the identification of risk factors is one of the most interesting topics for future research.

2. The number of applied data is too small.
Since our research is in an experimental stage, we cannot collect much more data immediately. In particular, the preparation phase (see Figure 1) requires considerable effort and time. We are now contacting several companies and preparing to apply our approach. Conducting more experiments in other environments is part of our future work.

6. CONCLUSION

In this paper, we proposed an automated approach to predict runaway software projects using the Bayesian classification technique. We confirmed that the Bayesian classification model is applicable to the questionnaire response data and that the prediction can be done with high accuracy. Since the data used in this study consist of the final status of projects and the responses to a questionnaire about software risks, such data can be collected easily in actual development settings.

As future work, it is necessary to apply the approach to other data sets in order to obtain more general results. It is also important to extend our approach so that it provides suggestions for avoiding runaway projects.

Acknowledgment

The authors would like to thank Dr. Yasunari Takagi of OMRON Corporation, who gave us empirical data on their software development and invaluable advice on this research. This research was partially supported by the Ministry of Education, Science, Sports, and Culture in Japan, Grant-in-Aid for Scientific Research (C), Grant No. 15500022 (2003-2005).

7. REFERENCES

[1] B. W. Boehm. Industrial software metrics top 10 list. IEEE Software, 4(5):84-85, 1987.
[2] E. H. Conrow and P. S. Shishido. Implementing risk management on software intensive projects. IEEE Software, 14(3):83-89, 1997.
[3] P. Domingos and M. J. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103-130, 1997.
[4] R. Fairley and P. Rook. Risk management for software development. In Software Engineering, pages 387-400. IEEE CS Press, 1997.
[5] N. E. Fenton and S. L. Pfleeger.
Software Metrics: A Rigorous & Practical Approach. PWS Publishing, 1997.
[6] R. L. Glass. How not to prepare for a consulting assignment and other ugly consultancy truths. Communications of the ACM, 41(12):11-13, 1998.
[7] J. Jiang and G. Klein. Software development risks to project effectiveness. Journal of Systems and Software, 52:3-10, 2000.
[8] D. W. Karolak. Software Engineering Risk Management. IEEE CS Press, CA, 1996.
[9] S. McConnell. Rapid Development. Microsoft Press, 1996.
[10] O. Mizuno, T. Kikuno, Y. Takagi, and K. Sakamoto. Characterization of risky projects based on project managers' evaluation. In Proc. of 22nd International Conference on Software Engineering, pages 387-395, 2000.
[11] J. D. Procaccino, J. M. Verner, S. P. Overmyer, and M. E. Darter. Case study: factors for early prediction of software development success. Information and Software Technology, 44:53-62, 2002.
[12] F. J. Sisti and S. Joseph. Software risk evaluation method version 1.0. Technical Report CMU/SEI-94-TR-19, Software Engineering Institute, 1994.
[13] Y. Takagi, O. Mizuno, and T. Kikuno. An empirical approach to characterizing risky software projects based on logistic regression analysis. Empirical Software Engineering, (to appear).
[14] Weka Machine Learning Project. Weka 3: Data mining software in Java. http://www.cs.waikato.ac.nz/ml/weka/.
[15] R. C. Williams, G. J. Pandelios, and S. G. Behrens. Software risk evaluation (SRE) method description (version 2.0). Technical Report CMU/SEI-99-TR-029, Software Engineering Institute, 1999.
[16] C. Wohlin and A. A. Andrews. Prioritizing and assessing software project success factors and project characteristics using subjective data. Empirical Software Engineering, 8:285-303, 2003.