Finding the Right Data for Software Cost Modeling

Zhihao Chen, Tim Menzies, Dan Port, Barry Boehm
Center for Software Engineering, University of Southern California
Computer Science, Portland State University
Computer Science, University of Hawaii
zhihaoch@cse.usc.edu; tim@timmenzies.net; dport@hawaii.edu; boehm@cse.usc.edu

ABSTRACT

Strange to say, when building a software cost model, it is sometimes useful to ignore much of the available cost data.

Introduction

Good software cost models can significantly help the managers of software projects. With good models, project stakeholders can make informed decisions about how to manage resources; how to control and plan the project; and how to deliver the project on time and on budget. Off-the-shelf ``untuned'' models have been up to 600% inaccurate in their estimates. Hence, the wise manager uses a cost model built from local data.

But what data should we use to build a good cost model? Real-world data sets, such as those that come from software engineering projects, often contain noisy, irrelevant, or redundant variables. Before the data is collected, it is hard to know which parts of it will matter most. However, once a database is available, automatic tools can be used to prune the data back to the most important values.

Therefore, this paper proposes a change to current practice. Often, cost models are built using all available data. Here, we propose that after data collection and before model building, cost modelers should perform some data pruning experiments. As shown below, such pruning experiments are simple and fast to perform.

Our process starts with a table of historical data divided into columns and rows. Each column is a different variable describing some aspect of a software project. Each row shows data from a different software sub-system, so one project can contribute many rows. For example, such a table might record 17 variables and rows from three sub-systems drawn from two projects. The data in such a table can be pruned by removing columns or removing rows:

In row pruning (also known as stratification), rows from related projects are collected together and different cost models are learned from these different subsets.

In column pruning (also known as feature subset selection), the columns are sorted left-to-right according to their ``usefulness''; i.e. how well that column's variable predicts the target variable (in our case, software development effort). Column pruning then proceeds left to right across the sorted columns, each time removing some less-useful left-hand-side columns. At each step of the pruning, a cost model is learned from the remaining columns. A small sketch of this pruning loop is given below.

The benefits of row pruning have been reported previously. For example, Shepperd and Schofield report experiments with row pruning where estimator performance improved by up to 28% (measured in terms of the ``PRED'' measure described later in this article). However, and this is the point of this paper, we find that much larger improvements result from pruning both rows and columns. For example, the experiments presented here include one data set where estimator performance improved from 15% to 97%. Further, these large improvements were seen when most of the columns were pruned away. For example, in our experiments, column pruning removed 65% of all columns (on average). Surprisingly, when building a software cost model, it is usually useful to ignore over half of the available cost data.
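To make the column pruning loop concrete, here is a minimal sketch in Python using scikit-learn and a PRED(30)-style score. It is not the authors' toolkit (which, as described later, drives WEKA's WRAPPER and linear regression from UNIX scripts, works on log-transformed COCOMO data, and repeats every split many times); the names pred_at_n and prune_columns, and the random demonstration data, are our own.

    # Minimal sketch of the left-to-right column pruning loop described above.
    # Illustrative re-implementation only, not the authors' WEKA/UNIX toolkit.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    def pred_at_n(actual, estimated, n=30):
        """Fraction of estimates within n% of the actual values."""
        relative_error = np.abs(actual - estimated) / actual
        return np.mean(relative_error <= n / 100.0)

    def prune_columns(X, y, ranked_columns, seed=0):
        """Drop the least useful column at each step; keep the best model seen.

        `ranked_columns` lists column indices sorted from least to most useful
        (e.g. by how often a wrapper-style search selected each column).
        For brevity this regresses on raw effort; the study proper uses a
        log-transformed COCOMO model and converts back before scoring."""
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.33, random_state=seed)
        best_score, best_columns = -1.0, list(ranked_columns)
        kept = list(ranked_columns)
        while kept:
            model = LinearRegression().fit(X_train[:, kept], y_train)
            score = pred_at_n(y_test, model.predict(X_test[:, kept]))
            if score > best_score:
                best_score, best_columns = score, list(kept)
            kept = kept[1:]   # discard the current least-useful column
        return best_columns, best_score

    # Example usage on random data (purely illustrative):
    rng = np.random.default_rng(1)
    X = rng.random((60, 15))
    y = 2.0 + X @ rng.random(15)
    cols_by_usefulness = list(range(15))   # pretend ranking: column 0 least useful
    print(prune_columns(X, y, cols_by_usefulness))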
More importantly, row and column pruning leads to the largest improvements in estimator performance on the smallest training sets (less than 30 examples). This is a result of tremendous practical significance. Modern software practices change so rapidly that most organizations cannot access large databases of relevant project data. Our results suggest that this is not necessarily a problem, provided models are learned via row and column pruning.

The rest of this paper describes experiments with our tool for column pruning. This tool is a set of UNIX scripts that use the WRAPPER variable subtraction algorithm from the public-domain WEKA data mining toolkit. The tool is fully automated, runs on a standard LINUX installation, and is available from the authors.

Background

This paper applies column pruning methods to cost estimation. These pruning methods were developed by the data mining community. While such methods have been explored extensively, there is very little prior work on column pruning for software cost modeling. To the best of our knowledge, the only other work on column pruning for cost estimation was a limited experiment by Kirsopp and Shepperd. Like this study, they found that column pruning significantly improves effort estimation. However, their experimental base was much more restrictive than this study's (they ran only two data sets while we ran 15). Also, unlike our work, their experiment is not reproducible: the Kirsopp and Shepperd data sets are not public domain, while all our COCOMO-I data sets can be downloaded from the PROMISE repository.

An earlier draft of this paper appeared previously in a workshop publication. This paper extends that earlier draft in two ways. Firstly, it explores more data than before (double the number of COCOMO-I projects analyzed, plus two new COCOMO-II data sets). Secondly, it includes an expanded discussion of the business implications of column pruning.

Pruning: Why?

The case for pruning rows is quite simple. Software projects are not all the same. For example, real-time safety-critical systems are very different from batch financial processors. Given a database of different kinds of software, it is just good sense to divide the rows into different project types and learn a different cost model for each type. Then, in the future, managers can use different cost models depending on what type of software they are developing.

The case for pruning columns is slightly more complex. If a learned cost model uses all the variables in the database, then the only way to use that cost model on future sub-systems is to collect information on all those variables. In many business situations, the cost of reaching some goal is a function of how much data you have to collect or monitor along the way. But if the learned model uses only some of the variables, then using that model in the future means collecting less data. This would be useful in several scenarios. For example, when monitoring an out-sourced project at a remote site, it is useful to minimize the reporting requirements to just the variables that matter the most. Such a reporting structure reduces the overhead of managing a contract.

From a technical perspective, there are also good reasons to subtract variables:

Under-sampling: The number of possible influences on a project is quite large and, usually, the historical data sets on projects held by a particular company are quite small. Hence, a variable that is theoretically useful may be practically useless.
For example, in the na60 data set, nearly all of the NASA projects were rated as having high complexity. Therefore, this data set would not support conclusions about the interaction of, say, extra-high-complexity projects with other variables. A learner would be wise to subtract this variable (and a cost modeling analyst would be wise to suggest to their NASA clients that they refine the local definition of ``complexity'').

Reducing variance: Miller offers an extensive survey of column pruning for linear models and regression. That survey includes a very strong argument for column pruning: the variance of a linear model learned by minimizing least squares error decreases as the number of columns in the model decreases. That is, the fewer the columns, the more restrained are the model predictions.

Irrelevancy: Sometimes, modelers are incorrect in their beliefs about which variables affect some outcome. In this case, they might add irrelevant variables to a database. Without column pruning, a cost model learned from that database might contain these irrelevant variables. Anyone trying to use that model in the future would then be forced into excessive data collection.

Noise: Learning a cost estimation model is easier when the learner does not have to struggle with fitting the model to confusing noisy data (i.e. when the data contains spurious signals not associated with variations in projects). Noise can come from many sources, such as clerical errors or missing data. For example, organizations that only build word processors may have little data on software projects with high reliability requirements.

Correlated variables: If multiple variables are tightly correlated, then using all of them will diminish the likelihood that any one of them attains significance. A repeated result in data mining is that pruning away some of the correlated variables increases the effectiveness of the learned model (the reasons for this are subtle and vary according to which particular learner is being used).

Pruning: How?

The column pruning method used in this study is called the ``WRAPPER''. The WRAPPER selects a combination of columns and asks some learner to build a cost model using just those columns. The WRAPPER then grows the selected columns and checks if a better model comes from learning over the larger set of columns. The WRAPPER stops when there are no more columns to select, or when there has been no significant improvement in the learned model for the last five additions (in which case, those last five additions are deleted). Technically speaking, this is a forward-select search with a ``stale'' parameter set to 5.

The WRAPPER is thorough but, theoretically, it is quite slow since (in the worst case) it has to explore all subsets of the available columns. However, all the data sets in this study are quite small, and our experiments only required around 20 minutes per data set.

We use the WRAPPER since experiments by other researchers strongly suggest that it is superior to many other column pruning methods. For example, Hall and Holmes compare the WRAPPER to several other column pruning methods, including principal component analysis (PCA, a widely used technique). Column pruning methods can be grouped according to (a) whether or not they make special use of the target attribute in the data set (such as ``development cost''), and (b) whether or not they use the target learner as part of their pruning. PCA is unique since it does not make special use of the target attribute.
The WRAPPER is also unique, but for a different reason: unlike other pruning methods, it does use the target learner as part of its analysis. Hall and Holmes found that PCA was one of the worst performing methods (perhaps because it ignores the target attribute) while the WRAPPER was the best (since it can exploit its special knowledge of the target learner).

Cost Modeling with COCOMO

Before pruning data, we first need to understand how cost models might use that data. This study uses COCOMO for our cost modeling. COCOMO stands for Constructive Cost Model. COCOMO helps software developers reason about the cost and schedule implications of their software decisions, such as software investment decisions; setting project budgets and schedules; negotiating cost, schedule, and performance trade-offs; making software risk management decisions; and making software improvement decisions. One advantage of COCOMO (and this is why we use it) is that, unlike many other costing models such as SLIM or SEER, COCOMO is an open model with much published data.

There are two versions of COCOMO: COCOMO-I and COCOMO-II. In going from the 1981 COCOMO-I model to the 2000 COCOMO-II model, one parameter, ``Turnaround Time'', was dropped to reflect the almost-universal use of interactive software development. Another parameter, ``Modern Programming Practices'', was dropped in favor of a more general ``Process Maturity'' parameter. But several more parameters were added to reflect the subsequently-experienced influences of such factors as ``Development for Reuse'', ``Multisite Development'', ``Architecture and Risk Resolution'', and ``Team Cohesion''. The COCOMO-II book also provides capabilities and guidelines for an organization to add new parameters, reflecting their particular situations.

COCOMO measures effort in calendar months, where one month is 152 hours (and includes development and management hours). The core intuition behind COCOMO-based estimation is that as systems grow in size, the effort required to create them grows exponentially. More specifically, the COCOMO-I model is:

    effort = a * KSLOC^b * (EM_1 * EM_2 * ... * EM_15)

Here, EM_i is one of 15 effort multipliers such as cplx (complexity) or pcap (programmer capability). The COCOMO-I effort multipliers and their numeric values are tabulated in the COCOMO-I text. In COCOMO-II, the number of effort multipliers changed from 15 to 17. In the COCOMO-I model, a and b are domain-specific variables, and KSLOC (thousands of lines of non-commented source code) is estimated directly or computed from a function point analysis. In COCOMO-II, the exponent b was expanded to include scale factors:

    effort = a * KSLOC^(b + 0.01 * (SF_1 + ... + SF_5)) * (EM_1 * EM_2 * ... * EM_17)

where SF_j is one of five scale factors that exponentially influence effort. Examples of scale factors include pmat (process maturity) and resl (attempts to resolve project risks).

A standard method for assessing COCOMO performance is PRED(30). PRED(30) is calculated from the relative error, or RE, which is the relative size of the difference between the actual and estimated values:

    RE_i = (estimated_i - actual_i) / actual_i

The mean magnitude of the relative error, or MMRE, is the average percentage of the absolute values of the relative errors over an entire data set. PRED(N) reports the average percentage of estimates that were within N% of the actual values. If a data set has T rows, then:

    MMRE = (100/T) * sum_{i=1..T} |RE_i|
    PRED(N) = (100/T) * sum_{i=1..T} (1 if |RE_i| <= N/100, else 0)

For example, PRED(30)=50% means that half the estimates are within 30% of the actual values. A small computational sketch of these measures is given below.

Note that we report results in terms of PRED(N), not MMRE. This is a pragmatic decision: we have found PRED(N) easier to explain to business users than MMRE.
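For concreteness, here is a minimal Python sketch of the two measures just defined. The function names mmre and pred are our own, not part of any COCOMO toolkit, and the example numbers are invented.

    # Minimal sketch of the MMRE and PRED(N) measures defined above.
    import numpy as np

    def mmre(actual, estimated):
        """Mean magnitude of relative error, as a percentage."""
        actual, estimated = np.asarray(actual, float), np.asarray(estimated, float)
        return 100.0 * np.mean(np.abs((estimated - actual) / actual))

    def pred(actual, estimated, n=30):
        """Percentage of estimates within n% of the actual values."""
        actual, estimated = np.asarray(actual, float), np.asarray(estimated, float)
        re = np.abs((estimated - actual) / actual)
        return 100.0 * np.mean(re <= n / 100.0)

    # Invented example: two of the four estimates are within 30% of the
    # actuals, so PRED(30) = 50; MMRE = 60.
    actual    = [100.0, 200.0, 50.0, 400.0]
    estimated = [110.0, 150.0, 90.0, 900.0]
    print(mmre(actual, estimated), pred(actual, estimated, 30))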
Also, there are more PRED(N) results reported in the literature than MMRE results. This is perhaps due to the influence of the COCOMO researchers, who reported their 1999 landmark study using PRED(N). Further, we report PRED(30) results here since the major experiments of that 1999 study also used PRED(30). The results for PRED(25) are similar to those for PRED(30), except that the accuracies for PRED(30) are a little better than those for PRED(25).

In order to use linear least squares regression, which is the simplest and most widely used modeling method, it is common to transform the COCOMO model into a linear model by taking logarithms of the effort equation above:

    log(effort) = log(a) + b * log(KSLOC) + log(EM_1) + ... + log(EM_15)

Note that if this log-linear form is used, then before computing PRED(N) the estimated effort has to be converted back from its logarithm.

Case Study Data

This study uses data sets in both COCOMO-I and COCOMO-II format. The cii0 data set was used to build the COCOMO-II model. The cii4 data set is also in the COCOMO-II format and includes the 72 projects from cii0 developed after 1990, plus 47 new projects. The COCOMO-II data is not published since it was collected under conditions of confidentiality with the companies supplying the data; further research on this data must be conducted under the same conditions.

In contrast, several COCOMO-I data sets are available in the PROMISE repository:

coci comes from the COCOMO-I text and includes data from a variety of domains including engineering, science, financial applications, etc.

na60 comes from 20 years of NASA projects and is recorded using the COCOMO-I variables. The na60 data contains the following subsets: c01, c02, c03 store data from three different NASA geographical locations; p02, p03, p04 store data from three different NASA projects; t02, t03 store data from two types of task, such as ground data receiving and flight guidance. For reasons of confidentiality, the exact details of those centers, tasks, and projects cannot be disclosed. The other centers, projects, and tasks from na60 were not included in the cX, pX, or tX subsets for a variety of pragmatic reasons (e.g. suspicious repeated entries suggesting data entry errors, too few examples for generalization, etc.).

Of these data sets, coci describes projects from before 1982; cii4 contains data from the most recent projects; and the NASA data sets (na60, pX, cX, tX) describe projects newer than coci and older than cii4. Also, the COCOMO-I data sets have the 15 effort multiplier columns described above, while the COCOMO-II data sets (cii0 and cii4) have 24 columns. For experimental purposes, we group the above as follows: call combines all the center data (c01, c02, c03); tall combines all the task data (t02, t03); pall combines all the project data (p02, p03, p04).

Experimental Method

Having described COCOMO, the WRAPPER, and our data sets, we can now describe how an analyst can use them all together to find better cost models. Column pruning using the WRAPPER was discussed above. Recall that the columns are sorted left-to-right according to how well each column's variable predicts the target variable (in our case, software development effort). Column pruning then proceeds left to right across the sorted columns, each time removing some less-useful left-hand-side columns. At each step of the pruning, a cost model is learned from the remaining columns. Column pruning stops when removing more columns does not improve the best result seen so far.

To find the relative value of each column, we ran the WRAPPER ten times (each time using a randomly selected 90% of the rows). The value of a column was then set to the number of times the WRAPPER selected that column (a sketch of this ranking step is given below).
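The following is an illustrative Python/scikit-learn sketch of this ranking step under our own assumptions (a greedy forward search scored by cross-validated linear regression, with a stale counter of five, run ten times on 90% subsamples). It is not the authors' WEKA-based toolkit; the names wrapper_forward_select and rank_columns are hypothetical, and X and y are assumed to be NumPy arrays of project variables and effort.

    # Sketch of the column-ranking step: a wrapper-style forward selection is
    # run ten times on random 90% subsamples of the rows; a column's value is
    # the number of runs in which it was selected.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    def wrapper_forward_select(X, y, stale=5):
        """Greedily add the column that most improves a linear model's
        cross-validated fit; stop after `stale` non-improving additions,
        which are then discarded (as described in the text)."""
        remaining, selected = list(range(X.shape[1])), []
        best_score, best_selected, since_improved = -np.inf, [], 0
        while remaining and since_improved < stale:
            # Try each remaining column and keep the best single addition.
            scores = [(np.mean(cross_val_score(LinearRegression(),
                                               X[:, selected + [c]], y, cv=3)), c)
                      for c in remaining]
            score, col = max(scores)
            selected.append(col)
            remaining.remove(col)
            if score > best_score:
                best_score, best_selected, since_improved = score, list(selected), 0
            else:
                since_improved += 1      # a "stale" (non-improving) addition
        return best_selected

    def rank_columns(X, y, repeats=10, sample=0.9, seed=0):
        """Count how often each column is selected across subsampled runs."""
        rng = np.random.default_rng(seed)
        counts = np.zeros(X.shape[1], int)
        for _ in range(repeats):
            rows = rng.choice(len(X), size=int(sample * len(X)), replace=False)
            for col in wrapper_forward_select(X[rows], y[rows]):
                counts[col] += 1
        return counts                    # higher count = more useful column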
Once the WRAPPER had ordered the columns, we randomized the order of the rows and started column pruning. To ensure statistical validity, the randomization (followed by column pruning) was repeated 30 times. For each repeat, at each stage of the pruning, the lowest-value column was removed. The rows in the remaining columns were then divided into training and test sets (each time using a randomly selected 67% of the rows for the training set). A cost model was learned (using linear regression) from the training set and then assessed, using PRED, on the test set. Once the 30 repeats were completed, a best model was selected by looking at the mean and standard deviation of model performance at each pruning step. The best model was the one that t-tests confirmed out-performed all the other stages of the column pruning. The mean value of that model over the 30 repeats was then reported.

The above process is fully automated using our own UNIX scripts, which control a Java data mining library called WEKA. WEKA comes complete with a linear regression learner and an implementation of the WRAPPER. The whole system is available from the authors.

Results

The results of our column pruning are shown in our column-pruning figure. The red and green lines of that figure show the number of columns in our data sets before and after pruning; the data sets are on the x-axis. The values are 22 for the COCOMO-II data sets (cii0 and cii4), while the values for the COCOMO-I data sets are all 15. The blue line shows the percentage of the columns removed by pruning. For example, very little was pruned from some data sets, while most of the columns were pruned from others. On average, over 65% of the columns were pruned. Sometimes, the pruning was quite heavy, with over 80% of the columns pruned away.

A concern with such large-scale pruning is that the resulting models would be somehow sub-standard. This proved not to be the case. The PRED(30) results associated with the pruned data sets are shown in our PRED(30) results figure. This figure shows mean values over 30 experiments where the learned model was tested on rows not seen during training. Note that pruning always improved PRED. The red lines of that figure show the mean PRED(30) seen in 30 trials using all the columns. These are the baseline results for learning cost models before applying any pruning method. The green lines show the best mean PRED(30) seen after automatic column pruning. The difference between the red and the green lines is the improvement produced by pruning. The data sets are sorted by pruning method into three plots. Within each plot, the data sets are sorted left-to-right in increasing order. The x-axis shows the names of the data sets used in these studies. The blue lines show the number of rows in each data set.

The plots of the PRED(30) results have three labels: ``pruning just columns'', ``pruning columns and some rows'', and ``pruning columns and many rows''. The data sets are sorted into these three groups according to their stratification. The left-hand-side results, labeled ``pruning just columns'', come from the largest data sets, those that combine project information from many sources. These data sets are not divided up into data from similar sources; hence, no row pruning was used on them. The right-hand-side results, labeled ``pruning columns and many rows'', come from eight data sets that have been heavily stratified into specific NASA centers (i.e. c01, c02, c03); or specific NASA projects (i.e. p02, p03, p04); or specific NASA software tasks (i.e. t02, t03).
The middle results, labeled ``pruning columns and some rows'', show data sets that have been somewhat stratified. The data sets in this group combine data from either all the NASA centers (i.e. call); or all the NASA projects (i.e. pall); or all the NASA tasks (i.e. tall). This middle group samples a point half-way between the unstratified data sets on the left and the heavily stratified data sets on the right.

There are three notable features of these results:

Pruning always improved estimation effectiveness. That is, in all our case studies, it was always useful to ignore a portion of the available data.

Column pruning by itself can offer some improvements to PRED. However, column pruning combined with row pruning can result in dramatic improvements in effort estimation.

With one exception, the general trend across the three graphs is clear: as data set size shrinks, the improvement increases. That is, pruning is most important when dealing with small data sets.

When Not to Prune

While column pruning is clearly useful, sometimes it cannot or should not be applied. Firstly, our variable subtraction methods require a historical database of projects. If there is no such database, then our column pruning techniques won't work.

Secondly, even if a historical database exists and our techniques suggest pruning some variable, it may still be important to ignore that advice. If a cost model ignores certain effects that business users believe are important, then those users may not trust that model. In that case, even if a variable has no noticeable impact on predictions, it should be left in the model. By leaving such variables in a model, we acknowledge that, in many domains, expert business users hold in their heads more knowledge than what may be available in historical databases. Suppose that there is some rarely occurring combination of factors which leads to a major productivity improvement. Even if there is little data on that situation, it still should be represented in the model. For example, even though some studies have shown that reduced-parameter function point counting rules are equally good in most situations, COCOMO-II supports the full International Function Point Users' Group (IFPUG) set of parameters due to their wide usage and acceptance in the IFPUG community.

Thirdly, another reason not to prune variables is that you might still need them. For example, the experiments shown above often subtract over half of the attributes in a COCOMO-I model while (usually) improving effort estimation. However, suppose that a business decision has to be made using some of the pruned variables. The reduced model has no information on those subtracted variables, so a business user would have to resort to other sources of information to make their decision.

Hence, we propose using column pruning with some care. If there is no historical data from which to learn specialized models, then managers should use the general background knowledge within COCOMO. The 1981 regression coefficients of COCOMO-I, or the updated coefficients of COCOMO-II, are the best general-purpose indicators we can currently offer for cost estimation. Managers can use that public knowledge to make software process decisions. For example, according to the coefficients on the COCOMO-II pmat variable, one can estimate the increase in cost between a CMM3 and a CMM4 project of a given size (a sketch of such a calculation is given below).
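As an illustration of the kind of calculation intended here, the sketch below computes the effort ratio implied by the pmat scale factor for a hypothetical 100 KSLOC project. The numeric values are the COCOMO II.2000 pmat calibration as we recall it, the mapping of CMM levels 3 and 4 to the High and Very High pmat ratings is an assumption, and the constant terms and effort multipliers cancel in the ratio; check the COCOMO-II book before relying on these numbers.

    # Illustrative sketch only: effort ratio implied by the COCOMO-II pmat
    # (process maturity) scale factor for a hypothetical 100 KSLOC project.
    # Assumed values (recalled, unverified): pmat High = 3.12, Very High = 1.56;
    # assumed mapping: CMM3 -> High, CMM4 -> Very High.
    ksloc = 100.0            # hypothetical project size
    pmat_cmm3 = 3.12         # assumed scale-factor value at CMM level 3
    pmat_cmm4 = 1.56         # assumed scale-factor value at CMM level 4

    # effort ~ a * KSLOC**(b + 0.01 * sum(scale factors)) * (effort multipliers)
    # Holding everything else fixed, a, b, and the multipliers cancel, leaving:
    ratio = ksloc ** (0.01 * (pmat_cmm3 - pmat_cmm4))
    print(f"CMM3 project costs about {100 * (ratio - 1):.1f}% more than CMM4")
    # -> roughly a 7% increase for this hypothetical project, under the
    #    assumptions above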
With this estimate in hand, a business user could then make their own assessment about the cost of increased software process maturity versus the benefits of that increase.

If historical data from the local site is available, then managers could tune the general COCOMO background knowledge by adjusting the coefficients within the COCOMO equations. COCOMO-I and COCOMO-II contain several local calibration variables that can quickly tune a model to local project data. Our experience has been that 10 to 20 projects are adequate to achieve such tunings. Local calibration is a simple tuning method that is supported by many tools; see, e.g., http://sunset.usc.edu/available_tools/index.html. Currently, our toolkit requires more effort (i.e. some UNIX scripting) than local calibration. However, our PRED(30) results suggest that the extra effort may well be worthwhile, particularly when building models from a handful of projects. Also, we have found that it is easier to extrapolate costs from old projects to new projects when using reduced variable sets. Nevertheless, column pruning is not appropriate when there are business reasons to use all available variables (e.g. the three reasons described above).

Conclusion

The specific goal of this paper was to encourage more column pruning in cost modeling, particularly when dealing with very small data sets. The improvements seen in our results seem quite impressive. Row and column pruning, in combination, are very useful for cost modeling, particularly when dealing with small data sets. Hence, we propose a change to current practice. Cost models should not be built using all available data. Rather, after data collection and before model building, cost modelers should perform some data pruning experiments.

The more general goal of our work is to encourage repeatable, refutable, and improvable experiments in software engineering. To that end, as much as possible, we use public domain tools and public domain data sets. Hence, this paper uses an open source cost model (COCOMO) and, as much as possible, publicly available data. All the COCOMO-I data sets used in this study can be downloaded from the PROMISE repository. We urge other researchers to produce more results based on open source models and data sets.

Acknowledgments

The advice of the anonymous reviewers helped to clarify an earlier draft of this paper. Helen Burgess offered invaluable editorial assistance.

Biographies

Zhihao Chen is a Research Assistant at the Center for Software Engineering and a PhD candidate in the Computer Science Department at the University of Southern California. His research interests lie in system and software engineering, and in model development and integration in general. In particular, he focuses on software cost estimation, product line investment models, process modeling, and risk management for software application development. He also investigates empirically-based software engineering: empirical methods and model integration. His research supports the generation of an empirically-based software development process covering high-level lifecycle models down to low-level techniques; provides validated guidelines for selecting techniques and models; and helps people better understand such issues as which variables affect cost, reliability, and schedule, and how to integrate existing data and models from participants and collaborators. Previously, he received his bachelor's and master's degrees in computer science from the South China University of Technology.
He previously worked for HP, CA, and EMC. He can be reached at zhihaoch@usc.edu.

Dr. Tim Menzies is an associate research professor at Portland State University in the United States, and has been working with NASA on software quality issues since 1998. He has a CS degree and a PhD from the University of New South Wales. His recent research concerns modeling and learning, with a particular focus on lightweight modeling methods. His doctoral research aimed at improving the validation of possibly inconsistent knowledge-based systems in the QMOD specification language. He has also worked as an object-oriented consultant in industry, has authored over 150 publications, and has served on numerous conference and workshop programs as well as serving as guest editor of journal special issues. He can be reached at tim@timmenzies.net.

Dr. Daniel Port is an Assistant Professor of IT Management at the University of Hawaii at Manoa. Prior to this, he was a Research Assistant Professor working with Barry Boehm at USC's Center for Software Engineering, where he now holds the title of Visiting Scholar. Dr. Port has been involved in software development process research, and in the development and assessment of innovative pedagogic techniques for software engineering education. His primary research activities lie in strategic and economics-based software engineering. He has applied the strategic method to COTS assessment and COTS process selection, IV&V, architecture flexibility, software dependability, and IT security risk management, with collaborators from NASA, JPL, and the Japan Aerospace Exploration Agency (JAXA). Dr. Port is the co-founder, with Dr. Rick Kazman, of the proposed new Center for Strategic Software Engineering at the University of Hawaii. He can be reached at dport@hawaii.edu.

Professor Barry Boehm received his B.A. degree from Harvard in 1957, and his M.S. and Ph.D. degrees from UCLA in 1961 and 1964, all in Mathematics. Between 1989 and 1992, he served within the U.S. Department of Defense (DoD) as Director of the DARPA Information Science and Technology Office, and as Director of the DDR&E Software and Computer Technology Office. He worked at TRW from 1973 to 1989, culminating as Chief Scientist of the Defense Systems Group, and at the Rand Corporation from 1959 to 1973, culminating as Head of the Information Sciences Department. He was a Programmer-Analyst at General Dynamics between 1955 and 1959. His current research interests include software process modeling, software requirements engineering, software architectures, software metrics and cost models, software engineering environments, and knowledge-based software engineering. His honors and awards include Guest Lecturer of the USSR Academy of Sciences (1970), the AIAA Information Systems Award (1979), the J.D. Warnier Prize for Excellence in Information Sciences (1984), the ISPA Freiman Award for Parametric Analysis (1988), the NSIA Grace Murray Hopper Award (1989), the Office of the Secretary of Defense Award for Excellence (1992), the ASQC Lifetime Achievement Award (1994), and the ACM Distinguished Research Award in Software Engineering (1997). He is an AIAA Fellow, an ACM Fellow, an IEEE Fellow, and a member of the National Academy of Engineering. He can be reached at boehm@sunset.usc.edu.