Thank you to the editor and reviewers for their careful commentary on this paper. Using that feedback, we think we have created a much better second draft. Much has changed since the last version, and we believe this new draft addresses the previously stated concerns.

We write to ask one favor. While this new draft needs to be reviewed at an appropriate and considered pace, on May 9 we stand up before an international workshop on repeatable SE experiments to discuss methodological issues. For us, it would be a proud moment indeed if on May 9 we could:

- announce at that 2008 workshop that we had made significant progress on a TSE publication based on a repeatable SE experiment;
- where that experiment was conducted by researchers who first met at that workshop in 2007.

So, if possible, and if it is not an imposition, and if it fits with your other scheduled tasks, would it be possible to have feedback on this draft before May 9?

Best wishes,
Tim Menzies, Burak Turhan, Ayse Bener

Editor: 1

>Comments to the Author:
>Both reviewers suggest that the second experiment is more relevant,
>possibly because the first experiment produced a negative result
>indicating that cross-company (CC) prediction is not significantly
>better than a random sample. Such negative result, while
>interesting, is not particularly surprising given the lack of publications
>involving CC defect predictions.

[1]. This was a rhetorical error on our part; we completely agree with this comment. What we should have said in the paper is that the conditions under which CC prediction is useful are so extreme that, in practice, we just should not do it. Therefore the discussion in Section VI.d is now much more negative.

>Even more significantly, both reviewers question the CC nature of the
>work. The basic reason CC models
>tend to fail is that companies work in different domains, have
>different process, and define/measure defects and other aspects of
>their product and process in different ways.

[2]. Excellent point. With your permission, we would like to use the above ideas in the new introduction (paragraph 2, Section I).

>Given that projects
>came from NASA, it is not clear to what extent the domain, process,
>and measurement varied among the projects throwing doubt on the
>premise that the experiment was performed on CC data.

[3]. Because of your reviews, we searched for more data. We found some: three new data sets, in fact. The new data comes not from NASA aerospace applications but from a Turkish company that writes software for domestic home appliances. Nervously, we repeated our experiments on this data, and we are happy to report that the WC vs. CC patterns we found in American NASA rockets also hold for Turkish washing machines. While this is not definitive proof of the universal external validity of our results, we find it a very compelling result, suggesting that the experiments of this paper apply to a wide range of software from around the world.

>The other major concern was the lack of comparison to a very rich
>body of research in defect prediction area. Some of the more
>effective prediction techniques rely on project history (see, e.g.,
>[1]). Clearly, projects with nontrivial history are less likely to
>benefit from CC models, while contractor relationships in NASA case
>may not be obviously subject to historic models without having
>several projects for a contractor.
>Therefore, the lack of value of
>CC techniques in a case where they are arguably the most likely to
>succeed is a concern. Lack of comparisons to a context where
>the defect prediction does appear to add value is disappointing.

[4]. We have added extensive notes on related work (see Section II.B), including a note on paper [1]; see the top of column 2, page 3.

[5]. We have also added notes on the above paragraph to the end of Section IV. See also point [10], below.

>[1] T. L. Graves, A. F. Karr, J. S. Marron, and H. Siy. Predicting fault
>incidence using software change history. IEEE Transactions on
>Software Engineering, 26(2), 2000.

>=======================================
>
>REVIEWS
>
>Reviewer: 1
>
>Major issues:
>
>I'm not convinced of the usefulness of the results of either experiment. Given that most projects have more than 90% defect-free modules, a sizeable number of false positives will need to be tested.
>From experiment, the pf median is at 52% for CC which means over
>45% of modules with no defects (assuming 90% defect-free) will need
>to be tested. This seems to be no better than picking a module at
>random. The assertion that this may be applicable for mission-critical
>systems is doubtful because organizations building mission-critical
>systems are: (a) very likely following different and highly specialized
>development processes, (b) developing more complex systems, and (c)
>not likely going to base their QA decisions on data from untrusted
>sources. It is also likely that they will retest everything anyway.
>Perhaps it might help if the datasets were stratified so that
>"mission-critical" projects are separated from those that aren't.
>If CC only contained the mission-critical projects, and the results
>are still similar, this may at least dispel objections (a) and (b).

[6]. The reviewer is kind to offer a solution to problems (a) and (b), but we prefer a more radical fix; see comments [1] and [3] above. Our discussions with our new data source revealed that they operate in a highly competitive market segment where profit margins are very tight: reducing the cost of the product by even one cent makes a large difference to both market share and profits. Their applications are embedded applications in which, over the last decade, the software component has taken precedence over the hardware; hence their problem is a software engineering problem. Consistent with worldwide statistics on software development, their main challenge is the testing phase. They would like predictors that tell them where the defects are before testing starts, so that they can use their scarce resources efficiently. They also need to answer the questions of how much to test, and what the trade-off is between detecting more defects and testing more. Our research and its results are therefore quite relevant to them. In fact, we have started using our predictor in their development environment with excellent results, and they are now in the process of embedding defect prediction into their software development process.

>For Experiment #2, eye-balling Figure 5, the average balance across all projects appears to be around 0.7. Based on the balance equation, the pd would be less than 60% if pf is less than 10% and less than 65% if pf is less than 25%. pf goes over 40% as pd approaches 100%. To me, these results do not seem like useful predictors. How does this compare to published results using other techniques?

[7]. That baseline data has been added to Section II.
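(For convenience, the reviewer's arithmetic above can be checked mechanically. The sketch below is only illustrative; it assumes the usual definition of balance as the normalized distance from the ideal point pd=1, pf=0, i.e. balance = 1 - sqrt(pf^2 + (1-pd)^2)/sqrt(2), and it contains no new experimental results.)

    from math import sqrt

    def balance(pd, pf):
        # Distance from the ideal point (pd=1, pf=0), normalized so that 1 is best.
        return 1 - sqrt(pf ** 2 + (1 - pd) ** 2) / sqrt(2)

    def pd_implied_by(bal, pf):
        # Back-solve the pd implied by a given balance and pf.
        return 1 - sqrt((sqrt(2) * (1 - bal)) ** 2 - pf ** 2)

    # Reviewer's figures for Experiment #2: balance near 0.7 implies...
    for pf in (0.10, 0.25):
        print(f"pf={pf:.2f} -> pd={pd_implied_by(0.70, pf):.2f}")
    # ...and pd near 1.0 forces pf up to about sqrt(2)*(1 - 0.7) = 0.42.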
In short, we are in the ballpark of other machine learning methods and better than current industrial practice (but see comment [19] below).

>The result of experiment #2 where 100 data points in the training
>set seem to be sufficient most of the time is very interesting.
>However, for this to be a practical guideline, external validity
>should be addressed. In addition to the cited references [17,21],
>does this apply to other datasets as well? How about defect data
>from open source projects? Alternately, is there a theory to explain
>this?

[8]. We are very happy to report that this result holds for data that is _radically_ different from NASA aerospace applications. See comment [3] above.

>There is no section discussing previous related work. As a study
>on predicting defect-prone modules, the authors miss a large body
>of existing work on defect prediction, especially prediction based
>on history data.

[9]. We agree. See comment [4] above.

>Does not address some of the common questions with cross-company
>data. For example: - Is the development process homogeneous? Do
>they have similar QA processes? Are the data collection procedures
>similar? - Is there a uniform definition of "defect"? If there
>is, does it include inspection defects? Is it only post-release
>defects? Are these defects based on problem reports? - And if
>there were non-uniform processes or definitions, what steps were
>taken to standardize the metrics? These should be discussed as
>part of threats to validity.

[10]. One challenge with data mining on real-world data sets is that much of the contextual knowledge is unavailable; all we can usually get is the source code. Other important information relating to software process, measurement control, etc. may not be available. It is certainly true that our predictors would be better if we could add this information to our data sets. However, the absence of that information is not a fatal flaw in this approach: even without it, we can generate useful predictors (and the new Sections II.B and IV.D now have extensive notes on the meaning of "useful").

>Other comments:
>
>Missing keywords.

[11]. Added.

>Page 2 (also in Abstract): no explanation was given why the authors
>decided to do defect detection rather than effort estimation. After
>discussing at length about effort estimation in the opening paragraphs
>of Section I, suddenly, the authors presented their work on defect
>detection. Perhaps the discussion on effort estimation can be
>relegated to a "related works" section. Alternately, discuss how
>static code-based predictors can be used to estimate effort.

[12]. This was a rhetorical error on our part and it is corrected in this draft. First, the abstract has been shortened. Second, as stated in the introduction, we have explored CC data for effort estimation elsewhere, so this paper turns its attention to defect prediction. Third, most of our notes on effort estimation have been isolated into a clearly labeled separate section (see II.A).

>Page 4, paragraph 3, 1st bullet: there is no reference associated with Raffo.

[13]. Added.

>Page 5, many of the features listed in the Appendix (Figure 8) are highly correlated.
>Discuss how this affects the independence assumptions of Bayesian classification.

[14]. This is a very interesting point, and we have added notes in the paper on the fascinating results of Domingos and Pazzani (see column II, page 5). They compared the independence assumption of naive Bayes to a theoretical "better Bayes" classifier that knew when to delete correlated attributes; "better Bayes" turns out to be actually better in only a vanishingly small number of cases. This is a truly impressive paper, and we highly recommend it to our graduate students as a shining example of just how good a data mining paper can be:

P. Domingos and M. J. Pazzani, "On the optimality of the simple Bayesian classifier under zero-one loss," Machine Learning, vol. 29, no. 2-3, pp. 103-130, 1997. [Online]. Available: citeseer.ist.psu.edu/domingos97optimality.html
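(To make that point concrete: duplicating a feature, the extreme case of perfect correlation, typically leaves naive Bayes's class decisions almost unchanged even though the independence assumption is then badly violated. The sketch below is only a toy illustration on synthetic data, using scikit-learn rather than the paper's tool chain; it is not one of the paper's experiments.)

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB

    # Synthetic two-class data standing in for defective vs. defect-free modules.
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                               n_redundant=3, random_state=1)

    # Violate independence badly: append an exact copy of the first column.
    X_dup = np.hstack([X, X[:, :1]])

    pred_plain = GaussianNB().fit(X, y).predict(X)
    pred_dup = GaussianNB().fit(X_dup, y).predict(X_dup)

    # The fraction of identical class decisions is typically very high,
    # illustrating Domingos & Pazzani's point that naive Bayes's zero-one-loss
    # decisions are robust to violations of the independence assumption.
    print("fraction of decisions unchanged:", np.mean(pred_plain == pred_dup))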
>Page 7, Figure 2: in addition to the summary in Fig. 2, it may be helpful to list
>the average pd and pf values (for both CC and WC) for each project.

[15]. Added.

>Typo:
>Page 6, Section III.B, sentence after the bullet list: there are two "than"s.

[16]. Fixed.

--------------------------------------------------------------

>Reviewer: 2
>
>Recommendation: Author Should Prepare A Major Revision For A Second Review
>
>I would like to see an extension of the research to projects outside
>the NASA realm. Despite the fact that the data presented are from
>different NASA organizations, this reader still has the feeling
>that the projects are at least related, or similar types of
>applications, and perhaps the development teams all share a particular
>"NASA-type" development mindset.
>I don't require this extension to accept the paper, but it would
>undoubtedly make the research more valuable and more widely applicable,
>and would certainly strengthen the paper.

[17]. We found new data. See comment [3] above.

>The Appendix is unnecessary, and I suggest moving the table of
>static code features into the body of the paper, at page 4, around
>line 12.
>The text there says "tables of examples are formed (like those in
>Fig 1)". But Fig 1 is only a summary of the systems, while the
>table referred to is actually in the Appendix.
>The reader would like to know right then what the features are,
>without having to look for the Appendix.

[18]. Done.

>I don't believe it's meaningful to compare the defect predictors
>in this paper to the detection results by Raffo or the IEEE Metrics
>panel (page 4, ll. 21-42). Those results concern defect finding by
>inspections, and by their nature they detect not just potential
>faulty modules, but they also provide specific debugging information,
>i.e., they find the specific flaws in the code.

[19]. We agree with the reviewer that inspections can find errors AND offer advice on how to fix them. Our static code detectors, on the other hand, can only point to problems; they cannot explain them or offer methods to fix them. (There are other testing regimes that are similarly silent on how to fix errors; e.g. a unit test can report that the output of a program is not what was expected, but it cannot explain why the difference exists nor where to fix it.) But two issues need to be separated:

a) the cost of running an oracle that can indicate problematic code areas;
b) the cost of fixing those problems.

We _can_ compare our results with Raffo/Metrics'02 on issue (a). But, as the reviewer correctly observes, issue (b) remains. Footnote 7 (page 4) explores this issue. Our answer is incomplete in that it requires further data collection; given the currently available data, all we can report on at this time is issue (a).

>The discussion in the middle of page 5 (Bayes theorem) needs to be
>expanded and better explained, and related to the paper's datasets.
>In particular, explain the equation on line 35, and explain what E and E-sub-i represent for your data.

[20]. Done.
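For the reviewer's convenience, here is a brief sketch of the formulation in question, written in the standard naive Bayes form (the notation in the revised section may differ slightly):

\[
  P(H \mid E) \;=\; \frac{P(H)}{P(E)} \, \prod_i P(E_i \mid H)
\]

Here H is the hypothesis that a module belongs to a given class (defective or defect-free), E is the evidence, i.e. the set of static code feature values collected for that module, and E_i is the value of the i-th individual feature. The product over the E_i is exactly where the independence assumption discussed under comment [14] enters.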
>It would also be helpful to mention the specific non-Bayesian methods
>that you investigated in [6] and found did not perform as well.

[21]. Added.

>SYNTACTIC ISSUES:
>Specific places to make the text clearer, or remove typos. Words in single quotes ' ' should be inserted.
>Words in angles < > should be removed.
>(The line numbers refer to the numbers printed in the Manuscript-central printout, not to the actual
>count of lines on each page.)

[22]. Done.