
Thank you to the editor and reviewers for their careful commentary
on this paper. Using that feedback, we think we have created a much
better second draft. Much has changed since the last version and we
believe that this new draft addresses the previously stated concerns.

We write to ask one favor. While this new draft needs to be reviewed
at an appropriate and considered pace, we have a May 9 event where
we stand up before an international workshop on repeatable SE
experiments to discuss methodological issues.  For us, it would be
a proud moment indeed if on May 9 we could:

-  announce at that 2008 workshop that we had made significant
   progress on a TSE publication based on a repeatable SE experiment

-  where that experiment was conducted by researchers who first
   met at that workshop in 2007.

So, if possible, and if it is not an imposition, and if it fits with
your other scheduled tasks, would it be possible to have feedback on
this draft before May 9?

Best wishes,
Tim Menzies, 
Burak Turhan,
Ayse Bener 

Editor: 1
>Comments to the Author:

>Both reviewers suggest that the second experiment is more relevant,
>possibly because the first experiment produced a negative result
>indicating that cross-company (CC) prediction is not significantly
>better than a random sample. Such negative result, while
>interesting, is not particularly surprising given the lack of publications
>involving CC defect predictions.

[1]. This was a rhetorical error on our part. We completely agree with this
   comment. What we should have said in the paper was that the conditions
   under which CC prediction is useful are so extreme that we just should
   not do it. Accordingly, the discussion in section VI.d is now much more
   negative.

>Even more significantly, both reviewers question the CC nature of the
>work. The basic reason CC models
>tend to fail is that companies work in different domains, have
>different process, and define/measure defects and other aspects of
>their product and process in different ways. 

[2]. Excellent point. With your permission, we would use the above
   ideas in the new introduction (paragraph 2, section I).

> Given that projects
>came from NASA, it is not clear to what extent the domain, process,
>and measurement varied among the projects throwing doubt on the
>premise that the experiment was performed on CC data.  

[3]. Because of your reviews, we searched for more data and found
   three new data sets. This new data comes not from NASA aerospace
   applications but from a Turkish company that writes software for
   domestic home appliances. Nervously, we repeated our experiments
   on this data and we are happy to report that the WC vs CC patterns
   we found in American NASA rockets also hold for Turkish washing
   machines. While this is not definitive proof of the universal
   external validity of our results, we find it a very compelling
   result suggesting that the experiments of this paper apply to a
   very wide range of software from around the world.

>The other major concern was the lack of comparison to a very rich
>body of research in defect prediction area. Some of the more
>effective prediction techniques rely on project history (see, e.g.,
>[1]). Clearly, projects with nontrivial history are less likely to
>benefit from CC models, while contractor relationships in NASA case
>may not be obviously subject to historic models without having
>several projects for a contractor. Therefore, the lack of value of
>CC techniques in a case where they are arguably the most likely to
>succeed is a concern. Lack of comparisons to a context where
>the defect prediction does appear to add value is disappointing.

[4]. We added extensive notes on other work (see section II.B), including
   a note on paper [1] (see top of column 2, page 3).

[5]. And we have added notes on your above paragraph to the end of
    section IV. See also, point [10] (below).
 
>[1] T. L. Graves, A. F. Karr, J. S. Marron, and H. Siy. Predicting fault
>incidence using software change history. IEEE Transactions on
>Software Engineering, 26(2), 2000.

>=======================================
>
>REVIEWS
>
>Reviewer: 1
>
>Major issues:
>
>I'm not convinced of the usefulness of the results of either
>experiment. Given that most projects have more than 90% defect-free
>modules, a sizeable number of false positives will need to be tested.
>From experiment, the pf median is at 52% for CC which means over
>45% of modules with no defects (assuming 90% defect-free) will need
>to be tested. This seems to be no better than picking a module at
>random. The assertion that this may be applicable for mission-critical
>systems is doubtful because organizations building mission-critical
>systems are: (a) very likely following different and highly specialized
>development processes, (b) developing more complex systems, and (c)
>not likely going to base their QA decisions on data from untrusted
>sources. It is also likely that they will retest everything anyway.
>Perhaps it might help if the datasets were stratified so that
>"mission-critical" projects are separated from those that aren't.
>If CC only contained the mission-critical projects, and the results
>are still similar, this may at least dispel objections (a) and (b).

[6]. The reviewer is kind to offer a solution to problems (a) and (b) but
   we prefer a more radical fix. See comment [3] above.

	Our discussions with our new data source revealed that they operate
	in a highly competitive market segment where profit margins are
	very tight: reducing the cost of the product by even one cent makes
	a huge difference in both market share and profits. Their
	applications are embedded applications in which, over the last
	decade, the software component has taken precedence over the
	hardware. Hence their problem is a software engineering problem.
	Consistent with worldwide statistics on software development, their
	main challenge is the testing phase. They would like predictors to
	tell them where the defects are before testing starts, so that they
	can use their scarce resources efficiently. Moreover, they also
	need to answer the questions of how much to test, and what is the
	trade-off of detecting more defects at the cost of testing more.
	Therefore our research and its results are quite relevant for them.
	As a matter of fact, we have started using our predictor in their
	development environment and obtained excellent results. Now they
	are in the process of embedding defect prediction into their
	software development process.
	
>For Experiment #2, eye-balling Figure 5, the average balance across
>all projects appears to be around 0.7. Based on the balance equation,
>the pd would be less than 60% if pf is less than 10% and less than
>65% if pf is less than 25%. pf goes over 40% as pd approaches 100%.
>To me, these results do not seem like useful predictors. How does
>this compare to published results using other techniques?

[7]. That baseline data has been added to section II. In short, we are
     in the ballpark of other machine learning methods and better
     than current industrial practice (but see comment [19] below).
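For reference, the reviewer's arithmetic can be reproduced with a small
sketch. This assumes the usual definition of balance as one minus the
normalized Euclidean distance from the ideal point (pf=0, pd=1); the
specific pd/pf values below are the reviewer's readings of Figure 5, not
new results:

```python
from math import sqrt

def balance(pd, pf):
    """Balance: 1 minus the normalized Euclidean distance from the
    ideal point (pf=0, pd=1). Higher is better; 1.0 is perfect."""
    return 1 - sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / sqrt(2)

b1 = balance(0.60, 0.10)   # pd=60%, pf=10%
b2 = balance(0.65, 0.25)   # pd=65%, pf=25%
# Both land near the ~0.7 the reviewer reads off Figure 5.
```

This confirms that the reviewer's back-of-the-envelope pd/pf figures are
consistent with an average balance of about 0.7.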

>The result of experiment #2 where 100 data points in the training
>set seem to be sufficient most of the time is very interesting.
>However, for this to be a practical guideline, external validity
>should be addressed. In addition to the cited references [17,21],
>does this apply to other datasets as well? How about defect data
>from open source projects? Alternately, is there a theory to explain
>this?

[8]. We are very happy to report that this result holds for data
     that is _radically_ different to NASA aerospace applications. See
	 comment [3] (above)

>There is no section discussing previous related work. As a study
> on predicting defect-prone modules, the authors miss a large body
>of existing work on defect prediction, especially prediction based
> on history data.

[9]. We agree. See comment [4] above.

>Does not address some of the common questions with cross-company
>data. For example: - Is the development process homogeneous? Do
>they have similar QA processes? Are the data collection procedures
>similar?  - Is there a uniform definition of "defect"? If there
>is, does it include inspection defects? Is it only post-release
>defects? Are these defects based on problem reports?  - And if
>there were non-uniform processes or definitions, what steps were
>taken to standardize the metrics?  These should be discussed as
>part of threats to validity.

[10]. One challenge with data mining on real-world data sets is
     that much of the contextual knowledge is unavailable. All we
     can usually get is the source code. Other important issues
     relating to software process, measurement control, etc. may not
     be available. It is certainly true that our predictors would be
     better if we added this information to our data sets. However,
     the absence of that information from our data is not a fatal
     flaw in this approach. We note that even without that data, we
     can generate useful predictors (and the new sections II.B and
     IV.D now have extensive notes on the meaning of "useful").

>Other comments:
>
>Missing keywords.

[11]. Added

>Page 2 (also in Abstract): no explanation was given why the authors
>decided to do defect detection rather than effort estimation. After
>discussing at length about effort estimation in the opening paragraphs
>of Section I, suddenly, the authors presented their work on defect
>detection. Perhaps the discussion on effort estimation can be
>relegated to a "related works" section. Alternately, discuss how
>static code-based predictors can be used to estimate effort.

[12]. This was a rhetorical error on our part and it is corrected
     in this draft. Firstly, the abstract has been shortened.
     Secondly, as stated in the introduction, we have explored CC
     data for effort estimation elsewhere, so this paper turns its
     attention to other areas (defect prediction). Thirdly, most of
     our notes on effort estimation have been isolated in a clearly
     labeled separate section (see section II.A).

>Page 4, paragraph 3, 1st bullet: there is no reference associated with Raffo.

[13] Added

>Page 5, many of the features listed in the Appendix (Figure 8) are highly correlated. 
> Discuss how this affects the independence assumptions of Bayesian classification.

[14]. This is a very interesting point and we have added notes in the
     paper on the fascinating results of Domingos and Pazzani (see
     column 2, page 5). They compared the independence assumption of
     naive Bayes to a theoretical "better Bayes" classifier that knew
     when to delete correlated attributes. "Better Bayes" turns out
     to actually be better in only a vanishingly small number of
     cases. This is a truly impressive paper and we highly recommend
     it to our graduate students as a shining example of just how
     good a data mining paper can be:

     P. Domingos and M. J. Pazzani, "On the optimality of the simple
     Bayesian classifier under zero-one loss," Machine Learning, vol. 29,
     no. 2-3, pp. 103-130, 1997. [Online]. Available:
     citeseer.ist.psu.edu/domingos97optimality.html
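The Domingos-Pazzani point can be illustrated with a toy sketch (the
numbers below are hypothetical, not from our data sets): giving naive
Bayes a perfectly correlated (duplicated) attribute makes its probability
estimates overconfident, but under zero-one loss it still picks the same
class:

```python
def naive_bayes_posterior(priors, likelihoods):
    """priors: {class: P(class)}; likelihoods: {class: [P(f_i|class), ...]}.
    Multiplies prior by each per-attribute likelihood (the naive
    independence assumption), then normalizes to a posterior."""
    scores = {}
    for c in priors:
        p = priors[c]
        for lk in likelihoods[c]:
            p *= lk
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

priors = {"defective": 0.5, "clean": 0.5}
one_copy = {"defective": [0.8], "clean": [0.3]}              # one attribute
two_copies = {"defective": [0.8, 0.8], "clean": [0.3, 0.3]}  # same attribute twice

p1 = naive_bayes_posterior(priors, one_copy)
p2 = naive_bayes_posterior(priors, two_copies)
# The winning class is unchanged; only the (over)confidence shifts.
assert max(p1, key=p1.get) == max(p2, key=p2.get) == "defective"
```

The duplicated attribute inflates the posterior for "defective" but does
not flip the classification, which is the essence of the Domingos-Pazzani
optimality result.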
	

>Page 7, Figure 2: in addition to the summary in Fig. 2, it may be helpful to list 
> the average pd and pf values (for both CC and WC) for each project.

[15]. Added

>Typo:
>Page 6, Section III.B, sentence after the bullet list: there are two "than"s.

[16]. Fixed


 --------------------------------------------------------------

>Reviewer: 2
>
>Recommendation: Author Should Prepare A Major Revision For A Second Review
>
>I would like to see an extension of the research to projects outside
>the NASA realm.  Despite the fact that the data presented are from
>different NASA organizations, this reader still has the feeling
>that the projects are at least related, or similar types of
>applications, and perhaps the development teams all share a particular
>"NASA-type" development mindset.
>I don't require this extension to accept the paper, but it would
>undoubtedly make the research more valuable and more widely applicable,
>and would certainly strengthen the paper.

[17]. We found new data. See comment [3] above

>The Appendix is unnecessary, and I suggest moving the table of
>static code features into the body of the paper, at page 4, around
>line 12.
>The text there says "tables of examples are formed(like those in
>Fig 1)".  But Fig 1 is only a summary of the systems, while the
>table referred to is actually in the Appendix.
>The reader would like to know right then what the features are,
>without having to look for the Appendix.

[18]. Done

>I don't believe it's meaningful to compare the defect predictors
>in this paper to the detection results by Raffo or the IEEE Metrics
>panel (page 4, ll. 21-42) Those results concern defect finding by
>inspections, and by their nature they detect not just potential
>faulty modules, but they also provide specific debugging information,
>i.e, they find the specific flaws in the code.

[19]. We agree with the reviewer that inspections can find errors AND
     offer advice on how to fix them. Our static code detectors, on
     the other hand, can only point to problems, not explain them or
     offer methods to fix them. (Actually, other testing regimes are
     also silent on how to fix errors; e.g. a unit test can report
     that the output of a program is not what was expected, but it
     cannot explain why the differences exist nor where to fix them.)
     Here are the two issues that need to be separated:

     a) the cost of running an oracle that can indicate problematic
        code areas;
     b) the cost of fixing those problems.

     We _can_ compare our results with Raffo/Metrics'02 on issue (a).
     But, as the reviewer correctly observes, issue (b) remains.

     Footnote #7 (p4) explores this issue. Our answer is incomplete in
     that it requires further data collection. Given the currently
     available data, all we can report on at this time is issue (a).

>The discussion in the middle of page 5 (Bayes theorem) needs to be
>expanded and better explained, and related to the paper's datasets.
>In particular, explain the equation on line 35, and explain what E and E-sub-i represent for your data.  

[20]. Done

>It would also be helpful to mention the specific non-Bayesian methods
>that you investigated in [6] and found did not perform as well.

[21]. Added

>SYNTACTIC ISSUES:

>Specific places to make the text clearer, or remove typos.  Words in single quotes '  ' should be inserted. 
>Words in angles <  > should be removed. 
>(The line numbers refer to the numbers printed in the Manuscript-central printout, not to the actual 
> count of lines on each page. )

[22]. Done