Your username is: timmenzies
Your password is: menzies323

Please click "Author Login" to submit your revision. We look forward to receiving your revised manuscript.

Sincerely,

Robert J. Hall, PhD, FASE
Editor-in-Chief
Automated Software Engineering

COMMENTS FOR THE AUTHOR:

EDITOR'S SUMMARY:

As you can see, there is quite a bit of resistance from Reviewer #2 to this paper. I think the constructive substance of Reviewer #2's comments can be addressed in a major revision, on top of paying close attention to Reviewer #1's suggestions.

In particular, please first pay close attention to, and make sure to address, Reviewer #1's suggestions. Second, please address Reviewer #2's remarks, especially:

(a) address the regression methods discussed by Reviewer #2; actually comparing against them by adding them to the study would be most preferable, but if you feel these methods are somehow off the table, please say in detail why;
(b) the balance of your related work seems a bit out of whack; please give more space to the other authors in the field listed by Reviewer #2, if necessary reducing or compacting your references to your own work;
(c) please address the methodological issues raised as well, such as future-vs-present tests and the question of using project history.

REVIEWER #1:

The paper is well written and its organization is good. I like the background section, which gives a lot of information about the related studies and different viewpoints. Generally speaking, I have a favorable opinion of their research. However, there are important points in this paper that deserve consideration and probably additional data analysis and discussion, which requires a major revision. The detailed comments are as follows:

1) A simple ManualUp method gives very good results, which is surprising. A workshop paper is given in the references, but is there any other supporting publication or evidence in the literature?

2) On page 5, the authors try to justify the use of binary measures. However, could this use of binary outcomes be responsible for:
* the curves in Figure 8? It is highly possible that big modules/classes will have more faults, so counting every defective class as "1" or "true", regardless of its actual count of faults, could produce the curves seen in Figure 8;
* the ceiling effect observed for different learners in many different studies?

3) Re: Figure 8. I am not clear on what corresponds to a point on the x-y axes in this plot. For example, assume that a product has 100 modules of 10 LOC, 10 modules of 15 LOC, and 1 module of 20 LOC. If the x-axis is %LOC, it becomes necessary to mark the three percentile values of total LOC, at 10, 15, and 20 LOC, together with their three corresponding PD values. Or would the authors simply sort all 100 modules of 10 LOC one after another, then the 10 modules of 15 LOC one after another, and so on? If the authors follow the latter approach, this can greatly affect the shape of the curves. Please elaborate on this issue; it might also be useful to look into the mathematical properties of such curves. Relying solely on the fact that an earlier conference paper used this approach (first reference) seems naive. Note that, with the latter approach, different distributions of LOC in different data sets could greatly affect the shape of the curves. (See Sketch 1 below, which contrasts the two constructions.)

4) If AUC(effort, pd) is the area under the curve, this means that the authors always make comparisons against the worst-case scenario (pd jumps to 100% at 100% LOC).
Do the authors have any idea how a random ordering of the modules would perform on average, over many random orderings? This kind of randomness must be considered because, if the total area under the curve is used as a metric, orderings that are actually worse than a random ordering will still look favorable, since there will always be some area under the curve. (See Sketch 2 below.)

5) Normalizing AUC(effort, pd) by the best line in Figure 8 can complicate comparisons rather than simplify them. This is because the best curve can differ from one product to another, so normalizing by the best curve can produce unpredictable results and comparisons. Instead, involving the performance of random orderings in the calculations seems necessary.

6) The authors state that recent results have not improved the performance of different learners. This observation is used as motivation to explore an alternative to the standard goal for learners, and the rest of the paper then explains how the use of this new goal improves the learners' performance. It should be noted that trying different goals for learners more or less at random is an evolutionary process for research, and this approach is perhaps too expensive because it spans many studies. The paper presents interesting results; however, it does not discuss how model builders should go about their research design. It is the initial decisions made in the research design that affect all of the results (in this case, the similar results from various learners developed in different studies). Would there be more effective research guidelines to give software engineering researchers so that they can build learners that matter?

7) Do you assume that effort is proportional to lines of code?
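
Sketch 1 (re: comments 3 and 4 above): a minimal illustration, assuming the per-module reading of the curve construction questioned in comment 3; the variable names, the smallest-first ordering, and the defect labels are invented purely for illustration and are not taken from the paper.

    # A minimal sketch, assuming the per-module construction asked about in
    # comment 3: every module contributes one point, so the LOC distribution
    # directly shapes the curve. Module sizes follow the hypothetical mix in
    # comment 3; the defect labels here are random and purely illustrative.
    import random

    random.seed(1)
    loc = [10] * 100 + [15] * 10 + [20]                   # LOC per module
    defective = [random.random() < 0.2 for _ in loc]      # hypothetical labels

    # Inspect modules smallest-first (the ManualUp ordering of comment 1);
    # a learner would supply its own ordering here instead.
    order = sorted(range(len(loc)), key=lambda i: loc[i])

    total_loc, total_def = sum(loc), sum(defective)
    curve, loc_seen, def_seen = [], 0, 0
    for i in order:
        loc_seen += loc[i]
        def_seen += defective[i]
        curve.append((100.0 * loc_seen / total_loc,       # x: % LOC inspected
                      100.0 * def_seen / total_def))      # y: pd, % defects found

    # Under the percentile reading of comment 3 there would instead be only
    # three points, one per distinct module size (10, 15, 20 LOC).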
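
Sketch 2 (re: comments 4 and 5 above): a minimal Monte Carlo estimate of the average AUC(effort, pd) over random module orderings, which could serve as the baseline asked for in comment 4 instead of, or alongside, normalizing by the best curve; it reuses the loc and defective lists from Sketch 1, and the function names and trapezoid-rule integration are illustrative assumptions, not definitions from the paper.

    # A minimal sketch of a random-ordering baseline for AUC(effort, pd).
    # Assumes the same per-module curve construction as Sketch 1;
    # auc_effort_pd and random_baseline are illustrative names only.
    import random

    def auc_effort_pd(order, loc, defective):
        total_loc, total_def = sum(loc), sum(defective)
        area, x_prev, y_prev = 0.0, 0.0, 0.0
        loc_seen, def_seen = 0, 0
        for i in order:
            loc_seen += loc[i]
            def_seen += defective[i]
            x, y = loc_seen / total_loc, def_seen / total_def
            area += (x - x_prev) * (y + y_prev) / 2.0     # trapezoid rule
            x_prev, y_prev = x, y
        return area

    def random_baseline(loc, defective, trials=1000):
        idx = list(range(len(loc)))
        total = 0.0
        for _ in range(trials):
            random.shuffle(idx)
            total += auc_effort_pd(idx, loc, defective)
        return total / trials

    # A learner's ordering can then be reported as a lift over this expected
    # random area, so an ordering worse than random no longer looks favorable
    # merely because some area always lies under its curve.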
REVIEWER #2:

Defect prediction is an important area for software engineering research and has many potential practical uses in software development. Additionally, I think it is essential to do empirical studies, and I believe that replication is respectable and important. But these authors have churned out at least several dozen papers (I think I counted roughly two dozen by subsets of these authors cited in this paper alone), and I have yet to see anything of practical value described in any one of them. This is a paper production project, and the closing sentence predicts that more are to come ("We hope that this paper prompts a new cycle of defect prediction research ..."), and I predict that their prediction will come true, by them.

I see several major flaws in their research:

1) The point of being able to accurately predict which modules will contain defects is to allow the user to predict the FUTURE. I want to be able to look at data that I have NOW and say which modules will contain defects LATER (and, of course, get the prediction right). However, these authors persist in doing holdout experiments in which they make predictions about NOW from data collected NOW, and then, I suppose, claim that this really tells the user what will happen LATER. What is the evidence for that? Would it convince any practitioner to go ahead and apply something from this paper? Software engineering research should ultimately help practitioners improve the way they engineer software, and I just do not see this paper moving toward that goal. (See Sketch 3 at the end of these comments.)

2) Additionally, the authors restrict attention to static code features for making predictions, when there is substantial evidence that information about the history of modules, such as whether they were defective in the past or have been extensively changed (churn data), is at least as important as static features. But this is not included; of course, since they are not predicting the future, there is no history.

3) The paper restricts attention to machine learning predictors to the exclusion of standard statistical regression models, and I am not sure why. Given that there is evidence that regression models can be very successful at defect prediction, I do not understand why they do not consider linear or binomial regression, for example. (Again, see Sketch 3 below.)

4) They only consider binary classifications: the module is either buggy or not. They say that this is because they only do pre-release analysis and there might be defects after release. Of course that is true, but that is exactly what a project needs predicted.

Another issue of concern is the set of papers cited. As mentioned above, they cite roughly 24 papers by members of this group. You would think that they are the only people doing research of this nature. This is certainly not true. I saw NO papers by Zimmermann, Zeller, or Mockus, all of whom are certainly presences in the field and have done high-quality defect prediction research. Both Zimmermann and Mockus work in industry, by the way. Additionally, I noticed only one paper by Ostrand and Weyuker, who have worked in the field for many years, one paper by Briand, and two by Nagappan (one of his papers appears twice in the references). Each of these researchers has done major studies using industrial software, and all work in industry. The group at AT&T has made predictions for several different systems over many different releases. I am certain that other researchers I have forgotten have been similarly overlooked. And even when a paper by, say, Nagappan or Briand or Weyuker was cited, it was cited only in passing and is not necessarily the most relevant paper to cite. Notice that all of the above-mentioned authors whose work has gone uncited or under-cited are working in an industrial setting and at least have some potential for showing industrial technology transfer, even if their work is not being used yet.

The bottom line is that I do not see a clear value of this paper other than as a springboard for these authors to write yet more papers.
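
Sketch 3 (re: Reviewer #2's points 1-3): a minimal sketch, not the paper's method, of the kind of forward-looking comparison asked for above, fitting a binomial (logistic) regression on one release and testing it on the next; the file names, column names, and feature set (including the churn and prior-defect columns) are placeholder assumptions, not the paper's data.

    # A minimal sketch: fit a binomial (logistic) regression on release N and
    # evaluate on release N+1, so the prediction is genuinely about the
    # future. File names, column names, and the feature list below are
    # placeholder assumptions for illustration only.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    features = ["loc", "cyclomatic", "churn", "prior_defects"]  # placeholder features

    train = pd.read_csv("release_N.csv")            # data available NOW
    test = pd.read_csv("release_N_plus_1.csv")      # the LATER release to predict

    model = LogisticRegression(max_iter=1000)
    model.fit(train[features], train["defective"])

    scores = model.predict_proba(test[features])[:, 1]
    print("cross-release AUC:", roc_auc_score(test["defective"], scores))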