\documentclass{article}
\title{Reply to reviewers}
\begin{document}
\maketitle

\section{Comments to Editor}

Thank you for your careful review. We attach a (much) shorter revision of our paper. Based on your reviews, we were able to cut away much of the irrelevant or poorly-argued material. The result is a much cleaner, clearer paper that we hope will satisfy the reviewers.

With only a few exceptions, the issues raised by the reviewers relate to material removed from this draft. Hence, we offer replies only to the remaining issues (see below).

As to the editor's comments, a note was sent to us querying the overall point and value of this kind of research. In this paper, we focus heavily on the Arisholm \& Briand bias; i.e., minimize the LOC in the modules predicted to be defective while maximizing the number of defective modules found. This is a goal that we have been asked to achieve by numerous industrial companies.

{\em Nowhere do the authors actually say what is being predicted by these models. I presume it is something like `number of faulty modules', or which modules are likely to be the faulty ones.}

This has been corrected in the text, where we say:
\begin{quote}
In our experiments, we ask our learners to make a binary decision ($defective$, $nonDefective$). All the modules identified as $defective$ are then sorted in order of increasing size (LOC). We then assess their performance by $AUC(Effort,PD)$; i.e., the area under the $effort$-vs-$PD$ curve.
\end{quote}
($Effort$ is defined just before that text as ``\%LOC in the modules that trigger the detector''.)

{\em It would help if the reader could see more clearly the connection between the problem and its (sometimes quite hard to understand) solution.}

We hope that the current, more concise draft better explains our thesis.

\section{Reviewer \#1}

{\em It is not clear why the same (or similar) criteria couldn't be used in both predictor development and evaluation, as in the evaluation of most data mining approaches?}

The issue here is that it is {\em not standard} in the defect prediction literature to use the same criteria for both:
\begin{itemize}
\item the search criteria used internally by the learner to grow the theory, and
\item the evaluation criteria used to assess the resulting model.
\end{itemize}
Our WHICH learner shows that it is possible to build a learner where the evaluation criteria are moved into the core of the learning.

{\em In this study, authors attempted to approximate P and Q (page 20). However, it seems that the values of weights/parameters in equation (4) were randomly determined. Could those values be different for different data sets? If not, what are practical guidelines that can help users determine appropriate values?}

Because of this comment, we now mention in the paper that:
\begin{quote}
WHICH is a new learner with scope for much future experimentation:
\begin{itemize}
\item Should the number of loops be set dynamically?
\item Do different discretization policies improve WHICH?
\item Are there better values for $(\alpha,\beta,\gamma)$ than those in Equation~4?
\end{itemize}
It is possible that answers to these and other questions could improve WHICH's performance, but that is not the point of this paper. To comment on the generality of Lessmann et al.'s result, we only need to show one example where knowledge of the evaluation biases alters which learner ``wins'' a comparative evaluation study. The current version of WHICH offers that example.
\end{quote}
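To make this evaluation bias concrete, the sketch below shows one way to compute the $AUC(Effort,PD)$ measure described in our reply to the editor, once a learner has made its binary $defective$/$nonDefective$ decisions. It is a minimal illustration written for this reply; the function name and toy inputs are ours, and it is not the exact code used in our experiments:

\begin{verbatim}
# Minimal sketch of the AUC(Effort, PD) measure: modules flagged
# "defective" are inspected in order of increasing LOC; Effort is the
# fraction of total LOC read so far and PD is the fraction of truly
# defective modules found so far.
def auc_effort_pd(loc, actual_defective, predicted_defective):
    total_loc = float(sum(loc))
    total_defects = float(sum(actual_defective))

    # Inspect flagged modules smallest-first (the Arisholm & Briand
    # bias: find many defects while reading as little code as possible).
    flagged = sorted((i for i, p in enumerate(predicted_defective) if p),
                     key=lambda i: loc[i])

    effort, pd = [0.0], [0.0]
    loc_read, defects_found = 0.0, 0.0
    for i in flagged:
        loc_read += loc[i]
        defects_found += actual_defective[i]
        effort.append(loc_read / total_loc)        # %LOC read so far
        pd.append(defects_found / total_defects)   # PD so far

    # Area under the effort-vs-PD curve (trapezoidal rule).
    return sum((effort[k] - effort[k - 1]) * (pd[k] + pd[k - 1]) / 2.0
               for k in range(1, len(effort)))

# Toy data: small defective modules found early give a high score.
print(auc_effort_pd(loc=[10, 20, 400, 50, 300],
                    actual_defective=[1, 1, 0, 1, 0],
                    predicted_defective=[1, 1, 1, 1, 0]))
\end{verbatim}

A criterion of this form is usually applied only after learning; WHICH's contribution is that the same criterion can also guide the search for rules during learning.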
{\em The performance comparison results shown in Figures 17--19 did not present statistical significance of differences in performance of different approaches. Therefore, the conclusion that the hypothesis was supported is questionable.}

This is our fault: the statistical tests were included in the first column of Figures 17 to 19, but their presence was not explained or emphasized sufficiently in the previous draft. Because of this reviewer's comment, we now take care to say:
\begin{quote}
The first column of each division shows the results of a Mann-Whitney U test (non-paired, 95\% confidence). Row $i$ gets a higher rank than row $j$ ($j>i$) if its median performance is higher than that of all rows $j$ and Mann-Whitney reports that the distribution of row $i$ is different from those of all rows $j$. We use this statistical test for reasons discussed in the appendix.
\end{quote}

\section{Reviewer \#2}

{\em In the introduction, the authors discuss the study by Lessmann et al. that they use to justify this ceiling effect. However, in the conclusion, they acknowledge that this ceiling effect was observed when assessing learners using accuracy. The extreme dangers of using accuracy to assess learner performance have been well-documented, and this result is of little practical use.}

This comment is in error, and the error stems from our previous draft. Lessmann et al. did not use accuracy. Rather, they used the area under the $PF$-vs-$PD$ (ROC) curve, which does not suffer from the problems that the reviewer (quite rightly) mentions above.

{\em Relative to metric Q, the size of the software module is critical, if that module does not contain faults. There should be a high misclassification cost for such false positives.}

We agree: the size of the module is critical, and that is the case made by Arisholm \& Briand. Our new learner takes module size into account; it is tuned to avoid the wasted effort of triggering inappropriately on such large modules.

\section{Reviewer \#3}

{\em Can the parameters (alpha, beta, gamma) of the training criteria (Eq. 4, p. 20) be learned/estimated/optimized?}

Good suggestion, but that is not the point of this paper. We are exploring that idea in other research.

{\em If observation 2 (above) made by the authors is correct, does it mean that the performance of any data miner can be improved simply by matching its training criteria to the application-dependent testing criteria?}

We suspect not, since not all learners are tunable in this way. For example, Naive Bayes is hard-wired to do what it does.

{\em Can the approach be generalized to other application tasks, and therefore other training/testing criteria?}

Perhaps, but that is not the point of this paper.

\end{document}