Dear Mr. Tim Menzies:

We have received the reports from our advisors on your manuscript, "Stable Rankings for Different Effort Models", which you submitted to Automated Software Engineering.

Based on the advice received, your manuscript could be reconsidered for publication should you be prepared to incorporate major revisions. When preparing your revised manuscript, you are asked to carefully consider the reviewer comments which are attached, and submit a list of responses to the comments. Your list of responses should be uploaded as a file in addition to your revised manuscript.

PLEASE NOTE: YOUR REVISED VERSION CANNOT BE SUBMITTED IN .PS OR .PDF. IN THE EVENT THAT YOUR REVISED VERSION IS ACCEPTED, YOUR PAPER CAN BE SENT TO PRODUCTION WITHOUT DELAY ONLY IF WE HAVE THE SOURCE FILES ON HAND. Submissions without source files will be returned prior to final acceptance.

In order to submit your revised manuscript electronically, please access the following web site:

    http://ause.edmgr.com/

Your username is: timmenzies
Your password is: menzies323

Please click "Author Login" to submit your revision.

We look forward to receiving your revised manuscript.

Sincerely,

Robert J. Hall, PhD, FASE
Editor-in-Chief
Automated Software Engineering

COMMENTS FOR THE AUTHOR:

Editor:

There is a lot of work to be done here. I believe there is publishable work here that is inadequately explained for readers who are not steeped in the details. There may be some flaws here that will require a bit of re-work in the actual study (cf. remarks about "samples/populations" and "bias/validity"), but I trust that should not be too onerous.

As you will see, Reviewer 1 is quite negative. I believe s/he makes several important points that must be addressed in the revision. All points must be addressed, but please think very hard about, and make very clear your responses/rebuttals to, the following.

1. Please discuss Reviewer 1's distinction of "validity" from "accuracy". Are you indeed claiming your "4 best" are actually better for other projects beyond the data used in the study? Or are you merely saying that, while such a claim might be desirable, you are actually only reporting that in the future researchers should look more closely at the 4 best? I think you mean the latter, but it is up to you to clear this up.

2. I am particularly concerned with Reviewer 1's remark:

    Furthermore, given that the study used the statistics MMER and MMRE, it is known that estimating models fitted by the criteria MMRE and MMER underestimate. In other words, the models with the apparently best fit are biased. The consequence is severe because it means that these two statistics will systematically rank an invalid estimating model higher than a valid estimating model. Thus, a model fitted by the "minimum MMRE" criterion might well be "reliable" but not "valid", because its estimates will consistently be lower than the "expected effort", and thus farther from the actual effort, the truth.

If this is true, then you *really* need to raise the issue of validity and then point this "fact" out. If not, we need to figure out why the reviewer believes this and whether there really is an issue there. Certainly, the way things are now, you seem to be saying your recommended methods are more valid (or more likely to be valid), so if they are instead known to be biased then that is a bad flaw.

3. Please address and clarify the issue of "sample" and "population".
I don't know if this is merely expository or if there is something in the study that needs re-analysis or closer scrutiny from a statistical validity standpoint.

4. Generally: please address *all* the points, questions, etc., raised by both reviewers.

Of course, let me know if you have any questions.

=====================================================================================

Reviewer #1:

General

There are many objections to be made against this study. First, the study needs very major improvements in the reporting. Second, and most importantly, the basic idea of the study does not make sense. Third, the terminology is confusing.

As for the reporting, I had to guess in several places what the authors actually did, for example in section IV.B.

Regarding the terminology, the authors perform an empirical study but do not seem to understand or apply basic statistical concepts and terminology correctly; for example, basic statistical notions like "population" and "sample" seem to elude the authors. These notions are crucial for prediction, because prediction is about finding the truth, not necessarily a model with the highest "accuracy". "Accuracy" only implies that your predictions are consistent, but they may nevertheless be consistently wrong. In the theory of science, this is the distinction between a "reliable" method and a "valid" method. "Validity" is about truth, whereas "reliability" is about consistency. In the context of predicting the effort of a software project, we are interested in finding a prediction model that predicts the actual effort, i.e. the truth.

Using the results from this study as an example: the findings are that method "b" is consistently the most "accurate", in terms of the evaluation metrics used (AR, MER, MRE). However, if I am the project manager of project X, I need to predict the work-hours of project X. Would the authors of this study recommend that I use method "b"? That is, are you confident that method "b" can be generalised to predict other projects using other data sets than your COCOMO data sets? If you are, you need to provide some rationale, because the empirical results only find that method "b" is consistently more "accurate"; how can I trust that it provides more valid predictions?

Another serious objection is against method "b" itself. Removing outlying observations automatically, using the criterion that they are "outlying", cannot be justified. A central issue in a statistical analysis is what to do about outliers. You cannot just remove outliers without a justification, and the fact that these observations have a large MER or AR is no reason alone to remove them from the dataset, just to obtain a dataset with a lower mean MER or AR.

As one example of unclear reporting, regarding the evaluation metrics, it is unclear whether the research method used statistics like the mean or median MRE/MER to compare "methods". The study just reports MER and MRE (and AR), which are not statistics but distance measures for single observations.

Furthermore, given that the study used the statistics MMER and MMRE, it is known that estimating models fitted by the criteria MMRE and MMER underestimate. In other words, the models with the apparently best fit are biased. The consequence is severe because it means that these two statistics will systematically rank an invalid estimating model higher than a valid estimating model.
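To illustrate the asymmetry behind this claim, here is a minimal sketch (the effort figures are invented for illustration only; they are not taken from the study or its data sets):

    # Toy illustration: MRE penalises an over-estimate more than an
    # under-estimate of the same relative size, so selecting a model by
    # "minimum MMRE" tends to reward models whose estimates fall low.
    actual = 1000.0                            # hypothetical actual effort, in hours
    for estimate in (500.0, 2000.0):           # under- and over-estimate by a factor of two
        mre = abs(actual - estimate) / actual  # relative error of a single observation
        print(f"estimate={estimate:.0f}  MRE={mre:.2f}")
    # Prints: estimate=500   MRE=0.50
    #         estimate=2000  MRE=1.00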
Thus, a model fitted by the "minimum MMRE" criterion might well be "reliable" but not "valid", because its estimates will consistently be lower than the "expected effort", and thus farther from the actual effort, the truth.

As another example of unclear, confusing reporting, Section III seems to state that the "method" COCOMIN removes COCOMO cost drivers ("columns", "features"), whereas Figure 5 seems to indicate that "method b" uses COCOMIN but at the same time "column pruning" equals "x", which I interpret as "no removal of cost driver factors". This is confusing.

***********************************

So, to summarise a bit: to the extent that I understand the study, it seems to compare the "prediction accuracy" of "methods", where a "method" basically seems to be a manipulation of the data sample by removal of observations ("rows") and/or variables ("features", "columns", or "cost drivers" in COCOMO terminology). It is far from obvious that such automatic removal of observations results in a data sample that is more representative of the population of software projects, and thus will provide an estimate closer to the actual effort of my unknown project X. On the contrary, it is likely that if one cleverly removes observations, for example outliers, then a model fitted to this data will give a better fit, and hence better "prediction accuracy" and "reliability", and so be ranked higher using MER, AR, and MRE as criteria, but not higher "validity". Two of the superior "methods" (b and c) in the study are actually data sets where observations have been automatically removed. Similarly, a clever removal of variables (columns, features) may also give a better fit. This kind of "pruning" of the data is nonsensical.

What I miss the most in the study is some common sense, i.e. some reason and explanation of why, for example, method "b" should be a more trustworthy way of estimating my project X.

=====================================================================================

Reviewer #2:

This is a nice paper that takes a mature approach to a complex and important problem domain, namely selecting a cost estimation predictor.

The idea that feature selection is one of the most important aspects seems sensible. Given this insight, it is surprising that it isn't factored out and applied separately to the various methods considered, e.g. why look at (i) kNN without it? Also, given its importance, I think you should give more background on it, e.g. on metaheuristic search-based approaches. Also, the problem generalises to searching for feature weights.

More discussion on the choice of data set could help. COCOMO is old. More worrying are the NASA data: overlaps etc. will lead to complex interactions and a loss of independence in the results. Perhaps you should just sample randomly with replacement?

Why use ranks? Ranks throw away information, and we care about effect size.

p3, l41: add value to what?

p6: "in the sequel"????

p8, l48: why is loss count insightful? A method could win 999 times and still have a non-zero loss count of 1.

p9: "this is a ground systems so we should only train our effort estimator using ground system data" -- I don't understand.

Overall, a useful paper that should be published.
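Regarding Reviewer 2's suggestion to "sample randomly with replacement": the following is a minimal sketch of one way such resampling could be set up. The project records, field names, and parameters are invented placeholders, not the study's actual data or code.

    import random

    def bootstrap_resamples(projects, n_resamples=30, seed=1):
        """Draw datasets of the same size as `projects`, sampling rows with replacement."""
        rng = random.Random(seed)
        size = len(projects)
        return [[rng.choice(projects) for _ in range(size)]
                for _ in range(n_resamples)]

    # Hypothetical usage: each row stands in for one project's effort record.
    projects = [{"kloc": 25, "actual_effort": 117},
                {"kloc": 40, "actual_effort": 210},
                {"kloc": 8,  "actual_effort": 36}]
    resamples = bootstrap_resamples(projects, n_resamples=3)
    # Each resample could then be used to train and evaluate an effort estimator,
    # and the competing methods compared across all resamples.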