====================================== REVIEWS

Reviewer: 1
Recommendation: Author Should Prepare A Major Revision For A Second Review

Comments:
I think this paper asks an extremely important question. Until it is properly addressed, publishing more and more "here's an empirical study comparing techniques A and B on data set C" type papers is pointless. However, I have significant concerns with the answer. The authors are right to suggest that high variability in accuracy measures such as the absolute residual means that some non-robust indicators of center, e.g. MMRE, are going to be unduly sensitive. BUT ... V -> I, where V is high variance in the residuals and I is conclusion instability, does not mean that V is necessary for I, i.e. I -> V does not follow. So I think the paper misses many other factors and makes over-strong conclusions.

Other factors that are likely to be relevant include heterogeneity of data sets and methods. The paper isn't very detailed on this, but I would guess the 2 data sets are similar because they share the same features. I'm unfamiliar with COSEEKMO, but the fact that (presumably) expert judgement and k-NN methods are excluded may be relevant. In addition, there is work by Shepperd and Kadoda (2001, TSE) where they use simulated data and find that no one prediction technique dominates (i.e. conclusion instability). You should relate your work to their conclusions.

So all in all I think the situation is more complex than simply using a *robust* test of difference in sample centers. In any case, using the Mann-Whitney U test is hardly new - it's just the non-parametric analog of the t-test. From an extremely quick search I find that as early as 1999 Myrtveit and Stensrud used the U test when comparing analogy and OLS regression (TSE 1999). In addition, the problem goes away as more and more samples are taken from the underlying population (as a ballpark figure, 30 should be sufficient). Also try to do more scene setting for the non-specialist readership, i.e. motivate me early on.
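[For readers unfamiliar with the test the reviewer keeps referring to, here is a minimal illustrative sketch of the Mann-Whitney U test applied to two methods' absolute residuals. The residual values are made up for demonstration, and the p-value uses the large-sample normal approximation without tie or continuity corrections, so treat this as a teaching sketch rather than a substitute for a statistics library.]

```python
# Sketch of the Mann-Whitney U test: the non-parametric analog of the
# t-test, comparing the centers of two independent samples via ranks.
from math import sqrt
from statistics import NormalDist

def mann_whitney_u(sample1, sample2):
    """Return (U, two-sided p-value) for two independent samples."""
    n1, n2 = len(sample1), len(sample2)
    # Pool the observations, remembering which sample each came from.
    pooled = sorted([(v, 0) for v in sample1] + [(v, 1) for v in sample2])
    # Assign ranks 1..N, giving tied values their average rank.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0   # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    # Rank sum for sample 1, then the U statistic.
    r1 = sum(r for r, (_, src) in zip(ranks, pooled) if src == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    u = min(u1, n1 * n2 - u1)
    # Large-sample normal approximation for the two-sided p-value
    # (no tie or continuity correction - illustrative only).
    mu = n1 * n2 / 2.0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    p = 2.0 * NormalDist().cdf(-abs((u - mu) / sigma))
    return u, p

# Hypothetical absolute residuals from two estimation methods.
residuals_a = [12.0, 3.5, 40.2, 7.1, 15.8, 2.2, 9.9, 30.4]
residuals_b = [22.5, 18.0, 55.1, 14.3, 27.6, 8.8, 19.2, 61.0]
u, p = mann_whitney_u(residuals_a, residuals_b)
print(f"U = {u}, two-sided p = {p:.3f}")
```

Because the test operates on ranks rather than raw values, a single enormous residual cannot dominate the comparison, which is exactly the robustness property under discussion.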
More detailed comments:
- Given the various studies, simulations, etc., I do not know why researchers continue to use such obviously flawed and biased accuracy indicators as RE or MMRE (p5).
- Mann-Whitney is widely known and understood. I would not waste pages on it. Just point out that it tests the difference between sample medians. Those unfamiliar can go and look it up themselves. In contrast, you don't explain COCOMO, so be consistent - and I don't mean explain COCOMO either! (pp6-7)
- What is a CBR-style data set?? (p8)
- (a, ... , i) is 9 methods! (p9)
- et. al. -> et al.
- p10 and onwards: It is extremely hard to work out what different prediction methods you are comparing, e.g. there are lots of things you can do with a wrapper, so much more info is needed here. From the algorithmic complexity you give I assume you are doing an exhaustive search -- so say so!
- Figs 5-7 are hard to compare. Also, we don't know that it is the same case that loses, since you just count (unless losses = 0).
- p18: OLS minimizes the SSRs (sums of squared residuals), not the differences. Also the notation is non-standard. Why not X_{i}? For that matter, why bother to explain OLS??
- p20: Ref [35] authors are incorrect? [Author note: XXX I believe so. Does the reviewer have another name?]

Despite all of the above, I hope that the authors work with these problems and that this paper gets into the public domain asap.

How relevant is this manuscript to the readers of this periodical? Please explain under the Public Comments section below.: Very Relevant
Is the manuscript technically sound? Please explain under the Public Comments section below.: Partially
1. Are the title, abstract, and keywords appropriate? Please explain under the Public Comments section below.: No
2. Does the manuscript contain sufficient and appropriate references? Please explain under the Public Comments section below.: Important references are missing; more references are needed
3. Please rate the organization and readability of this manuscript.
Please explain under the Public Comments section below.: Difficult to read and understand
Please rate the manuscript. Explain your rating under the Public Comments section below.: Fair

Reviewer: 2
Recommendation: Author Should Prepare A Major Revision For A Second Review

Comments:
The authors demonstrate conclusion stability for a portfolio of 154 methods (I would prefer the term 'instances of a method', as just changing some parameters in a process does not create a new method) when applied to 19 subsets of two sources. This is a revision of former results achieved by KFM claiming the opposite.

The main limitation of the research, to me, is the use of just two data sets (coming from 1981 and 1993). Even though there are subsets created, this still does not change anything about this limitation. This is especially true for the type of conclusions drawn to demonstrate conclusion stability. I cannot see the reason why no further data sets are used to validate the stated proposition. [Author note: XXX tse as good]

In addition, the following questions should be addressed for re-submission of the paper:
1. In Figure 1, the Y-axis is labeled as method2(MRE) - method1(MRE). However, the explanation is "Results of two different runs of COSEEKMO comparing two methods using mean MRE values." Please clarify whether MRE or MMRE is used. [Author note: removed]
2. It is unclear why a Gaussian distribution is used as an assumption for running the t-test. In the last paragraph of page 5, the authors say: "Such t-tests assume that the distributions being studied are Gaussian and ...". However, reference [15] (assumed to back up this assumption) does not say this (see the last paragraph of page 6 in [15]). [Author note: i removed] See the links below for references discussing the assumptions of the t-test:
http://www.fammed.ouhsc.edu/tutor/tstds.htm
http://www.csic.cornell.edu/Elrod/t-test/t-test-assumptions.html
http://www.basic.northwestern.edu/statguidefiles/ttest_unpaired_ass_viol.html
3. Regarding the example of the U-test in Fig. 3 on page 7, explanations of some terms and variables are necessary to make the example complete and easy to understand, in order for the readers to follow the U test in section III Experiments. (1) N1 and N2 are used in Fig. 3 without explanation. (2) The terms ties, wins and losses need explanation. (3) The last sentence "In this work, ..." and the bullets should be part of the experimental design of this research, not the results of this example. (4) The operation of adding one to either losses or wins remains unclear to me. [Author note: more notes]
4. The results of the experiments on page 13 cannot be seen to be concluded from the U-test. What is the statistical significance of the comparison in terms of p-value and confidence level? [Author note: 95% confidence] The answer to the above question is critical to the paper, because the "statistical tests that distinguish between methods" is a major contribution, given the limitation of the generalization of the results as stated in the last paragraph of section V "External Validity".
5. Regarding the NEAREST neighbor method i, the two statements: 1) "...(NEAREST neighbor method) is inappropriate for sparse data sets [32]." and 2) "...(NEAREST neighbor method) better suited to sparse data domains...[30]." are contradictory, and it is not clear how the conclusion "...it is not surprising that method I performs poorly on our data" was drawn.

How relevant is this manuscript to the readers of this periodical? Please explain under the Public Comments section below.: Very Relevant
Is the manuscript technically sound? Please explain under the Public Comments section below.: Partially
1. Are the title, abstract, and keywords appropriate? Please explain under the Public Comments section below.: Yes
2. Does the manuscript contain sufficient and appropriate references? Please explain under the Public Comments section below.: References are sufficient and appropriate
3. Please rate the organization and readability of this manuscript.
Please explain under the Public Comments section below.: Readable - but requires some effort to understand
Please rate the manuscript. Explain your rating under the Public Comments section below.: Good

Reviewer: 3
Recommendation: Reject

Comments:
Major Problems

1. Misleading summation of different papers.
The three papers on which the authors base their paper (i.e. Kitchenham et al., Foss et al., and Myrtveit et al.) all make very different points, and their results have different implications:
- Kitchenham et al. point out that MMRE is a measure of spread (standard deviation) whereas pred(N) is a measure of kurtosis. The implication is that since they measure different properties they may give contradictory results.
- Foss et al. demonstrate that both MMRE and MedianMRE are extremely biased and will prefer a model that consistently overestimates to one that is unbiased with respect to under- and over-estimation. The implication of this is that techniques that optimise with respect to MMRE (such as some data-intensive analogy methods) will be untrustworthy. Note: the fact that the problem still occurs with the median suggests that it is NOT just a problem of large outliers. (See below for the most likely explanation.)
- Myrtveit et al. were concerned with comparisons of different types of model (e.g. regression v. machine learning). This paper is thus more relevant to this study than the other two papers. They confirm that machine learning methods that optimise with respect to MMRE will cause MMRE accuracy metrics to favour machine learning methods. They also concluded that the decision about which model is best was to a certain extent determined by the sample. Note that they did not test whether the actual result on a single sample would support the null hypothesis; they only report comparing the sample accuracy metrics.
Treating these three papers as equivalent is incorrect.

2. Not considering another interpretation of the results.
The term "conclusion instability" is misleading.
With a valid measure of estimation accuracy (avoiding ratio-based metrics), "conclusion instability" is not necessarily a problem; it may simply be caused by the null hypothesis being true. If the null hypothesis (of no significant difference) is true for two modelling processes and a specific data set, taking different partitions of the data set for test and validation will always lead to some occasions when one model has better accuracy statistics than the other, and vice versa.

3. Not considering the properties of the ratio of two variables.
The fact that the authors found "large and small instability (i.e. MRE standard deviation) for both large and small data sets" may be due to an underlying problem with ratio metrics. The distribution of the ratio of two random variables cannot always be specified theoretically, but we know that in some cases the ratio of two variables follows a Cauchy distribution (i.e. both the ratio of two independent standard normal variables and the t-distribution with one degree of freedom are distributed as Cauchy variables). A property of the Cauchy distribution is that it has no moments - this means that the mean and variance are undefined, and there will be no convergence towards a mean or variance as sample sizes increase. It appears that the MRE is exhibiting properties consistent with a Cauchy distribution. This is a separate issue from so-called "conclusion instability", and it is not a property of non-ratio metrics such as those based on residuals, e.g. the absolute or squared residuals.

Other issues

1. It is unscientific to suggest that the results of a well-conducted systematic review such as Jørgensen's are incorrect without reviewing each primary study identified in Jørgensen's review and confirming that the process used in each individual study was invalid. The best that could be said is that problems with the MMRE may have biased some of the results of the primary studies.

2. The use of the Mann-Whitney test assumes that the two sets of values are independent.
However, if you use the same partition of the data to construct two models and then make predictions for the same set of validation projects, it is difficult to argue that you have independent data points in each sample. A paired test would appear to be preferable.

3. If the authors are not going to use the results of a statistical test, and intend to base their assessment just on a summary statistic, they might as well use the median (preferably of the squared or absolute residuals), because the median is a robust measure of central tendency.

4. I applaud the authors' attempt to use only publicly available data, but the original COCOMO dataset is a bad choice because it does not correspond to the COCOMO model. For example, the Timing data includes adjustment factor values that are not found in the Intermediate COCOMO model. The authors do not explain what they have done about the inconsistencies in the COCOMO data.

How relevant is this manuscript to the readers of this periodical? Please explain under the Public Comments section below.: Relevant
Is the manuscript technically sound? Please explain under the Public Comments section below.: No
1. Are the title, abstract, and keywords appropriate? Please explain under the Public Comments section below.: Yes
2. Does the manuscript contain sufficient and appropriate references? Please explain under the Public Comments section below.: References are sufficient and appropriate
3. Please rate the organization and readability of this manuscript. Please explain under the Public Comments section below.: Easy to read
Please rate the manuscript. Explain your rating under the Public Comments section below.: Poor
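[Reviewer 3's point about ratio metrics having no moments can be checked with a short simulation, included here purely as an illustration and not as part of the manuscript under review: the ratio of two independent standard normal variables is standard Cauchy, so its sample median converges to the true median of 0 while its sample mean is dominated by enormous outliers and never settles down.]

```python
# Illustrative simulation: the ratio of two independent standard normals
# is standard Cauchy, a distribution with no mean or variance. Robust
# statistics (the median) behave well; moment-based ones (the mean) do not.
import random
from statistics import median

random.seed(42)  # fixed seed so the run is reproducible

n = 100_000
ratios = [random.gauss(0, 1) / random.gauss(0, 1) for _ in range(n)]

# The median is a robust measure of center: close to the true value 0.
print(f"sample median: {median(ratios):.4f}")

# The mean is dominated by huge outliers from the heavy tails, and would
# not converge even if n were increased indefinitely.
print(f"sample mean:   {sum(ratios) / n:.4f}")
print(f"largest |ratio| seen: {max(abs(r) for r in ratios):.1f}")
```

This is the same reason a median of absolute or squared residuals, as the reviewer suggests, gives a far more stable summary than a mean of ratio-based errors such as MRE.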