Subject: IEEE Software, SWSI-0089-0605 major revision required
From: software@computer.org
Date: Thu, 14 Jul 2005 14:42:12 -0400 (EDT)
To: timm@cs.pdx.edu

IEEE Software, SWSI-0089-0605
Manuscript type: Special issue on Best Papers from the 2005 PROMISE Workshop
"Software Cost Models: When Less is More"

Dear Dr. Menzies,

The review process for the above-referenced manuscript is complete. After carefully examining the manuscript and reviews, the editor in chief has decided that the manuscript needs major revisions before it can be considered for a second review. We encourage you to revise your manuscript and resubmit it.

If you are planning to send a revision, please do so by August 8. We realize the time is short for a major revision, but because of the nature of special issues we must have your revision by then in order to allow time for the second round of reviews and still meet the production deadline. Please maintain our 5,400-word limit as you make your revisions.

The reviewer comments are attached below. If necessary, please refer to the instructions on how to upload the revision included below. If you have any questions regarding our policies or procedures, please refer to the IEEE Software Author Center at http://www.computer.org/mc/software/author.htm.

We look forward to receiving your revised manuscript.

Regards,
Hilda Hosillos
Magazine Assistant
IEEE Software
software@computer.org

========================================
Author Instructions

You should upload your revision and summary of changes to http://cs-ieee.manuscriptcentral.com by doing the following:

1) Log on to your Author Center.
2) Scroll down to the section called 'Manuscripts to be Revised'.
3) Click on the View Comments/Respond button of the paper with your original log number (SWSI-0089-0605). The '.R1' appended to the log number indicates that it is a revision.
4) Scroll down to the very bottom of the screen to the free-text boxes called 'Response to Editor:' and 'Responses to Reviewer:'. Paste in your responses to the editor and reviewers here. If you prefer to upload a file containing your responses, please indicate this in the free-text box.
5) Click on the 'Save Response' button.
6) Click on the title link to upload the new revision. Click on OK to continue.
7) This will take you to the File Upload screen (Screen 10 of 12). You may click on the 'Previous' button if you need to make changes to the title, names of contributing authors, abstract, etc.
8) After you have uploaded the file(s) on Screen 11, proceed to Screen 12.
9) Click on the 'Submit Your Manuscript' button.

IMPORTANT - If you do not receive a system-generated message confirming the successful upload of your revision, then the upload was not complete. Contact the magazine assistant (software@computer.org) immediately for assistance.

========================================
Editor comments

Two reviewers recommend a major revision, largely because the paper is not clearly written. The topic is very relevant, but the message is cryptic. The paper needs a major revision to ensure that IEEE Software readers can benefit from the article.

========================================
Reviews

Reviewer 1

Section I. Overview

A. Reader Interest
1. How relevant is this manuscript to the readers of this periodical? Please explain your rating in the Confidential Comments section.
( ) Very Relevant (X) Relevant ( ) Interesting - but not very relevant ( ) Irrelevant

B. Content
1. Please summarize what you view as the key point(s) of the manuscript and the importance of the content to the readers of this periodical.

The biggest problem with this paper is that the data set sample sizes are much too small. As a result, this skews the results presented in Figure 1. A review of the t03 data set illustrates the problem. Assuming 66% of the data is allocated to training and 33% to testing, then 7 tuples are used to build the model and 3 are used to test the model. Figure 1 states that 28% and 62% of the results (I assume this refers only to the test data) are within pred(30) for the "before" and "after" respectively. For 3 test tuples, the 28% maps to roughly 1 out of 3 and the 62% maps to 2 out of 3. Although it is a gain of 221% (relative), it only corresponds to a gain of 1 project (absolute). Furthermore, the "after" column uses results from the best reduction level. A more realistic approach would show "before" against a specific level (e.g. FS02).

It is interesting to note that if c01..c03, p02..p03, and t02..t03 are consolidated using a weighted average, the results from Figure 1 look as follows:

  # projects   "After/Before" column
  24           341%   (t02..t03)
  44           500%   (p02..p03)
  56           221%   (c01..c03)
  60           120%
  63           116%
  119          106%
  161          101%

Excluding the p0X results, the impact of feature reduction diminishes as the sample size grows. This reviewer recommends that this paper be accepted provided that c01..c03, p02..p03, and t02..t03 are consolidated and results are presented in both relative and absolute terms. If "before" is 10% and "after" is 15%, then the relative improvement is 50% ((15% - 10%)/10%) and the absolute improvement is 5% (15% - 10%). (A worked sketch of this calculation appears after this review.) I believe this will address many of the concerns mentioned below.

2. Is the manuscript technically sound? Please explain your answer in the Confidential Comments section.
( ) Yes ( ) Appears to be - but didn't check completely (X) Partially ( ) No

3. What do you see as this manuscript's contribution to the literature in this field?

One of the biggest challenges in Empirical Software Engineering is building models with little or no data. This paper demonstrates that good models can be built using less data.

4. What do you see as the strongest aspect of this manuscript?

1) Data mining in implicitly/explicitly data-starved Software Engineering domains is a challenge. The paper presents an interesting technique for addressing data starvation through feature reduction. It would be amusing if a future version of this paper ends up discarding COCOMO X in favor of some SLOC count.
2) The paper is easy to read. Ideas flow nicely.
3) The bottom paragraph on page 2 (how this paper extends previous work) is a very nice touch. It sets a nice context for this paper.

5. What do you see as the weakest aspect of this manuscript?

1) The paper argues when less is "NOT" more on page 3. If the paper is assuming a "data mining" applied to "Software Engineering" context, then reason 1 does not make much sense. If no data is available, then there is no form of algorithmic and/or machine-learning modeling.
2) Figure 1 shows impressive results, especially with data sets 5 through 12. However, the sample sizes in data sets 5 through 10 are quite small. What about merging c01..c03, p02..p04, and t02..t03? This would yield c0X = 56, p0X = 48, and t0X = 24.
3) Figure 1 divides results into 2 subtables. What was the reasoning for this partition? Is it based on sample size (larger samples in the top subtable)? Or is it based on public versus private data?
4) Also, the aggregate results in Figure 1 (the row labeled "mean") use a simple average. Since there is a relatively broad range of sample sizes, this reviewer recommends using a weighted average.
5) On page 2 the authors claim, "If experience can tell us when to add variables, it should also be able to tell us when to subtract variables." It is assumed that "experience" refers to human-based experience. If so, then is it really necessary to build statistical/ML-based models?
6) Figure 2 is a bit confusing. It does not seem to add much value to the paper. This reviewer suggests removing the figure. It seems that the authors are proposing how to incorporate variable reduction into the data mining process.
7) In section 3 ("Why Subtract Variables"), the authors provide business reasons why it is important to subtract variables. In the second bullet, the authors provide an example of assessing competitive bids. There are two concerns regarding this example: A) It is an obvious example in which Time and Money will be the most prominent driving forces. B) This data mining type of problem is a lot easier than the case study presented in the paper. The reason for this claim is that the second bullet is an assessment type of problem as opposed to a prediction type of problem.
8) In section 3 ("Why Subtract Variables"), the authors argue for subtracting variables because of "Irrelevancy." Essentially, this is a Type I / Type II issue (only throw away the irrelevant, but keep the significant). Jumping ahead to Figure 5, there is a blur between relevant and irrelevant. That is, the paper argues the irrelevancy issue but does not deliver consistent results in Figure 5.
9) In section 3 ("Why Subtract Variables"), the authors argue for subtracting variables because of "Under-sampling." A) For the paper "in preparation," the authors might want to consider the question, "How many features are enough?" That is, how does sample size drive feature set size? B) The argument made on page 6, first paragraph, assumes an equal distribution of the variables, which is not normally the case. Thus, the 88% and 3.5% results are "worst case scenarios." C) Figure 3 suggests a technique which combines feature reduction (the theme of this paper) with instance reduction (reducing cplx in Figure 3 from 5 instances to 3 instances).
10) On page 8, equation 1 claims that EMi has 15 effort multipliers (which is true for COCOMO I). Since some of your data is COCOMO II data, you might want to mention that the latter version has 17 effort multipliers.
11) Equation 3 raises an interesting question about feature reduction. Since many of the terms contain "Size" (in this case loc), won't it be less likely that "Size" would be removed?
12) The paper argues for using the Wrapper technique for producing better effort estimation predictions. However, if I am a project manager, how far do I reduce? Figure 5 does not shed any light on this, since "best results" are achieved anywhere from FS01 (t02) all the way through FS07 (p04). The paper does not plot the lines (in Figure 5) all the way to FS07 for all the lines. Thus, as a project manager, I do not know the consequences of extending the reduction out to FS07. Relating this observation back to reason number 2 on page 3 (the "less is NOT more" section), a user will be unable to trust this approach since he/she will be unable to tell when to stop reducing.
13) It is noted that the plots (bottom graph of Figure 5) are not monotonic. This inconsistency raises questions about how far to reduce.
Perhaps the authors may wish to include a confidence factor. That is, a reduction to FS01 will improve effort prediction X percent of the time.
14) The paper talks about performing 30 hold-out experiments. What was the distribution of training to test samples? Page 11 of the paper implies a 2/3, 1/3 ratio. Is this correct? If so, then for data sets 5 through 10 there are 10 or fewer samples in the training set and 5 or fewer samples in the test set. This argues for the consolidation of data sets.
15) Also, if an experiment is run 30 times per data set per feature reduction, then why not run a t-test on data set X for feature levels N and N+1? It would then be possible to claim that level N+1 produces statistically superior results to level N (for data set X). (A sketch of such a test follows this review.)
16) In section 5, Related Work, the authors refer to Kirsopp & Shepperd as "K&S" more than once. This is rather informal and probably not suitable for a journal article.
17) The authors converted the answers using natural logs. Were the answers converted back prior to measuring with pred(30)? If not, all the results are greatly distorted. (A sketch illustrating the back-conversion also follows this review.)

C. Presentation
1. Are the title, abstract, and keywords appropriate? Please elaborate in the Confidential Comments section.
(X) Yes ( ) No
2. Does the manuscript contain title, abstract, and/or keywords?
(X) Yes ( ) No
3. Does the manuscript contain sufficient and appropriate references? Please elaborate in the Confidential Comments section.
(X) References are sufficient and appropriate ( ) Important references are missing; more references are needed ( ) Number of references are excessive
4. Does the introduction state the objectives of the manuscript in terms that encourage the reader to read on? Please explain your answer in the Confidential Comments section.
( ) Yes (X) Could be improved ( ) No
5. How would you rate the organization of the manuscript? Is it focused? Is the length appropriate for the topic? Please elaborate in the Confidential Comments section.
(X) Satisfactory ( ) Could be improved ( ) Poor
6. Is the manuscript focused? Please elaborate in the Confidential Comments section.
(X) Satisfactory ( ) Could be improved ( ) Poor
7. Is the length of the manuscript appropriate for the topic? Please elaborate in the Confidential Comments section.
(X) Satisfactory ( ) Could be improved ( ) Poor
8. Please rate and comment on the readability of this manuscript in the Confidential Comments section.
(X) Easy to read ( ) Readable - but requires some effort to understand ( ) Difficult to read and understand ( ) Unreadable

Section II. Summary and Recommendation

A. Evaluation
Please rate the manuscript. Explain your choice in the Confidential Comments section.
( ) Award Quality ( ) Excellent (X) Good ( ) Fair ( ) Poor

B. Recommendation
Please make your recommendation and explain your decision in the Detailed Comments section.
( ) Accept with no changes (X) Accept if certain minor revisions are made ( ) Author should prepare a major revision ( ) Reject

Section III. Detailed Comments

A. Public Comments (these will be made available to the author)
Title: "Software Cost Models: When Less is More"
Authors: Chen, Menzies, Port, and Boehm
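To make the arithmetic in Reviewer 1's summary above (Section I.B.1) concrete, the following minimal Python sketch shows relative versus absolute improvement, the "After/Before" ratio used in the consolidation table, and a size-weighted mean for merged data sets. Apart from the reviewer's own examples (10% -> 15%, and t03's 28% -> 62% pred(30) on 3 test tuples), every number and name below is a placeholder, not a value taken from the paper.

# Sketch of Reviewer 1's recommendation: report gains in both relative and
# absolute terms, and aggregate pred(30) scores with a weighted (not simple) mean.

def relative_gain(before, after):
    """Relative improvement, e.g. (15 - 10) / 10 = 50%."""
    return 100.0 * (after - before) / before

def absolute_gain(before, after):
    """Absolute improvement in pred(30) percentage points, e.g. 15 - 10 = 5."""
    return after - before

def after_before_ratio(before, after):
    """The "After/Before" ratio used in the table above, e.g. 62 / 28 ~= 221%."""
    return 100.0 * after / before

def weighted_mean(scores, sizes):
    """Aggregate per-data-set pred(30) scores, weighting each by its project count."""
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)

print(relative_gain(10, 15), absolute_gain(10, 15))       # 50.0 and 5
print(after_before_ratio(28, 62), absolute_gain(28, 62))  # ~221 and 34 points (about 1 of 3 test tuples)
print(weighted_mean([62.0, 50.0, 40.0], [24, 44, 56]))    # hypothetical consolidated score

The same weighted_mean call is what the reviewer's point 4 above asks for in place of the simple average in Figure 1's "mean" row.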
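Points 13-15 of the weaknesses listed above (the confidence factor, the 30 hold-out runs with a 2/3-1/3 split, and the t-test between adjacent feature-reduction levels) can also be sketched. The code below is illustrative only: the scoring functions for each feature-set level are stand-ins rather than the authors' experimental setup, and it assumes SciPy for the paired t-test.

# Hypothetical sketch: repeat a 2/3-1/3 hold-out split 30 times, score pred(30)
# at each feature-reduction level on the same splits, then compare adjacent
# levels with a paired t-test and a "confidence factor" (fraction of runs won).

import random
from scipy import stats

def holdout_split(projects, train_frac=2/3, rng=random):
    rows = projects[:]
    rng.shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]                      # training set, test set

def run_experiments(projects, scorers, runs=30):
    """scorers maps a level name (e.g. 'FS01') to a function
    score(train, test) -> pred(30).  Each run uses one split scored at every
    level, so the resulting score lists are paired across levels."""
    results = {name: [] for name in scorers}
    for _ in range(runs):
        train, test = holdout_split(projects)
        for name, score in scorers.items():
            results[name].append(score(train, test))
    return results

def compare_levels(scores_n, scores_n_plus_1):
    t, p = stats.ttest_rel(scores_n_plus_1, scores_n)  # paired t-test over the 30 runs
    wins = sum(b > a for a, b in zip(scores_n, scores_n_plus_1))
    return t, p, 100.0 * wins / len(scores_n)          # e.g. "FS01 wins 80% of the time"

A small p-value would support the claim in point 15 that level N+1 is statistically superior to level N on that data set; the win percentage is one way to state the confidence factor suggested in point 13.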
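Point 17 of the weaknesses above is easy to check numerically. The sketch below uses invented effort values; it only illustrates why predictions made in natural-log space must be converted back with exp() before computing pred(30), since "within 30%" in log space is a far looser test.

# Hypothetical illustration of the log back-conversion issue in point 17.
import math

def pred30(actuals, estimates):
    """Percentage of estimates within 30% of the actual effort."""
    hits = sum(abs(e - a) <= 0.3 * a for a, e in zip(actuals, estimates))
    return 100.0 * hits / len(actuals)

actual_effort = [100.0, 400.0, 1200.0]                        # person-months (invented)
log_estimates = [math.log(x) for x in (170.0, 430.0, 700.0)]  # model output in ln space (invented)

distorted = pred30([math.log(a) for a in actual_effort], log_estimates)  # scored in log space
honest    = pred30(actual_effort, [math.exp(e) for e in log_estimates])  # converted back first

print(distorted, honest)   # 100.0 versus 33.3... on these made-up numbers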
Reviewer 2

Section I. Overview

A. Reader Interest
1. How relevant is this manuscript to the readers of this periodical? Please explain your rating in the Confidential Comments section.
(X) Very Relevant ( ) Relevant ( ) Interesting - but not very relevant ( ) Irrelevant

B. Content
1. Please summarize what you view as the key point(s) of the manuscript and the importance of the content to the readers of this periodical.

An example of using a software tool to eliminate variables in order to improve the predictive capabilities of the data set.

2. Is the manuscript technically sound? Please explain your answer in the Confidential Comments section.
( ) Yes (X) Appears to be - but didn't check completely ( ) Partially ( ) No

3. What do you see as this manuscript's contribution to the literature in this field?

Rather than continually collecting more data, it says you can sometimes stop and eliminate unnecessary data.

4. What do you see as the strongest aspect of this manuscript?

The use of an algorithm to evaluate the effectiveness of data collection.

5. What do you see as the weakest aspect of this manuscript?

Section 4, on the case study. It is too dense, and too much is missing. I have no idea how to interpret Figure 5, which is the main result. I am not sure what the variable listed in Figure 4 means. In order to fit within the IEEE Software length guidelines, too much of the explanation was removed from the paper. The supporting paper provided with the manuscript is not relevant: if I need that paper to understand the current paper, then there is no need for the current paper. Section 4 needs to be rewritten and made more understandable as to what is going on.

C. Presentation
1. Are the title, abstract, and keywords appropriate? Please elaborate in the Confidential Comments section.
(X) Yes ( ) No
2. Does the manuscript contain title, abstract, and/or keywords?
(X) Yes ( ) No
3. Does the manuscript contain sufficient and appropriate references? Please elaborate in the Confidential Comments section.
(X) References are sufficient and appropriate ( ) Important references are missing; more references are needed ( ) Number of references are excessive
4. Does the introduction state the objectives of the manuscript in terms that encourage the reader to read on? Please explain your answer in the Confidential Comments section.
(X) Yes ( ) Could be improved ( ) No
5. How would you rate the organization of the manuscript? Is it focused? Is the length appropriate for the topic? Please elaborate in the Confidential Comments section.
(X) Satisfactory ( ) Could be improved ( ) Poor
6. Is the manuscript focused? Please elaborate in the Confidential Comments section.
(X) Satisfactory ( ) Could be improved ( ) Poor
7. Is the length of the manuscript appropriate for the topic? Please elaborate in the Confidential Comments section.
( ) Satisfactory (X) Could be improved ( ) Poor
8. Please rate and comment on the readability of this manuscript in the Confidential Comments section.
( ) Easy to read ( ) Readable - but requires some effort to understand (X) Difficult to read and understand ( ) Unreadable

Section II. Summary and Recommendation

A. Evaluation
Please rate the manuscript. Explain your choice in the Confidential Comments section.
( ) Award Quality ( ) Excellent (X) Good ( ) Fair ( ) Poor

B. Recommendation
Please make your recommendation and explain your decision in the Detailed Comments section.
( ) Accept with no changes ( ) Accept if certain minor revisions are made (X) Author should prepare a major revision ( ) Reject

Section III. Detailed Comments

A. Public Comments (these will be made available to the author)

Section 4 - the central theme of the paper - is unreadable. The topic is important, but this version is not readable to the general IEEE Software reader.

Reviewer 3

Section I. Overview

A. Reader Interest
1. How relevant is this manuscript to the readers of this periodical? Please explain your rating in the Confidential Comments section.
( ) Very Relevant ( ) Relevant (X) Interesting - but not very relevant ( ) Irrelevant

B. Content
1. Please summarize what you view as the key point(s) of the manuscript and the importance of the content to the readers of this periodical.

When using the general(ized) cost estimation model COCOMO, the number of its parameters could be reduced to a smaller set of significant parameters by learning from historical data for sets of projects.

2. Is the manuscript technically sound? Please explain your answer in the Confidential Comments section.
( ) Yes (X) Appears to be - but didn't check completely ( ) Partially ( ) No

3. What do you see as this manuscript's contribution to the literature in this field?

It suggests that for COCOMO to make accurate estimates, a reduced number of parameters might be better than using the entire set, given that there is historical data for machine learners to determine which parameters can be eliminated.

4. What do you see as the strongest aspect of this manuscript?

The idea that, under given circumstances, using fewer parameters for COCOMO not only leads to the same results but can also improve the accuracy of its estimates.

5. What do you see as the weakest aspect of this manuscript?

It does not target the audience of the "Software" magazine.

C. Presentation
1. Are the title, abstract, and keywords appropriate? Please elaborate in the Confidential Comments section.
( ) Yes (X) No
2. Does the manuscript contain title, abstract, and/or keywords?
(X) Yes ( ) No
3. Does the manuscript contain sufficient and appropriate references? Please elaborate in the Confidential Comments section.
(X) References are sufficient and appropriate ( ) Important references are missing; more references are needed ( ) Number of references are excessive
4. Does the introduction state the objectives of the manuscript in terms that encourage the reader to read on? Please explain your answer in the Confidential Comments section.
(X) Yes ( ) Could be improved ( ) No
5. How would you rate the organization of the manuscript? Is it focused? Is the length appropriate for the topic? Please elaborate in the Confidential Comments section.
( ) Satisfactory (X) Could be improved ( ) Poor
6. Is the manuscript focused? Please elaborate in the Confidential Comments section.
( ) Satisfactory (X) Could be improved ( ) Poor
7. Is the length of the manuscript appropriate for the topic? Please elaborate in the Confidential Comments section.
( ) Satisfactory (X) Could be improved ( ) Poor
8. Please rate and comment on the readability of this manuscript in the Confidential Comments section.
( ) Easy to read ( ) Readable - but requires some effort to understand (X) Difficult to read and understand ( ) Unreadable

Section II. Summary and Recommendation

A. Evaluation
Please rate the manuscript. Explain your choice in the Confidential Comments section.
( ) Award Quality ( ) Excellent ( ) Good (X) Fair ( ) Poor

B. Recommendation
Please make your recommendation and explain your decision in the Detailed Comments section.
( ) Accept with no changes ( ) Accept if certain minor revisions are made (X) Author should prepare a major revision ( ) Reject

Section III. Detailed Comments

A. Public Comments (these will be made available to the author)

Section I.C.1: The title implies that more than one cost estimation model was used in this experiment. The paper refers only to COCOMO. Reconcile this issue.

The paper has potential, but right now it reads as if it was written for data mining specialists. It should be rewritten for the readers of the "Software" magazine. That means that fewer details should be given about the machine learners and more about why this work is important for a user of the COCOMO model. How would a software project manager use COCOMO any differently due to your proposed approach, and what does that buy him/her? What does he/she need to do or have, e.g., what type of historical data about projects? Do the projects have to be related to the one at hand? How closely related? Try to present this work from the point of view of the project manager - more of a black-box description than a white-box one: how one would use the tool, rather than the internals of the tool itself (a sketch of such a black-box view appears after these comments).

In addition, to improve readability:
- correct typing errors
- place figures right after their first reference in the text
- distinguish between figures and tables
- if tables and figures help understanding, then use them, but explain them better
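As one way of meeting the request above for a black-box, project-manager view, the revision could include something like the following sketch. The COCOMO-81-style form effort = a * KLOC^b * product(EM_i) is standard, but the calibration constants, the particular drivers that survive feature reduction, and the example ratings are invented here for illustration; they are not taken from the paper.

# Hypothetical "black box" usage: the manager supplies only the cost-driver
# ratings that survived feature reduction on the organization's historical
# data and treats every discarded driver as nominal (multiplier 1.0).
from math import prod

A, B = 2.8, 1.05                          # placeholder calibration constants

def estimate_effort(kloc, multipliers):
    """Effort in person-months from size and the retained effort multipliers."""
    return A * kloc ** B * prod(multipliers.values())

retained = {"cplx": 1.15, "acap": 0.86, "time": 1.11}   # invented surviving drivers and ratings
print(round(estimate_effort(50, retained), 1))          # estimate for a hypothetical 50 KLOC project

Presented this way, the reader sees what historical data would be needed (past projects with effort, size, and driver ratings), what the tool produces (a short list of retained drivers plus calibration constants), and how little the day-to-day estimation step changes.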