\begin{verbatim}
Editor Comments to the Author:

Thank you for your re-submitted paper for TSE. I am delighted to report
that all three referees recommend acceptance, and two of these require
no further changes. However, you will notice that one of the referees
does indicate some changes. This referee's particular concerns are
listed under the heading "Major Comments". Following a discussion with
this referee concerning the balance between major and minor revisions
requested in the review, it was agreed that the overall recommendation
from this referee was "minor revisions" (as opposed to something more
serious).
\end{verbatim}

\begin{verbatim}
While I feel confident as an associate editor in determining whether the
more minor comments from this review have been addressed, I may need to
pass the paper back to this referee to check that they are satisfied
with the treatment and response in relation to the comments and requests
listed under "Major Comments". I think that it is right that I should do
this, but also clearly very important that I alert you to the
possibility that the manuscript may go back to this referee for comment,
and that you should therefore be careful to address these concerns and
to write a response that explains the manner in which you have done so.
\end{verbatim}

\begin{verbatim}
I realise that this process has been a rather long one, but I hope that
you will be able to address these comments and that, as a result, I will
then be in a position to recommend final acceptance of this paper. I
believe that the paper makes a valuable contribution to work on
search-based software engineering and creates important links between
SBSE/SBST and other approaches to test data generation (such as DART).
\end{verbatim}

\begin{verbatim}
I would like to add one additional request: that you consider whether or
not it might be appropriate to add some mention of search-based software
engineering in the related work section. It seemed rather odd to me that
the related work section had no mention of the fact that there is a
large body of work now accruing on the application of SBSE to testing. I
would not want to abuse my position as AE in any way, and so I should
make it clear that I do not make this a requirement of acceptance, but I
would like to raise it as a point for you to consider.
\end{verbatim}

\begin{verbatim}
Once again, thank you very much for your contribution and your interest
in TSE.
\end{verbatim}

\begin{verbatim}
Reviewers' Comments
\end{verbatim}

\begin{verbatim}
Reviewer: 1
\end{verbatim}

\begin{verbatim}
Recommendation: Author Should Prepare A Minor Revision
\end{verbatim}

\begin{verbatim}
Comments: In general, I like this paper, but there is still some
(possibly minor) work to do. The section below called 'Major Comments'
identifies my major concerns.
\end{verbatim}

\begin{verbatim}
Response to Previous Comments
------------------------------------
\end{verbatim}

\begin{verbatim}
The authors have largely addressed the comments I had on the previous
revision. Some commentary on this:
\end{verbatim}

\begin{verbatim}
(i) The 'spikes' of Fig. 1: I don't accept that the spikes are a
*correct* consequence of using integers; instead, they are an
*incorrect* artefact of attempting to join discrete integer points by
lines in the plot. However, this is a minor point, and attempting to
remove the lines could result in less understandable figures.
As it stands, the important argument - that reformulating the problems
in terms of a probabilistic range of values gives a guiding gradient to
some part of the search space - is still made, and so no further changes
need be made to this figure.
\end{verbatim}

\begin{verbatim}
(ii) Serious consideration should still be given to how to improve
figures 10, 12 and 15. The lines for many classes are indistinguishable,
in which case why label the lines with a legend at all? Simply giving
more space to the figures may help. (See the comments below about
simplifying the text of section IV, which may save space that could be
used for this purpose.)
\end{verbatim}

\begin{verbatim}
(iii) With respect to the empirical work of (now) section IV-D, I don't
think that having 16 SUTs avoids the argument that basing the analysis
on a single run for each SUT/number-of-eliminated-gene-types combination
lacks robustness. However, this is not a major point.
\end{verbatim}

\begin{verbatim}
Major Comments
-------------------
\end{verbatim}

\begin{verbatim}
Nevertheless, there are still some major issues to be addressed:
\end{verbatim}

\begin{verbatim}
(1) Sections I, II and III read very well and describe the Nighthawk
system at an appropriate level of detail. However, section IV (on FSS)
is significantly less easy to understand. It is arguably too long: the
key arguments could have been made with less text and by omitting some
of the experimental detail. For example, while the identification of the
change to the initialisation of numberOfCalls is an important outcome,
is it necessary to describe the results for java.util in detail prior to
this change? (Also, the Apache Commons Collections results - Fig 15 -
are described after this change; why is this not done for java.util as
well, in place of Fig 12?) Similarly, the rejection of bestMerit and
bestRank in favour of avgMerit and avgRank could be summarised briefly.
\end{verbatim}

\begin{verbatim}
(2) The arguments in favour of using FSS are somewhat confused, possibly
by using too many different metrics/experiments. While the result that
using 1 rather than 10 gene types results in only a marginal decrease in
coverage / AUC (Fig 10/Fig 11) is meaningful, the argument that, when
using only 1 gene type, 90% can be achieved in 10% of the time of the
`full' Nighthawk is less convincing. The actual metric for time isn't
stated clearly, but if we assume that the number of generations is a
broadly similar measure, then doesn't Figure 10 also suggest that a
similar result occurs when using the `full' Nighthawk with all 10 gene
types, i.e. that it reaches about 90% of its final value after about 10%
of the generations? In other words, the ability to achieve 90% of the
best coverage in 10% of the time isn't clearly shown to be a property of
using FSS, but rather a property of the algorithm as a whole (which is
retained when applying FSS). In which case, what is the key advantage of
using FSS in this context?
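To make the comparison I have in mind concrete, here is a minimal sketch
of the check I would apply to each configuration's coverage curve. The
coverage values and names are purely illustrative, not Nighthawk's
actual output or API:

  // Given per-generation coverage from a single run, check how early
  // the run reaches 90% of its final coverage, and compute a
  // normalised AUC. All data here is hypothetical.
  public class CoverageCurveCheck {

      // Fraction of generations needed to reach the given fraction
      // of the final coverage value.
      static double fractionOfGenerationsTo(double[] coverage,
                                            double fractionOfFinal) {
          double target = fractionOfFinal * coverage[coverage.length - 1];
          for (int g = 0; g < coverage.length; g++) {
              if (coverage[g] >= target) {
                  return (g + 1) / (double) coverage.length;
              }
          }
          return 1.0;
      }

      // Area under the coverage curve, normalised so that a run which
      // starts at its final coverage level scores 1.0.
      static double normalisedAuc(double[] coverage) {
          double sum = 0.0;
          for (double c : coverage) {
              sum += c;
          }
          return sum / (coverage.length * coverage[coverage.length - 1]);
      }

      public static void main(String[] args) {
          // A hypothetical curve with fast early growth and a plateau:
          // even a `full' configuration may reach 90% of its final
          // coverage within the first 10% of the generations.
          double[] coverage = {81, 84, 86, 87, 88, 88.5, 89, 89.4,
                               89.8, 90};
          System.out.printf(
              "90%% of final coverage after %.0f%% of generations%n",
              100 * fractionOfGenerationsTo(coverage, 0.9));
          System.out.printf("Normalised AUC: %.2f%n",
              normalisedAuc(coverage));
      }
  }

If the same check gives similar numbers for the 10-gene-type and
1-gene-type configurations, then the "90% in 10% of the time" result is
a property of the algorithm, not of FSS.
\end{verbatim}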
\begin{verbatim}
(3) The results suggested that numberOfCalls was initialised
inappropriately, and that once this was changed, it was no longer an
important gene type. What evidence is there that this is not also true
of the gene types now considered most significant in Nighthawk 1.1? In
other words, is FSS just identifying those gene types whose initial
values are set inappropriately, rather than those that are genuinely
important throughout the search? (I don't think this necessarily needs
to be answered via experimentation, but it should be acknowledged.)
\end{verbatim}

\begin{verbatim}
(4) Section V is titled 'Properties of the Optimized System', following
on from the previous section IV, titled 'Analysis and Optimization of
the GA', which applied FSS. To me, this suggests that FSS was applied to
the algorithm for the results in section V, but this does not appear to
be the case (and if it was, it is not clearly stated). The only
reference to FSS is for the Apache Commons Collections, where *only* 2
gene types were omitted, and no reference to applying FSS is made for
java.util. If FSS was not applied more aggressively in these concluding
experiments, then why consider FSS in the first place?
\end{verbatim}

\begin{verbatim}
Minor Comments
-------------------
\end{verbatim}

\begin{verbatim}
(a) Keywords: keyword B.7.3.e (Test generation) refers to *hardware*,
not *software*, testing.
\end{verbatim}

\begin{verbatim}
(b) p3, line 24: "In previous research, value reuse has mostly taken the
form of making a sequence of method calls all on the same receiver
object". The *implication* is that this work considers different
receiver objects, in which case this could be stated for clarity.
\end{verbatim}

\begin{verbatim}
(c) p5, line 15: "... and and the question ..."
\end{verbatim}

\begin{verbatim}
(d) p15, line 13: "(the additional coverage .... is probably > 0)" - I
am not sure of the justification for, or the relevance of, this line
(since it is clear which SUT is being used - the triangle program) -
suggest omitting it.
\end{verbatim}

\begin{verbatim}
(e) p17, line 53: "further research work ... such as simulated
annealing" - arguably, wouldn't the SA also be using "useless genes",
like the GA? No change suggested; I am just making the point that a
comparison could still be made without applying FSS.
\end{verbatim}

\begin{verbatim}
(f) p19, line 24: "the fewer are the demands on interfaces to the
external environment" - it is unclear what this means in this context.
\end{verbatim}

\begin{verbatim}
(g) Section IV.B might be easier to understand if a brief explanation
were given of what the groups and instances of RELIEF map to in the case
of Nighthawk.
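For concreteness, the core loop of RELIEF is sketched below. The mapping
assumed in the comments - an instance being the gene values of one
Nighthawk run, and the two groups being the high- and low-coverage runs
- is my guess at what section IV.B intends, not the paper's stated
design:

  import java.util.Random;

  // Core RELIEF loop (after Kira & Rendell), numeric features only.
  // Assumed mapping (a guess, not the paper's stated design):
  //   instance = the gene values of one Nighthawk run,
  //   group    = whether that run achieved high coverage.
  public class ReliefSketch {

      static double[] relief(double[][] instances,
                             boolean[] highCoverage, int samples) {
          int nFeatures = instances[0].length;
          double[] weights = new double[nFeatures];
          Random random = new Random(0);
          for (int s = 0; s < samples; s++) {
              int i = random.nextInt(instances.length);
              int nearHit = nearest(instances, highCoverage, i, true);
              int nearMiss = nearest(instances, highCoverage, i, false);
              for (int f = 0; f < nFeatures; f++) {
                  double diffHit =
                      Math.abs(instances[i][f] - instances[nearHit][f]);
                  double diffMiss =
                      Math.abs(instances[i][f] - instances[nearMiss][f]);
                  // A feature gains weight if it separates the groups
                  // and loses weight if it varies within a group.
                  weights[f] += (diffMiss - diffHit) / samples;
              }
          }
          return weights;
      }

      // Index of the nearest other instance in the same group
      // (sameGroup = true) or in the other group (sameGroup = false).
      static int nearest(double[][] instances, boolean[] group,
                         int i, boolean sameGroup) {
          int best = -1;
          double bestDist = Double.MAX_VALUE;
          for (int j = 0; j < instances.length; j++) {
              if (j == i || (group[j] == group[i]) != sameGroup) {
                  continue;
              }
              double dist = 0.0;
              for (int f = 0; f < instances[i].length; f++) {
                  dist += Math.abs(instances[i][f] - instances[j][f]);
              }
              if (dist < bestDist) {
                  bestDist = dist;
                  best = j;
              }
          }
          return best;
      }

      public static void main(String[] args) {
          // Toy data: feature 0 separates the groups; feature 1 does
          // not, so only feature 0 should end up with a large weight.
          double[][] runs = {{1, 50}, {2, 52}, {9, 51}, {8, 49}};
          boolean[] high = {true, true, false, false};
          double[] w = relief(runs, high, 20);
          System.out.printf("weights: %.2f %.2f%n", w[0], w[1]);
      }
  }

A sentence stating the Nighthawk analogue of this sketch's "instance"
and "group" is all I am asking for.
\end{verbatim}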
\begin{verbatim}
(h) p20, line 39: "For the two BitSet gene types, we printed only the
cardinality of the set." Why?
\end{verbatim}

\begin{verbatim}
(i) p21, line 20: If I understand correctly, "gene type" here does NOT
refer to class/primitive types, but instead to the 10 categories of
genes listed in Fig 6. In which case, "categories" or similar might have
been a better choice of terminology to avoid confusion with
class/primitive types (but a wholesale change of terminology is not
required at this point). Similarly, the notation t has previously been
used to refer to class/primitive types.
\end{verbatim}

\begin{verbatim}
(j) p21, line 47: "version 1.0 of Nighthawk" - while the meaning of 1.0
is explained later, this is the first reference to the versions, and
ideally the relevance could be explained here.
\end{verbatim}

\begin{verbatim}
(k) p24, line 52: "does show a statistically significant difference ..."
- suggest adding "at the 5% level".
\end{verbatim}

\begin{verbatim}
(l) p24, line 55: What metric is used for "time taken", and why doesn't
AUC adequately cover both coverage and time taken (using generations as
the measure of time taken)? In other words, does fixing gene types to
constant values significantly change the runtime? (If so, it may be
worth giving some illustration of the magnitude of such changes.)
\end{verbatim}

\begin{verbatim}
(m) p29, line 34: This paragraph doesn't really make sense (what is
"rapid turnover"?), and it may make overly strong claims based on
limited results.
\end{verbatim}

\begin{verbatim}
(n) p30, line 39: "Nighthawk v1.0" - a small point, but this is
inconsistent with section IV, where "v" is not used for the version.
\end{verbatim}

\begin{verbatim}
(o) p30, line 50: The statistical analysis is now appropriate and to be
commended (although, having applied a Wilcoxon test, a SW test and a
t-test are arguably unnecessary).
\end{verbatim}

\begin{verbatim}
Reviewer: 3
\end{verbatim}

\begin{verbatim}
Recommendation: Accept With No Changes
\end{verbatim}

\begin{verbatim}
Comments: The authors describe a hybrid system which uses a Genetic
Algorithm (GA) to optimize settings for a random unit testing tool (for
Java). To improve the performance of the GA, the authors propose to use
Feature Subset Selection (FSS) to focus the GA only on the most useful
genes.
\end{verbatim}

\begin{verbatim}
Overall, the manuscript is very well written and easy to follow. The
idea of using FSS to reduce the size of the search space is indeed very
interesting, and it would be worthwhile to see whether it can be applied
to GAs used to generate test data directly.
\end{verbatim}

\begin{verbatim}
I only have a few minor comments. The authors state that, in their
original work, FSS was only able to eliminate 60% of the gene types,
while in this submission they can eliminate 90%. If I am not mistaken,
this equates to eliminating 9 out of the 10 gene types used, thus
focusing on only one feature: the number of method calls to perform. In
a way, one can guess that this feature is important, since the authors
state on page 6 that they are generating long sequences of method calls.
(Whether long sequences are really preferable to shorter ones achieving
the same coverage is a matter for debate.) My concerns about the new FSS
analysis are the following. The authors state that the reason why the
original Nighthawk underperformed was a "bad" initial choice of value
for this feature (numberOfCalls). They then go on to argue that, once a
better initial value has been picked, the GA optimization does not
statistically significantly affect either the coverage or the efficiency
of the Nighthawk system (page 30). So why do the authors claim (on page
26) that changing the initial value of "numberOfCalls" results in a
"very different" behaviour of the Nighthawk system, resulting in two
different versions? The second concern I have is that, if 90% of the
features can be eliminated, does Nighthawk still need a GA at all? What
would the level of coverage for Nighthawk be if no GA were used at all
and Nighthawk were run with the better setting of 50 for
"numberOfCalls"?
\end{verbatim}

\begin{verbatim}
Other minor comments:

Page 10: It's not the case that previous GA-based approaches necessarily
result in fixed-size test suites. For example, the work by Wegener et
al. (IST, 43(14): 841-854, 2001) or McGraw et al. (TSE, 27(12):
1085-1110, 2001) takes "serendipitous" coverage into account. This means
that branches which are covered "by accident" as the GA tries to
optimize a target branch are removed from the list of targets, and the
GA does not attempt to cover them again. Since the GA is a stochastic
algorithm, it may end up generating different-sized test suites on
different runs.
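For illustration, the bookkeeping is roughly the following (a minimal
sketch; the names are my own invention, not Wegener et al.'s actual
code):

  import java.util.ArrayList;
  import java.util.LinkedHashSet;
  import java.util.List;
  import java.util.Set;

  // Sketch of "serendipitous" coverage bookkeeping: any branch covered
  // while the GA targets another branch is removed from the remaining
  // targets, so the resulting test suite size can vary between runs.
  public class SerendipitousCoverage {

      public static void main(String[] args) {
          Set<String> targets =
              new LinkedHashSet<>(List.of("b1", "b2", "b3", "b4"));
          List<String> suite = new ArrayList<>();
          while (!targets.isEmpty()) {
              String target = targets.iterator().next();
              // Run the GA aimed at 'target'; it reports every branch
              // the winning test happened to execute (stubbed here).
              Set<String> covered = runGaFor(target);
              suite.add("test for " + target);
              // Serendipitous step: drop all branches hit along the
              // way, so the GA never targets them again.
              targets.removeAll(covered);
          }
          // Prints 3, not 4: b2 was covered by accident.
          System.out.println("suite size: " + suite.size());
      }

      // Stub search: targeting b1 also covers b2 by accident.
      static Set<String> runGaFor(String target) {
          return target.equals("b1") ? Set.of("b1", "b2")
                                     : Set.of(target);
      }
  }
\end{verbatim}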
\begin{verbatim}
Page 23: At the bottom: "The thick black curves on those figures..."
-> There seems to be only one figure with a thick black curve?
\end{verbatim}