\begin{verbatim}
Editor Comments to the Author:

Thank you for your re-submitted paper for TSE. I am delighted to report
that all three referees recommend acceptance, and two of these require
no further changes. However, you will notice that one of the referees
does indicate some changes. This referee's particular concerns are
listed under the heading "Major Comments". Following a discussion with
this referee concerning the balance between major and minor revisions
requested in the review, it was agreed that the overall recommendation
from this referee was "minor revisions" (as opposed to something more
serious).
\end{verbatim}

\begin{verbatim}
While I feel confident as an associate editor in determining whether the
more minor comments from this review have been addressed, I may need to
pass the paper back to this referee to check that they are satisfied
with the treatment and response in relation to the comments and requests
listed under "Major Comments". I think that it is right that I should do
this, but also clearly very important that I alert you to the
possibility that the manuscript may go back to this referee for comment,
and that you should therefore be careful to address these concerns and
to write a response that explains the manner in which you have done so.
\end{verbatim}

\begin{verbatim}
I realise that this process has been a rather long one, but I hope that
you will be able to address these comments and that, as a result, I will
then be in a position to recommend final acceptance of this paper. I
believe that the paper makes a valuable contribution to work on
search-based software engineering and creates important links between
SBSE/SBST and other approaches to test data generation (such as DART).
\end{verbatim}

\begin{verbatim}
I would like to add one additional request: that you consider whether or
not it might be appropriate to add some mention of search-based software
engineering in the related work section. It seemed rather odd to me that
the related work section had no mention of the fact that there is a
large body of work now accruing on the application of SBSE to testing. I
would not want to abuse my position as AE in any way, and so I should
make it clear that I do not make this a requirement of acceptance, but I
would like to raise it as a point for you to consider.
\end{verbatim}

\begin{verbatim}
Once again, thank you very much for your contribution and your interest
in TSE.
\end{verbatim}

\begin{verbatim}
Reviewers' Comments
\end{verbatim}

\begin{verbatim}
Reviewer: 1
\end{verbatim}

\begin{verbatim}
Recommendation: Author Should Prepare A Minor Revision
\end{verbatim}

\begin{verbatim}
Comments: In general, I like this paper, but there is still some
(possibly minor) work to do. The section below called 'Major Comments'
identifies my major concerns.
\end{verbatim}

\begin{verbatim}
Response to Previous Comments
------------------------------------
\end{verbatim}

\begin{verbatim}
The authors have largely addressed the comments I had on the previous
revision. Some commentary on this:
\end{verbatim}

\begin{verbatim}
(i) The 'spikes' of Fig. 1: I don't accept that the spikes are a
*correct* consequence of using integers; instead, they are an
*incorrect* artefact of attempting to join discrete integer points by
lines in the plot. However, this is a minor point, and attempting to
remove the lines could result in less understandable figures.
As it stands, the important argument - that reformulating the problems
in terms of a probabilistic range of values gives a guiding gradient to
some part of the search space - is still made, and so no further changes
need be made to this figure.
\end{verbatim}

\begin{verbatim}
(ii) Serious consideration should still be given to how to improve
figures 10, 12 and 15. The lines for many classes are indistinguishable,
in which case why label the lines with a legend at all? Simply giving
more space to the figures may help. (See the comments below about
simplifying the text of section IV, which may save space that could be
used for this purpose.)
\end{verbatim}

\begin{verbatim}
(iii) With respect to the empirical work of (now) section IV-D, I don't
think that having 16 SUTs avoids the argument that basing the analysis
on a single run for each SUT/number-of-eliminated-gene-types combination
lacks robustness. However, this is not a major point.
\end{verbatim}

\begin{verbatim}
Major Comments
-------------------
\end{verbatim}

\begin{verbatim}
Nevertheless, there are still some major issues to be addressed:
\end{verbatim}

\begin{verbatim}
(1) Sections I, II and III read very well and describe the Nighthawk
system at an appropriate level of detail. However, section IV (on FSS)
is significantly less easy to understand. It is arguably too long: the
key arguments could have been made with less text and by omitting some
of the experimental detail. For example, while the identification of the
change to the initialisation of numberOfCalls is an important outcome,
is it necessary to describe the results for java.util in detail prior to
this change? (Also, the Apache Commons Collections results - Fig 15 -
are described after this change; why is this not done for java.util as
well, in place of Fig 12?) Similarly, the rejection of bestMerit and
bestRank in favour of avgMerit and avgRank could be summarised briefly.
\end{verbatim}

\begin{verbatim}
(2) The arguments in favour of using FSS are somewhat confused, possibly
by using too many different metrics/experiments. While the result that
using 1 rather than 10 gene types results in only a marginal decrease in
coverage / AUC (Fig 10/Fig 11) is meaningful, the argument that, when
using only 1 gene type, 90% can be achieved in 10% of the time of the
`full' Nighthawk is less convincing. The actual metric for time isn't
stated clearly, but if we assume that the number of generations is a
broadly similar measure, then doesn't Figure 10 also suggest that a
similar result occurs when using the `full' Nighthawk with all 10 gene
types, i.e. that it reaches about 90% of its final value after about 10%
of the generations? In other words, the ability to achieve 90% of the
best coverage in 10% of the time isn't clearly shown to be a property of
using FSS, but rather a property of the algorithm as a whole (which is
retained when applying FSS). In which case, what is the key advantage of
using FSS in this context?
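To make the comparison I have in mind concrete, here is a minimal sketch
of the check I would apply to each configuration's coverage curve. The
coverage values and names are purely illustrative, not Nighthawk's
actual output or API:

  // Given per-generation coverage from a single run, check how early
  // the run reaches 90% of its final coverage, and compute a
  // normalised AUC. All data here is hypothetical.
  public class CoverageCurveCheck {

      // Fraction of generations needed to reach the given fraction
      // of the final coverage value.
      static double fractionOfGenerationsTo(double[] coverage,
                                            double fractionOfFinal) {
          double target = fractionOfFinal * coverage[coverage.length - 1];
          for (int g = 0; g < coverage.length; g++) {
              if (coverage[g] >= target) {
                  return (g + 1) / (double) coverage.length;
              }
          }
          return 1.0;
      }

      // Area under the coverage curve, normalised so that a run which
      // starts at its final coverage level scores 1.0.
      static double normalisedAuc(double[] coverage) {
          double sum = 0.0;
          for (double c : coverage) {
              sum += c;
          }
          return sum / (coverage.length * coverage[coverage.length - 1]);
      }

      public static void main(String[] args) {
          // A hypothetical curve with fast early growth and a plateau:
          // even a `full' configuration may reach 90% of its final
          // coverage within the first 10% of the generations.
          double[] coverage = {81, 84, 86, 87, 88, 88.5, 89, 89.4,
                               89.8, 90};
          System.out.printf(
              "90%% of final coverage after %.0f%% of generations%n",
              100 * fractionOfGenerationsTo(coverage, 0.9));
          System.out.printf("Normalised AUC: %.2f%n",
              normalisedAuc(coverage));
      }
  }

If the same check gives similar numbers for the 10-gene-type and
1-gene-type configurations, then the "90% in 10% of the time" result is
a property of the algorithm, not of FSS.
\end{verbatim}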
\begin{verbatim}
(3) The results suggested that numberOfCalls was initialised
inappropriately, and that once this was changed, it was no longer an
important gene type. What evidence is there that this is not also true
of the gene types now considered most significant in Nighthawk 1.1? In
other words, is FSS just identifying those gene types whose initial
values are set inappropriately, rather than those that are genuinely
important throughout the search? (I don't think this necessarily needs
to be answered via experimentation, but it should be acknowledged.)
\end{verbatim}

\begin{verbatim}
(4) Section V is titled 'Properties of the Optimized System', following
on from the previous section IV, titled 'Analysis and Optimization of
the GA', which applied FSS. To me, this suggests that FSS was applied to
the algorithm for the results in section V, but this does not appear to
be the case (and if it was, it is not clearly stated). The only
reference to FSS is for the Apache Commons Collections, where *only* 2
gene types were omitted, and no reference to applying FSS is made for
java.util. If FSS was not applied more aggressively in these concluding
experiments, then why consider FSS in the first place?
\end{verbatim}

\begin{verbatim}
Minor Comments
-------------------
\end{verbatim}

\begin{verbatim}
(a) Keywords: keyword B.7.3.e (Test generation) refers to *hardware*,
not *software*, testing.
\end{verbatim}

\begin{verbatim}
(b) p3, line 24: "In previous research, value reuse has mostly taken the
form of making a sequence of method calls all on the same receiver
object". The *implication* is that this work considers different
receiver objects, in which case this could be stated for clarity.
\end{verbatim}

\begin{verbatim}
(c) p5, line 15: "... and and the question ..."
\end{verbatim}

\begin{verbatim}
(d) p15, line 13: "(the additional coverage .... is probably > 0)" - I
am not sure of the justification for, or the relevance of, this line
(since it is clear which SUT is being used - the triangle program) -
suggest omitting it.
\end{verbatim}

\begin{verbatim}
(e) p17, line 53: "further research work ... such as simulated
annealing" - arguably, wouldn't the SA also be using "useless genes",
like the GA? No change suggested; I am just making the point that a
comparison could still be made without applying FSS.
\end{verbatim}

\begin{verbatim}
(f) p19, line 24: "the fewer are the demands on interfaces to the
external environment" - it is unclear what this means in this context.
\end{verbatim}

\begin{verbatim}
(g) Section IV.B might be easier to understand if a brief explanation
were given of what the groups and instances of RELIEF map to in the case
of Nighthawk.
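For concreteness, the core loop of RELIEF is sketched below. The mapping
assumed in the comments - an instance being the gene values of one
Nighthawk run, and the two groups being the high- and low-coverage runs
- is my guess at what section IV.B intends, not the paper's stated
design:

  import java.util.Random;

  // Core RELIEF loop (after Kira & Rendell), numeric features only.
  // Assumed mapping (a guess, not the paper's stated design):
  //   instance = the gene values of one Nighthawk run,
  //   group    = whether that run achieved high coverage.
  public class ReliefSketch {

      static double[] relief(double[][] instances,
                             boolean[] highCoverage, int samples) {
          int nFeatures = instances[0].length;
          double[] weights = new double[nFeatures];
          Random random = new Random(0);
          for (int s = 0; s < samples; s++) {
              int i = random.nextInt(instances.length);
              int nearHit = nearest(instances, highCoverage, i, true);
              int nearMiss = nearest(instances, highCoverage, i, false);
              for (int f = 0; f < nFeatures; f++) {
                  double diffHit =
                      Math.abs(instances[i][f] - instances[nearHit][f]);
                  double diffMiss =
                      Math.abs(instances[i][f] - instances[nearMiss][f]);
                  // A feature gains weight if it separates the groups
                  // and loses weight if it varies within a group.
                  weights[f] += (diffMiss - diffHit) / samples;
              }
          }
          return weights;
      }

      // Index of the nearest other instance in the same group
      // (sameGroup = true) or in the other group (sameGroup = false).
      static int nearest(double[][] instances, boolean[] group,
                         int i, boolean sameGroup) {
          int best = -1;
          double bestDist = Double.MAX_VALUE;
          for (int j = 0; j < instances.length; j++) {
              if (j == i || (group[j] == group[i]) != sameGroup) {
                  continue;
              }
              double dist = 0.0;
              for (int f = 0; f < instances[i].length; f++) {
                  dist += Math.abs(instances[i][f] - instances[j][f]);
              }
              if (dist < bestDist) {
                  bestDist = dist;
                  best = j;
              }
          }
          return best;
      }

      public static void main(String[] args) {
          // Toy data: feature 0 separates the groups; feature 1 does
          // not, so only feature 0 should end up with a large weight.
          double[][] runs = {{1, 50}, {2, 52}, {9, 51}, {8, 49}};
          boolean[] high = {true, true, false, false};
          double[] w = relief(runs, high, 20);
          System.out.printf("weights: %.2f %.2f%n", w[0], w[1]);
      }
  }

A sentence stating the Nighthawk analogue of this sketch's "instance"
and "group" is all I am asking for.
\end{verbatim}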
\begin{verbatim}
(h) p20, line 39: "For the two BitSet gene types, we printed only the
cardinality of the set." Why?
\end{verbatim}

\begin{verbatim}
(i) p21, line 20: If I understand correctly, "gene type" here does NOT
refer to class/primitive types, but instead to the 10 categories of
genes listed in Fig 6. In which case, "categories" or similar might have
been a better choice of terminology to avoid confusion with
class/primitive types (but a wholesale change of terminology is not
required at this point). Similarly, the notation t has previously been
used to refer to class/primitive types.
\end{verbatim}

\begin{verbatim}
(j) p21, line 47: "version 1.0 of Nighthawk" - while the meaning of 1.0
is explained later, this is the first reference to the versions, and
ideally the relevance could be explained here.
\end{verbatim}

\begin{verbatim}
(k) p24, line 52: "does show a statistically significant difference ..."
- suggest adding "at the 5% level".
\end{verbatim}

\begin{verbatim}
(l) p24, line 55: What metric is used for "time taken", and why doesn't
AUC adequately cover both coverage and time taken (using generations as
the measure of time taken)? In other words, does fixing gene types to
constant values significantly change the runtime? (If so, it may be
worth giving some illustration of the magnitude of such changes.)
\end{verbatim}

\begin{verbatim}
(m) p29, line 34: This paragraph doesn't really make sense (what is
"rapid turnover"?), and it may make overly strong claims based on
limited results.
\end{verbatim}

\begin{verbatim}
(n) p30, line 39: "Nighthawk v1.0" - a small point, but this is
inconsistent with section IV, where "v" is not used for the version.
\end{verbatim}

\begin{verbatim}
(o) p30, line 50: The statistical analysis is now appropriate and to be
commended (although, having applied a Wilcoxon test, a SW test and a
t-test are arguably unnecessary).
\end{verbatim}

\begin{verbatim}
Reviewer: 3
\end{verbatim}

\begin{verbatim}
Recommendation: Accept With No Changes
\end{verbatim}

\begin{verbatim}
Comments: The authors describe a hybrid system which uses a Genetic
Algorithm (GA) to optimize settings for a random unit testing tool (for
Java). To improve the performance of the GA, the authors propose to use
Feature Subset Selection (FSS) to focus the GA only on the most useful
genes.
\end{verbatim}

\begin{verbatim}
Overall, the manuscript is very well written and easy to follow. The
idea of using FSS to reduce the size of the search space is indeed very
interesting, and it would be worthwhile to see whether it can be applied
to GAs used to generate test data directly.
\end{verbatim}

\begin{verbatim}
I only have a few minor comments. The authors state that, in their
original work, FSS was only able to eliminate 60% of the gene types,
while in this submission they can eliminate 90%. If I am not mistaken,
this equates to eliminating 9 out of the 10 gene types used, thus
focusing on only one feature: the number of method calls to perform. In
a way, one can guess that this feature is important, since the authors
state on page 6 that they are generating long sequences of method calls.
(Whether long sequences are really preferable to shorter ones achieving
the same coverage is a matter for debate.) My concerns about the new FSS
analysis are the following. The authors state that the reason why the
original Nighthawk underperformed was a "bad" initial choice of value
for this feature (numberOfCalls). They then go on to argue that, once a
better initial value has been picked, the GA optimization does not
statistically significantly affect either the coverage or the efficiency
of the Nighthawk system (page 30). So why do the authors claim (on page
26) that changing the initial value of "numberOfCalls" results in a
"very different" behaviour of the Nighthawk system, resulting in two
different versions? The second concern I have is that, if 90% of the
features can be eliminated, does Nighthawk still need a GA at all? What
would the level of coverage for Nighthawk be if no GA were used at all
and Nighthawk were run with the better setting of 50 for
"numberOfCalls"?
\end{verbatim}

\begin{verbatim}
Other minor comments:

Page 10: It's not the case that previous GA-based approaches necessarily
result in fixed-size test suites. For example, the work by Wegener et
al. (IST, 43(14): 841-854, 2001) or McGraw et al. (TSE, 27(12):
1085-1110, 2001) takes "serendipitous" coverage into account. This means
that branches which are covered "by accident" as the GA tries to
optimize a target branch are removed from the list of targets, and the
GA does not attempt to cover them again. Since the GA is a stochastic
algorithm, it may end up generating different-sized test suites on
different runs.
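For illustration, the bookkeeping is roughly the following (a minimal
sketch; the names are my own invention, not Wegener et al.'s actual
code):

  import java.util.ArrayList;
  import java.util.LinkedHashSet;
  import java.util.List;
  import java.util.Set;

  // Sketch of "serendipitous" coverage bookkeeping: any branch covered
  // while the GA targets another branch is removed from the remaining
  // targets, so the resulting test suite size can vary between runs.
  public class SerendipitousCoverage {

      public static void main(String[] args) {
          Set<String> targets =
              new LinkedHashSet<>(List.of("b1", "b2", "b3", "b4"));
          List<String> suite = new ArrayList<>();
          while (!targets.isEmpty()) {
              String target = targets.iterator().next();
              // Run the GA aimed at 'target'; it reports every branch
              // the winning test happened to execute (stubbed here).
              Set<String> covered = runGaFor(target);
              suite.add("test for " + target);
              // Serendipitous step: drop all branches hit along the
              // way, so the GA never targets them again.
              targets.removeAll(covered);
          }
          // Prints 3, not 4: b2 was covered by accident.
          System.out.println("suite size: " + suite.size());
      }

      // Stub search: targeting b1 also covers b2 by accident.
      static Set<String> runGaFor(String target) {
          return target.equals("b1") ? Set.of("b1", "b2")
                                     : Set.of(target);
      }
  }
\end{verbatim}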
\begin{verbatim}
Page 23: At the bottom: "The thick black curves on those figures..."
-> There seems to be only one figure with a thick black curve?
\end{verbatim}