Hi Tim... Here is my analysis of the reviews. (BTW, in case you are curious, Felix Chun Hang Li was the MSc student who did the exploratory study. He finished last December and went off to a job in Toronto. His account was cancelled and I don't have a new email address for him, so you and I are the only authors that are in e-mail contact.)

> *=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=
>
> First reviewer's review:
>
> >>> Summary of the paper <<<
>
> The effectiveness of randomized unit testing (RUT) is highly dependent upon how "randomization" is applied, e.g. the number and frequency with which methods are called, the ranges from which parameter values are selected, etc. The paper promotes the use of a genetic algorithm in finding parameters which optimize with respect to a test coverage metric. The paper begins by reporting on an exploratory study which showed that combining a genetic algorithm (JDEAL) with a code coverage tool (Cobertura) could potentially generate parameter weights that were more optimal than when equal parameter weightings were used. Based upon this positive result, a system called Nighthawk was developed. Nighthawk, unlike the exploratory study, extended the use of the GA to include the selection of parameter ranges, reuse of previously calculated results, etc. Nighthawk also automates the generation of test wrappers, which provide the interface between the GA level and the coverage tool. Test wrappers can be customized, e.g. a wrapper can use a precondition to filter out particular patterns of method calls.
>
> A comparative evaluation with a purely genetic approach was undertaken with Myers' classic Triangle program. Nighthawk worked extremely well on this example. A comparison was also made with JPF (model checking) and Randoop (feedback-directed randomization); both comparisons were favourable.
>
> Nighthawk supports various switches relating to how it operates and generates test wrappers, e.g. how deep it explores the methods associated with the classes that it targets. In terms of test wrappers, switches provide support for checking with respect to the type of return values and method serialization. The effectiveness of these switches was evaluated via a case study (Java 1.5 Collection and Map classes). The switches, not surprisingly, allowed Nighthawk to achieve greater coverage. But what was surprising was that the enriched test wrappers did not significantly increase the required running time.
>
> >>> Points in favour or against <<<
>
> Addresses a challenging problem with impressive results. The evaluation contained a strong comparative element with rival techniques. The paper is well written and structured.

Summary: We rule. The reviewer probably gave an A.

> =*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=
>
> Second reviewer's review:
>
> >>> Summary of the paper <<<
>
> This paper presents a two-level approach to test data generation that combines random and evolutionary testing. The basic idea is to use genetic algorithms to identify an optimal configuration for random test data generators. The technique encodes configurations as chromosomes, and each configuration is evaluated based on the coverage achieved by a random test-data generator that uses that configuration.
> An empirical evaluation of the approach shows that it can be fairly effective and that it is at least as good as two alternative state-of-the-art techniques.
>
> >>> Comments <<<
>
> This is a nicely written, motivated, and evaluated paper. The idea of using GAs to configure random testing is new and interesting. The exploratory study nicely sets the stage for the rest of the work. And the empirical study is convincing. I especially appreciated the comparison with alternative approaches, which is a big plus and really helps in assessing the effectiveness of the technique. Overall, I only have a couple of fairly minor suggestions/comments.
>
> It would be nice to have two pictures, rather than two lists of numbers, to represent the search spaces in Figure 1.

Yep, I didn't have time to do the pictures, so I just did the tables of numbers. Do you think I should do pictures? It should be fairly easy to figure out how to do it in gnuplot.

> In the bulleted list in Section 4.2, the last item may be misleading. The authors should clarify that what is chosen is the element to be replaced in the pool.

Yes, this phrasing is unclear... I will fix it.

> It would be interesting to know if the authors plan to release their tools.

I will put something in the Conclusions saying that people can write to me for it.

> Typos:
>
> Section 3.3: "an t test"

Yep

> >>> Points in favour or against <<<
>
> Pros:
>
> - Nice idea
> - Well written
> - Nicely evaluated
>
> No major cons.

Summary: We rule some more. Probably another A.

The third reviewer is the kind of reviewer that I was afraid we would get: someone who assumes that randomized unit testing can't possibly work, gives arguments as to why it shouldn't work, and explains away the solid empirical evidence that it DOES work by saying that the subject software must be unrepresentative in some way. We're lucky that we got only one of these reviewers, rather than two or three. As it is, since the other two reviewers seem very favourable, the third reviewer probably prevented the paper from being nominated as a best paper.

The reviewer cites Tonella's paper and seems to know a lot about it, so I assume it's Tonella or someone close to him. It's very much written from the point of view of someone who believes that approaches like the one in Tonella's paper "should" work better. I had read Tonella's paper a while ago, but didn't cite that particular one (I cited Neelam Gupta's). So, I read Tonella's again, more carefully. Here's a summary.

For each branch direction of the unit under test, Tonella evolves a new test case that covers that branch. The fitness function involves a measure of how close the covered code is to the branch (so, he needs a static analyzer). A test case is a sequence of method calls, like what we are doing. It's certainly close in spirit to what we are doing, so it's reasonable to cite it and compare. But... the ways in which the test cases are randomly mutated by the GA are fixed and/or subject to hand-tuning.

- There is a fixed, hard-coded distribution of the lengths of the test cases. The probability that an evolved test case has n method calls is about 0.5^(n^2), and there's no way of adjusting that.
- The user picks ranges for the int and float parameters, or uses hard-coded ranges; the ranges are not evolved.
- There's no discussion of weights for the different methods, so I assume they're all fixed at the same weight.

Nighthawk performs evolution on each of these aspects of the test case construction.
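(Tim, in case it helps when we draft the response: here is a rough sketch, in throwaway Java, of how I think of the two-level structure. All the names and numbers below are made up for illustration -- this is NOT Nighthawk's actual code -- but it shows the division of labour: the GA's individuals are configurations of the random tester, and a configuration's fitness is simply the coverage that the random tester achieves when run under that configuration.)

    import java.util.Random;

    // Rough sketch of the two-level idea (hypothetical names, not Nighthawk's real code):
    // the upper level is a GA over "configuration" chromosomes; the lower level runs
    // randomized testing under a given configuration and reports the coverage achieved,
    // which serves as that chromosome's fitness.
    public class TwoLevelSketch {

        // One chromosome = one configuration of the random tester.
        static class Chromosome {
            int testCaseLength;         // number of method calls per test case (evolved, not hard-coded)
            double[] methodWeights;     // relative probability of calling each target method
            int intRangeLo, intRangeHi; // range from which int parameters are drawn
            // ...plus, in the real tool, genes for things like value reuse policies.
        }

        // Stand-in for "run the random tester with this configuration and measure coverage".
        // In the real system this would drive the units under test and ask a coverage tool
        // how many lines were hit; here it is just a placeholder interface.
        interface CoverageEvaluator {
            double coverageAchievedBy(Chromosome config);
        }

        static final Random rng = new Random();

        // Upper level: a bare-bones evolutionary loop over configurations.
        static Chromosome evolve(CoverageEvaluator eval, Chromosome initial, int generations) {
            Chromosome best = initial;
            double bestFitness = eval.coverageAchievedBy(best);   // fitness = coverage
            for (int g = 0; g < generations; g++) {
                Chromosome candidate = mutate(best);
                double fitness = eval.coverageAchievedBy(candidate);
                if (fitness > bestFitness) { best = candidate; bestFitness = fitness; }
            }
            return best;  // the winning configuration, handed to the random tester afterwards
        }

        // Mutation tweaks the parameters of randomized testing, not individual test cases.
        static Chromosome mutate(Chromosome c) {
            Chromosome m = new Chromosome();
            m.testCaseLength = Math.max(1, c.testCaseLength + rng.nextInt(5) - 2);
            m.methodWeights = c.methodWeights.clone();
            int i = rng.nextInt(m.methodWeights.length);
            m.methodWeights[i] = Math.max(0.0, m.methodWeights[i] + rng.nextGaussian());
            m.intRangeLo = c.intRangeLo - rng.nextInt(3);
            m.intRangeHi = c.intRangeHi + rng.nextInt(3);
            return m;
        }

        public static void main(String[] args) {
            Chromosome start = new Chromosome();
            start.testCaseLength = 10;
            start.methodWeights = new double[] { 1.0, 1.0, 1.0 };
            start.intRangeLo = -10;
            start.intRangeHi = 10;
            // Dummy evaluator standing in for "run random tests, ask the coverage tool":
            CoverageEvaluator fake = cfg -> Math.min(1.0, cfg.testCaseLength / 50.0);
            System.out.println("best length = " + evolve(fake, start, 100).testCaseLength);
        }
    }

(The real chromosome has more genes than this, of course -- the value reuse policies and so on -- but the shape of the loop is the point.)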
The empirical results that he gives are hard to interpret and not (IMHO) very impressive. He does use java.util classes, but he reports only on coverage of branches in public methods. This is almost meaningless, since lots of code is in the private methods that are accessed via the public methods, and he maintains a noble silence about how he performed on that code. Furthermore, sometimes the numbers he gives as "number of branches / number of branches covered" in a unit are odd and sometimes even, so I have no idea what he's counting (shouldn't the number of branches always be even?). In contrast, we report on coverage of all the lines of code in the entire source file, including in things like private methods in inner classes.

The timing numbers are a bit disappointing too. It took him 1 hour and 22 minutes of compute time to cover everything that he covered for one of the units (BitSet). This was about 4 years ago, but still.

He does some fault seeding and effectiveness measurement. This is not very impressive either -- 27 out of 35, or about 77%, of mutants killed. This is sort of what I expect. He is evolving (without this being his intended goal) a minimal test suite that covers all branches. Minimal covering test suites have been shown by Rothermel and others to be not very good (there is some discussion of this in my Random Testing workshop paper that I cite). Bare-bones white box coverage is not sufficient -- we need something more. Randomized unit testing that achieves high coverage gets you that something more.

This is a point that the reviewer misses -- I didn't point it out strongly, so I should. Once Nighthawk has evolved the parameters for the random testing, we can quickly crank out thousands of DIFFERENT high-coverage test cases. I have shown in previous papers (like the Random Testing paper) that this is what leads to the greater effectiveness of randomized testing. Approaches like Tonella's that evolve individual test cases can generate one, minimal, covering test suite. To do what Nighthawk does using tools like Tonella's, we would have to re-run the test suite generation many times, generating (hopefully) different covering test suites. Gosh darn that 1 hour and 22 minute running time.

Anyway, enough of this rambling, let me get on to his specific points.

> =*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=
>
> Third reviewer's review:
>
> >>> Summary of the paper <<<
>
> The paper presents an approach for generating unit tests based on combining generic algorithm and random testing. The generic algorithm is used to find the optimal configuration for random testing. Then random testing is applied to generate test data. The proposed approach has been implemented and compared with results achieved by existing techniques (found in the literature). The results show that the proposed approach achieve the same coverage as the existing approaches.
>
> >>> Comments <<<
>
> I like the high-level idea of using genetic algorithm to find out the optimal configuration for random testing because the adopted configuration would affect the effectiveness of random testing. The idea is neat.
>
> But there are some issues and unclear places in the paper to be improved, as listed below.
>
> The high level question is that generic algorithms also incorporate randomness or random testing in the process. Can the proposed idea be turned into a form of improved generic algorithms rather than the two phases of generic algorithm+random testing?
The reviewer keeps saying "generic" instead of "genetic" here, but he clearly knows what he's talking about... and actually "generic" is not an inappropriate word to use here. The point that he is missing is that randomized testing that achieves high coverage is more effective than generating individual test suites, either by genetic algorithms or any other method. We just happen to be using a genetic algorithm to find the parameters that allow the randomized testing to achieve high coverage. I'll try to bring it across.

> The evaluation shows that the proposed approach achieve the same coverage levels as the existing techniques. Then why do we need another one technique? We need to have a technique that can achieve better coverage or achieve the same coverage with less cost. If the proposed technique requires less cost (the authors seem to hint on that), then the authors need to explicitly state it in the abstract and introduction.

Actually, the Michael paper and the Visser paper dealt with fairly "toy" examples that are easy to cover with many techniques. The Pacheco paper dealt with the same Java classes as the Nighthawk paper, but they didn't reveal what coverage they achieved. This may be because the coverage wasn't very high; I don't know.

> The arguments in the last paragraph of Section 2.2 are not strong. The details are at the level of implementation. The benefits of the tool implementation based on Java reflection are not that big a deal. The authors oversell this point.

Well, if he thinks that it's not a big deal to have to parse the source code, do a static analysis of the source code and link up the product of that static analysis with the test suite evolver, then all the best to him. But I still think it's an advantage.

> The argument conveyed by Figure 1 seems not that general. The situation seems to work only on a few cases like x==y. How about general cases? Can we see the trend as well?

I don't know if we need to respond to this. Figure 1 is just an example to give an intuition about why the approach works when it does work, compared to evolving test cases for specific decisions. The irony is that Tonella's paper uses an approach to this particular problem that is LESS general than ours -- it amounts to randomly selecting x and y from a user-set range.

> The authors need to cite and compare with the following paper:
>
> Tonella, P. 2004. Evolutionary testing of classes. In Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis (Boston, Massachusetts, USA, July 11-14, 2004). ISSTA '04. ACM Press, New York, NY, 119-128. DOI= http://doi.acm.org/10.1145/1007512.1007528

It's reasonable to compare to this paper, so I will do so.

> The fitness function defined by the approach does not provide good guidance (in a black box way). eToc (described in the above ISSTA 04 paper) provides a more reasonable fitness function (based on coverage of relevant dependencies with regards to uncovered locations) for guiding the search. I have a strong feeling that the proposed fitness function in this paper may work well (i.e., overfit) on those data structure subjects mostly but not be general enough.

The above is outrageous innuendo and supposition with absolutely no basis in reality. Why would the data structure subjects be uncharacteristic? He gives no argument as to why.

> Assume that I have a special branch that requires specific test data to cover but is uncovered at the moment.
> eToc would work fine on it because its fitness function is defined w.r.t. the uncovered branch and the search is guided to gradually converge to the path that is required to cover the uncovered branch. But the proposed fitness function has no strong guidance without specifically targeting the uncovered branch.

Yeah, because it's targeting all the code at once! Anything that gets it to cover more code, whether it is close to some chosen branch or not, will result in greater fitness for the chromosome. A test case that would receive greater fitness under Tonella's fitness function would also receive greater fitness under ours, simply because it covers more code.

> The data structure subjects used in the experiment mostly require only put or remove (then one final invocation of a method containing the uncovered branch) to set up the required method sequences for desirable receiver object states.

Incorrect. If he had looked at the code (it's publicly available to everyone) he would have seen that there are many, many complex methods in addition to put and remove. For instance, many of the classes have methods that return Iterators and Maps, and they implement inner classes for the Iterator and Map operations. Maybe he thinks we're considering only the code in the public methods, like Tonella does.

> In the second paragraph of Section 3.3, one reason why remove and put are favored is that covering other observer methods may also need these methods to construct or destruct the data structure states.

This is certainly true. I could add some words to that effect.

> In the third paragraph of Section 3.3, the authors found a bug in BitSet. What is the test oracle being used to find the bug? I haven't seen that the authors describe any test oracle.

OK, this could be clearer. The oracles were in the test wrapper methods (hand-coded in that study).

> Putting an exploratory study in the paper (Section 3) is a bit uncommon. I understand the research project development indeed goes through such a process, but the authors may not need to describe the whole process of research development, only the end product (research results), in the paper.

However, the exploratory study indicates that even when we just consider test case length and method weights, we get good results. This shows that the GA level is doing something, and that not everything is attributable to aspects of the overall strategy of the lower level, like the value reuse policies.

> In Section 4.1, "All types of receivers, parameters and return values of methods in M" Why do you need to collect return values of methods in M? If the return values are not of any non-primitive-type arguments of other methods, you don't need to collect them.

Strictly speaking this is true, but it's more convenient to consider the return types to be types of interest so as to provide uniform processing of return values. The only problem with this is that it might cause inefficiency in the startup phase, as initial values are put into value pools that do not get used by anything. Do you think I should mention this?

> "All primitive types that are the types of parameters to constructors of other types of interest." WHY only constructors and not also all normal methods?
>
> For non-primitive type arguments, if their desirable object states require specific method sequences (beyond just constructor calls) to construct, can the proposed techniques (Section 4.1) deal with the situation? Seems not.

Incorrect.
He's thinking that we're testing classes rather than groups of methods, and he doesn't understand the --deep switch. The set M of target methods is the thing we are targeting. We can put any methods in there that we want. However, for convenience the user is given two options. They can specify class names on the command line without the --deep switch, and then M is all the methods in those classes. Or, they can specify the --deep switch too, and then M is all the methods in those classes, plus all the methods in the classes of parameters of those methods. Should I clarify this in the paper?

> In Section 4.1, I don't get the naming intention of "initializer" and "reinitializer"

Do you also think this is unclear? The initializers are called right at the beginning. The reinitializers can be called at any time in order to clobber a value and replace it by a new value.

> I don't fully understand the genetic algorithm part well. The generic algorithm needs to evaluate fitness functions. I assumed that the random test generation would happen here to see which configuration would be better. But the authors describe the random test generation as happening after the optimal configuration is found. Then how to evaluate the fitness function values during the search in the genetic algorithm?

It does happen in the fitness function evaluation. I'm not sure why this is unclear to him. In the evaluation of how well Nighthawk performed as a whole, after the winning chromosome is found, I run the random test generation several times to see what coverage it achieves, but this is separate from the fitness function. Do you think this is the point of unclarity? Can you suggest changes to make it clearer?

> In Section 4.3, "... mutate the population" The authors need to give more details on how to mutate.

I didn't give any details on how much the parameters were mutated, etc. Do you think I should include it? BTW, when we use the "tight" article style, I think it will leave out the spaces between the paragraphs and we may get as much as half a page extra to work with.

> Section 4.5 contains a lot of low-level details that can be left out.

This is what contains the information about how oracles come into it. Without that information, we would have been open to the criticism that we don't consider how to evaluate the test results. I think it has to stay in.

> In the end of Section 5.1, "value reuse policies that guaranteed frequent selection of the same values." I suspect that the effectiveness of a technique on the data structure subjects used in the experiment would be heavily dependent on the value reuse policies instead of the proposed approach of combining genetic algorithm and random testing. It is understandable because these data structures usually need equality or ordering relationships among data elements to cover various code parts. So any technique that favors these types of data would work well. So I wonder whether the good results in the experiment are due to this factor instead of the proposed approach.

:-) This seems to be echoing your opinions somewhat. Or, I could be misunderstanding you and/or the reviewer. :-)

Here's how I look at it. Before Nighthawk, I was hand-coding value reuse ("stickiness") policies. That is, I wrote specific lines of code in each test wrapper that said where to put return values and where to get parameters from, sometimes governed by random variables.
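(To make that concrete for you, here is roughly the kind of thing I mean, as an illustrative sketch only -- made-up names and numbers, not the actual wrapper code. The first method is the old hand-coded style; the second shows the same decision with its parameters exposed so that genes on the chromosome can control them.)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Illustration only (made-up names): a hand-coded value reuse ("stickiness") policy
    // versus the same decision with its parameters exposed as genes.
    class StickinessSketch {
        static final Random rng = new Random();
        static final List<Integer> keyPool = new ArrayList<>();

        // Old, hand-coded style: the reuse probability, range and pool handling are fixed
        // in the wrapper code, and have to be re-tuned by hand for every unit.
        static int chooseKeyHandCoded() {
            if (!keyPool.isEmpty() && rng.nextDouble() < 0.75) {   // magic number, hand-tuned
                return keyPool.get(rng.nextInt(keyPool.size()));   // reuse an old value
            }
            int fresh = rng.nextInt(100) - 50;                     // hand-picked range
            keyPool.add(fresh);
            return fresh;
        }

        // Gene-controlled style: the same decision, but the reuse probability, range and
        // pool cap come from the chromosome, so the GA can tune them per unit.
        static int chooseKeyFromGenes(double reuseProb, int lo, int hi, int poolCap) {
            if (!keyPool.isEmpty() && rng.nextDouble() < reuseProb) {
                return keyPool.get(rng.nextInt(keyPool.size()));
            }
            int fresh = lo + rng.nextInt(Math.max(1, hi - lo + 1));
            if (keyPool.size() < poolCap) keyPool.add(fresh);
            return fresh;
        }

        public static void main(String[] args) {
            for (int i = 0; i < 5; i++) {
                System.out.println(chooseKeyHandCoded() + " / " + chooseKeyFromGenes(0.9, 0, 5, 8));
            }
        }
    }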
The referees for earlier papers didn't like this hand-coding, because they said it seemed too difficult and clumsy. So with Nighthawk, I worked out a way of encoding a value reuse policy as a set of genes, and used a genetic algorithm to figure out which policy was best. Of course the idea of stickiness is important. But I can't compare [GA + stickiness] with [stickiness alone], because there's no way to separate the two without adopting some fixed stickiness policy, and stickiness policies can vary widely in effectiveness. Does that make it clearer? Should I put some of this in the paper?

> In Section 5.2, "Nighthawk runs faster than JPF..." The authors need to give statistics on the runtime cost rather than just vaguely giving out the conclusion. In addition, did the authors run these tools side by side on the same machine to measure the runtime cost or just use the published results in the literature? It seems the latter (otherwise, I would expect the authors to show details on configuring JPF or other tools; in addition, I don't think Randoop is publicly available). Then how can you be sure that the machines being used by other tools have the same configuration as the one used in this experiment?

True nuff. It's on a different platform. I forgot to insert verbiage about how this comparison might not be valid.

> In Figure 4, the authors should include the breakdown of time on the two phases: generic algorithm and random testing.

I wonder if he means the random testing performed after the winning chromosome is found. He might be assuming that it takes forever. Actually the time is insignificant -- I think 10 seconds at most. Should I do what he asks? (One more column will fit, though not 4 more.)

> In Section 5.2, why does using the restricted wrappers or full target classes make a big difference in the results?

This is explained... dunno what more he needs.

> In Section 5.2, saying and citing that 70-80% coverage is a reasonable goal is not convincing to show that the proposed tool is good enough. The data structure benchmarks are not that difficult to cover and the method arguments are mostly primitive types or primitive-type objects.

Incorrect. Spurious. He should give specific data that backs up his claim rather than just gainsaying what I say in my descriptions of the subject software.

> General real world classes are far more complicated with reasonable 70-80% coverage. But 70-80% coverage here is not necessarily good enough.

He's saying that real-world classes are more complex than the Java 1.5.0 java.util map and collection classes? Has he actually looked at them?

> In the experiment, the authors should compare the new approach with the authors' previous approach (simply random).

But there's no way of doing this without encoding specific value reuse policies, fixing parameter ranges and so on, for each and every unit. How can I do that? I think we can safely ignore this request.

> >>> Points in favour or against <<<
>
> Favor:
>
> An important problem is being addressed. The idea of combining genetic algorithm and random testing is very interesting.
>
> Against:
>
> The evaluation is not convincing enough; the experimental results may overfit only data structure subjects, and some other factors (like value reuse policies) may dominate in contributing to the effectiveness of the results instead of the proposed approach.
> =*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=

The "against" points are just supposition -- supposition that is motivated by unscientific prejudice in favour of Tonella's algorithm, not backed up by hard data, and ignoring the hard data that we give. We have shown that we do very well on the very same package that Tonella studies. The best that the reviewer can do is to dream up some vague reason why Tonella's approach might be more general than ours.

cheers --Jamie.