Hi Tim... Here is my analysis of the reviews. (BTW, in case you are curious, Felix Chun Hang Li was the MSc student who did the exploratory study. He finished last December and went off to a job in Toronto. His account was cancelled and I don't have a new email address for him, so you and I are the only authors that are in e-mail contact.)

> *=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=
>
> First reviewer's review:
>
> >>> Summary of the paper <<<
>
> The effectiveness of randomized unit testing (RUT) is highly dependent upon how "randomization" is applied, e.g. the number and frequency with which methods are called, the ranges from which parameter values are selected, etc. The paper promotes the use of a genetic algorithm in finding parameters which optimize with respect to a test coverage metric. The paper begins by reporting on an exploratory study which showed that combining a genetic algorithm (JDEAL) with a code coverage tool (Cobertura) could potentially generate parameter weights that were more optimal than when equal parameter weightings were used. Based upon this positive result, a system called Nighthawk was developed. Nighthawk, unlike the exploratory study, extended the use of the GA to include the selection of parameter ranges, reuse of previously calculated results, etc. Nighthawk also automates the generation of test wrappers, which provide the interface between the GA level and the coverage tool. Test wrappers can be customized, e.g. a wrapper can use a precondition to filter out particular patterns of method calls.
>
> A comparative evaluation with a purely genetic approach was undertaken with Myers' classic Triangle program. Nighthawk worked extremely well on this example. A comparison was also made with JPF (model checking) and Randoop (feedback-directed randomization); both comparisons were favourable.
>
> Nighthawk supports various switches relating to how it operates and generates test wrappers, e.g. how deep it explores the methods associated with the classes that it targets. In terms of test wrappers, switches provide support for checking with respect to the type of return values and method serialization. The effectiveness of these switches was evaluated via a case study (Java 1.5 Collection and Map classes). The switches, not surprisingly, allowed Nighthawk to achieve greater coverage. But what was surprising was that the enriched test wrappers did not significantly increase the required running time.
>
> >>> Points in favour or against <<<
>
> Addresses a challenging problem with impressive results. The evaluation contained a strong comparative element with rival techniques. The paper is well written and structured.

Summary: We rule. The reviewer probably gave an A.

> =*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=
>
> Second reviewer's review:
>
> >>> Summary of the paper <<<
>
> This paper presents a two-level approach to test data generation that combines random and evolutionary testing. The basic idea is to use genetic algorithms to identify an optimal configuration for random test data generators. The technique encodes configurations as chromosomes, and each configuration is evaluated based on the coverage achieved by a random test-data generator that uses that configuration.
> An empirical evaluation of the approach shows that it can be fairly effective and that it is at least as good as two alternative state-of-the-art techniques.
>
> >>> Comments <<<
>
> This is a nicely written, motivated, and evaluated paper. The idea of using GAs to configure random testing is new and interesting. The exploratory study nicely sets the stage for the rest of the work. And the empirical study is convincing. I especially appreciated the comparison with alternative approaches, which is a big plus and really helps in assessing the effectiveness of the technique. Overall, I only have a couple of fairly minor suggestions/comments.
>
> It would be nice to have two pictures, rather than two lists of numbers, to represent the search spaces in Figure 1.

Yep, I didn't have time to do the pictures, so I just did the tables of numbers. Do you think I should do pictures? It should be fairly easy to figure out how to do it in gnuplot.

> In the bulleted list in Section 4.2, the last item may be misleading. The authors should clarify that what is chosen is the element to be replaced in the pool.

Yes, this phrasing is unclear... I will fix it.

> It would be interesting to know if the authors plan to release their tools.

I will put something in the Conclusions saying that people can write to me for it.

> Typos:
>
> Section 3.3: "an t test"

Yep

> >>> Points in favour or against <<<
>
> Pros:
>
> - Nice idea
> - Well written
> - Nicely evaluated
>
> No major cons.

Summary: We rule some more. Probably another A.

The third reviewer is the kind of reviewer that I was afraid we would get: someone who assumes that randomized unit testing can't possibly work, gives arguments as to why it shouldn't work, and explains away the solid empirical evidence that it DOES work by saying that the subject software must be unrepresentative in some way. We're lucky that we got only one of these reviewers, rather than two or three. As it is, since the other two reviewers seem very favourable, the third reviewer probably prevented the paper from being nominated as a best paper.

The reviewer cites Tonella's paper and seems to know a lot about it, so I assume it's Tonella or someone close to him. It's very much written from the point of view of someone who believes that approaches like the one in Tonella's paper "should" work better. I had read Tonella's paper a while ago, but didn't cite that particular one (I cited Neelam Gupta's). So, I read Tonella's again, more carefully. Here's a summary.

For each branch direction of the unit under test, Tonella evolves a new test case that covers that branch. The fitness function involves a measure of how close the covered code is to the branch (so, he needs a static analyzer). A test case is a sequence of method calls, like what we are doing. It's certainly close in spirit to what we are doing, so it's reasonable to cite it and compare. But... the ways in which the test cases are randomly mutated by the GA are fixed and/or subject to hand-tuning.

- There is a fixed, hard-coded distribution of the lengths of the test cases. The probability that an evolved test case has n method calls is about 0.5^(n^2), and there's no way of adjusting that.
- The user picks ranges for the int and float parameters, or uses hard-coded ranges; the ranges are not evolved.
- There's no discussion of weights for the different methods, so I assume they're all fixed at the same weight.

Nighthawk performs evolution on each of these aspects of the test case construction.
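(Tim, in case it helps when we draft the response: here is a rough sketch, in throwaway Java, of how I think of the two-level structure. All the names and numbers below are made up for illustration -- this is NOT Nighthawk's actual code -- but it shows the division of labour: the GA's individuals are configurations of the random tester, and a configuration's fitness is simply the coverage that the random tester achieves when run under that configuration.)

    import java.util.Random;

    // Rough sketch of the two-level idea (hypothetical names, not Nighthawk's real code):
    // the upper level is a GA over "configuration" chromosomes; the lower level runs
    // randomized testing under a given configuration and reports the coverage achieved,
    // which serves as that chromosome's fitness.
    public class TwoLevelSketch {

        // One chromosome = one configuration of the random tester.
        static class Chromosome {
            int testCaseLength;         // number of method calls per test case (evolved, not hard-coded)
            double[] methodWeights;     // relative probability of calling each target method
            int intRangeLo, intRangeHi; // range from which int parameters are drawn
            // ...plus, in the real tool, genes for things like value reuse policies.
        }

        // Stand-in for "run the random tester with this configuration and measure coverage".
        // In the real system this would drive the units under test and ask a coverage tool
        // how many lines were hit; here it is just a placeholder interface.
        interface CoverageEvaluator {
            double coverageAchievedBy(Chromosome config);
        }

        static final Random rng = new Random();

        // Upper level: a bare-bones evolutionary loop over configurations.
        static Chromosome evolve(CoverageEvaluator eval, Chromosome initial, int generations) {
            Chromosome best = initial;
            double bestFitness = eval.coverageAchievedBy(best);   // fitness = coverage
            for (int g = 0; g < generations; g++) {
                Chromosome candidate = mutate(best);
                double fitness = eval.coverageAchievedBy(candidate);
                if (fitness > bestFitness) { best = candidate; bestFitness = fitness; }
            }
            return best;  // the winning configuration, handed to the random tester afterwards
        }

        // Mutation tweaks the parameters of randomized testing, not individual test cases.
        static Chromosome mutate(Chromosome c) {
            Chromosome m = new Chromosome();
            m.testCaseLength = Math.max(1, c.testCaseLength + rng.nextInt(5) - 2);
            m.methodWeights = c.methodWeights.clone();
            int i = rng.nextInt(m.methodWeights.length);
            m.methodWeights[i] = Math.max(0.0, m.methodWeights[i] + rng.nextGaussian());
            m.intRangeLo = c.intRangeLo - rng.nextInt(3);
            m.intRangeHi = c.intRangeHi + rng.nextInt(3);
            return m;
        }

        public static void main(String[] args) {
            Chromosome start = new Chromosome();
            start.testCaseLength = 10;
            start.methodWeights = new double[] { 1.0, 1.0, 1.0 };
            start.intRangeLo = -10;
            start.intRangeHi = 10;
            // Dummy evaluator standing in for "run random tests, ask the coverage tool":
            CoverageEvaluator fake = cfg -> Math.min(1.0, cfg.testCaseLength / 50.0);
            System.out.println("best length = " + evolve(fake, start, 100).testCaseLength);
        }
    }

(The real chromosome has more genes than this, of course -- the value reuse policies and so on -- but the shape of the loop is the point.)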
The empirical results that he gives are hard to interpret and not (IMHO) very impressive. He does use java.util classes, but he reports only on coverage of branches in public methods. This is almost meaningless, since lots of code is in the private methods that are accessed via the public methods, and he maintains a noble silence about how he performed on that code. Furthermore, sometimes the numbers he gives as "number of branches / number of branches covered" in a unit are odd and sometimes even, so I have no idea what he's counting (shouldn't the number of branches always be even?). In contrast, we report on coverage of all the lines of code in the entire source file, including in things like private methods in inner classes.

The timing numbers are a bit disappointing too. It took him 1 hour and 22 minutes of compute time to cover everything that he covered for one of the units (BitSet). This was about 4 years ago, but still.

He does some fault seeding and effectiveness measurement. This is not very impressive either -- 27 out of 35, or about 77%, of mutants killed. This is sort of what I expect. He is evolving (without this being his intended goal) a minimal test suite that covers all branches. Minimal covering test suites have been shown by Rothermel and others to be not very good (there is some discussion of this in my Random Testing workshop paper that I cite). Bare-bones white box coverage is not sufficient -- we need something more. Randomized unit testing that achieves high coverage gets you that something more.

This is a point that the reviewer misses -- I didn't point it out strongly, so I should. Once Nighthawk has evolved the parameters for the random testing, we can quickly crank out thousands of DIFFERENT high-coverage test cases. I have shown in previous papers (like the Random Testing paper) that this is what leads to the greater effectiveness of randomized testing. Approaches like Tonella's that evolve individual test cases can generate one, minimal, covering test suite. To do what Nighthawk does using tools like Tonella's, we would have to re-run the test suite generation many times, generating (hopefully) different covering test suites. Gosh darn that 1 hour and 22 minute running time.

Anyway, enough of this rambling, let me get on to his specific points.

> =*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=
>
> Third reviewer's review:
>
> >>> Summary of the paper <<<
>
> The paper presents an approach for generating unit tests based on combining generic algorithm and random testing. The generic algorithm is used to find the optimal configuration for random testing. Then random testing is applied to generate test data. The proposed approach has been implemented and compared with results achieved by existing techniques (found in the literature). The results show that the proposed approach achieve the same coverage as the existing approaches.
>
> >>> Comments <<<
>
> I like the high-level idea of using genetic algorithm to find out the optimal configuration for random testing because the adopted configuration would affect the effectiveness of random testing. The idea is neat.
>
> But there are some issues and unclear places in the paper to be improved, as listed below.
>
> The high level question is that generic algorithms also incorporate randomness or random testing in the process. Can the proposed idea be turned into a form of improved generic algorithms rather than the two phases of generic algorithm+random testing?
The reviewer keeps saying "generic" instead of "genetic" here, but he clearly knows what he's talking about... and actually "generic" is not an inappropriate word to use here. The point that he is missing is that randomized testing that achieves high coverage is more effective than generating individual test suites, either by genetic algorithms or any other method. We just happen to be using a genetic algorithm to find the parameters that allow the randomized testing to achieve high coverage. I'll try to bring it across.

> The evaluation shows that the proposed approach achieve the same coverage levels as the existing techniques. Then why do we need another one technique? We need to have a technique that can achieve better coverage or achieve the same coverage with less cost. If the proposed technique requires less cost (the authors seem to hint on that), then the authors need to explicitly state it in the abstract and introduction.

Actually, the Michael paper and the Visser paper dealt with fairly "toy" examples that are easy to cover with many techniques. The Pacheco paper dealt with the same Java classes as the Nighthawk paper, but they didn't reveal what coverage they achieved. This may be because the coverage wasn't very high; I don't know.

> The arguments in the last paragraph of Section 2.2 are not strong. The details are at the level of implementation. The benefits of the tool implementation based on Java reflection are not that big a deal. The authors oversell this point.

Well, if he thinks that it's not a big deal to have to parse the source code, do a static analysis of the source code and link up the product of that static analysis with the test suite evolver, then all the best to him. But I still think it's an advantage.

> The argument conveyed by Figure 1 seems not that general. The situation seems to work only on a few cases like x==y. How about general cases? Can we see the trend as well?

I don't know if we need to respond to this. Figure 1 is just an example to give an intuition about why the approach works when it does work, compared to evolving test cases for specific decisions. The irony is that Tonella's paper uses an approach to this particular problem that is LESS general than ours -- it amounts to randomly selecting x and y from a user-set range.

> The authors need to cite and compare with the following paper:
>
> Tonella, P. 2004. Evolutionary testing of classes. In Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis (Boston, Massachusetts, USA, July 11-14, 2004). ISSTA '04. ACM Press, New York, NY, 119-128. DOI= http://doi.acm.org/10.1145/1007512.1007528

It's reasonable to compare to this paper, so I will do so.

> The fitness function defined by the approach does not provide good guidance (in a black box way). eToc (described in the above ISSTA 04 paper) provides a more reasonable fitness function (based on coverage of relevant dependencies with regards to uncovered locations) for guiding the search. I have a strong feeling that the proposed fitness function in this paper may work well (i.e., overfit) on those data structure subjects mostly but not be general enough.

The above is outrageous innuendo and supposition with absolutely no basis in reality. Why would the data structure subjects be uncharacteristic? He gives no argument as to why.

> Assume that I have a special branch that requires specific test data to cover but is uncovered at the moment.
> eToc would work fine on it because its fitness function is defined w.r.t. the uncovered branch and the search is guided to gradually converge to the path that is required to cover the uncovered branch. But the proposed fitness function has no strong guidance without specifically targeting the uncovered branch.

Yeah, because it's targeting all the code at once! Anything that gets it to cover more code, whether it is close to some chosen branch or not, will result in greater fitness for the chromosome. A test case that would receive greater fitness under Tonella's fitness function would also receive greater fitness under ours, simply because it covers more code.

> The data structure subjects used in the experiment mostly require only put or remove (then one final invocation of a method containing the uncovered branch) to set up the required method sequences for desirable receiver object states.

Incorrect. If he had looked at the code (it's publicly available to everyone) he would have seen that there are many, many complex methods in addition to put and remove. For instance, many of the classes have methods that return Iterators and Maps, and they implement inner classes for the Iterator and Map operations. Maybe he thinks we're considering only the code in the public methods, like Tonella does.

> In the second paragraph of Section 3.3, one reason why remove and put are favored is that covering other observer methods may also need these methods to construct or destruct the data structure states.

This is certainly true. I could add some words to that effect.

> In the third paragraph of Section 3.3, the authors found a bug in BitSet. What is the test oracle being used to find the bug? I haven't seen that the authors describe any test oracle.

OK, this could be clearer. The oracles were in the test wrapper methods (hand-coded in that study).

> Putting an exploratory study in the paper (Section 3) is a bit uncommon. I understand the research project development indeed goes through such a process, but the authors may not need to describe the whole process of research development, only the end product (research results), in the paper.

However, the exploratory study indicates that even when we just consider test case length and method weights, we get good results. This shows that the GA level is doing something, and that not everything is attributable to aspects of the overall strategy of the lower level, like the value reuse policies.

> In Section 4.1, "All types of receivers, parameters and return values of methods in M" Why do you need to collect return values of methods in M? If the return values are not of any non-primitive-type arguments of other methods, you don't need to collect them.

Strictly speaking this is true, but it's more convenient to consider the return types to be types of interest so as to provide uniform processing of return values. The only problem with this is that it might cause inefficiency in the startup phase, as initial values are put into value pools that do not get used by anything. Do you think I should mention this?

> "All primitive types that are the types of parameters to constructors of other types of interest." WHY only constructors and not also all normal methods?
>
> For non-primitive type arguments, if their desirable object states require specific method sequences (beyond just constructor calls) to construct, can the proposed techniques (Section 4.1) deal with the situation? Seems not.

Incorrect.
He's thinking that we're testing classes rather than groups of methods, and he doesn't understand the --deep switch. The set M of target methods is the thing we are targeting. We can put any methods in there that we want. However, for convenience the user is given two options. They can specify class names on the command line without the --deep switch, and then M is all the methods in those classes. Or, they can specify the --deep switch too, and then M is all the methods in those classes, plus all the methods in the classes of parameters of those methods. Should I clarify this in the paper?

> In Section 4.1, I don't get the naming intention of "initializer" and "reinitializer"

Do you also think this is unclear? The initializers are called right at the beginning. The reinitializers can be called at any time in order to clobber a value and replace it by a new value.

> I don't fully understand the genetic algorithm part well. The generic algorithm needs to evaluate fitness functions. I assumed that the random test generation would happen here to see which configuration would be better. But the authors describe the random test generation as happening after the optimal configuration is found. Then how to evaluate the fitness function values during the search in the genetic algorithm?

It does happen in the fitness function evaluation. I'm not sure why this is unclear to him. In the evaluation of how well Nighthawk performed as a whole, after the winning chromosome is found, I run the random test generation several times to see what coverage it achieves, but this is separate from the fitness function. Do you think this is the point of unclarity? Can you suggest changes to make it clearer?

> In Section 4.3, "... mutate the population" The authors need to give more details on how to mutate.

I didn't give any details on how much the parameters were mutated, etc. Do you think I should include it? BTW, when we use the "tight" article style, I think it will leave out the spaces between the paragraphs and we may get as much as half a page extra to work with.

> Section 4.5 contains a lot of low-level details that can be left out.

This is what contains the information about how oracles come into it. Without that information, we would have been open to the criticism that we don't consider how to evaluate the test results. I think it has to stay in.

> In the end of Section 5.1, "value reuse policies that guaranteed frequent selection of the same values." I suspect that the effectiveness of a technique on the data structure subjects used in the experiment would be heavily dependent on the value reuse policies instead of the proposed approach of combining genetic algorithm and random testing. It is understandable because these data structures usually need equality or ordering relationships among data elements to cover various code parts. So any technique that favors these types of data would work well. So I wonder whether the good results in the experiment are due to this factor instead of the proposed approach.

:-) This seems to be echoing your opinions somewhat. Or, I could be misunderstanding you and/or the reviewer. :-)

Here's how I look at it. Before Nighthawk, I was hand-coding value reuse ("stickiness") policies. That is, I wrote specific lines of code in each test wrapper that said where to put return values and where to get parameters from, sometimes governed by random variables.
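(To make that concrete for you, here is roughly the kind of thing I mean, as an illustrative sketch only -- made-up names and numbers, not the actual wrapper code. The first method is the old hand-coded style; the second shows the same decision with its parameters exposed so that genes on the chromosome can control them.)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Illustration only (made-up names): a hand-coded value reuse ("stickiness") policy
    // versus the same decision with its parameters exposed as genes.
    class StickinessSketch {
        static final Random rng = new Random();
        static final List<Integer> keyPool = new ArrayList<>();

        // Old, hand-coded style: the reuse probability, range and pool handling are fixed
        // in the wrapper code, and have to be re-tuned by hand for every unit.
        static int chooseKeyHandCoded() {
            if (!keyPool.isEmpty() && rng.nextDouble() < 0.75) {   // magic number, hand-tuned
                return keyPool.get(rng.nextInt(keyPool.size()));   // reuse an old value
            }
            int fresh = rng.nextInt(100) - 50;                     // hand-picked range
            keyPool.add(fresh);
            return fresh;
        }

        // Gene-controlled style: the same decision, but the reuse probability, range and
        // pool cap come from the chromosome, so the GA can tune them per unit.
        static int chooseKeyFromGenes(double reuseProb, int lo, int hi, int poolCap) {
            if (!keyPool.isEmpty() && rng.nextDouble() < reuseProb) {
                return keyPool.get(rng.nextInt(keyPool.size()));
            }
            int fresh = lo + rng.nextInt(Math.max(1, hi - lo + 1));
            if (keyPool.size() < poolCap) keyPool.add(fresh);
            return fresh;
        }

        public static void main(String[] args) {
            for (int i = 0; i < 5; i++) {
                System.out.println(chooseKeyHandCoded() + " / " + chooseKeyFromGenes(0.9, 0, 5, 8));
            }
        }
    }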
The referees for earlier papers didn't like this hand-coding, because they said it seemed too difficult and clumsy. So with Nighthawk, I worked out a way of encoding a value reuse policy as a set of genes, and used a genetic algorithm to figure out which policy was best. Of course the idea of stickiness is important. But I can't compare [GA + stickiness] with [stickiness alone], because there's no way to separate the two without adopting some fixed stickiness policy, and stickiness policies can vary widely in effectiveness. Does that make it clearer? Should I put some of this in the paper?

> In Section 5.2, "Nighthawk runs faster than JPF..." The authors need to give statistics on the runtime cost rather than just vaguely giving out the conclusion. In addition, did the authors run these tools side by side on the same machine to measure the runtime cost or just use the published results in the literature? It seems the latter (otherwise, I would expect the authors to show details on configuring JPF or other tools; in addition, I don't think Randoop is publicly available). Then how can you be sure that the machines being used by other tools have the same configuration as the one used in this experiment?

True nuff. It's on a different platform. I forgot to insert verbiage about how this comparison might not be valid.

> In Figure 4, the authors should include the breakdown of time on the two phases: generic algorithm and random testing.

I wonder if he means the random testing performed after the winning chromosome is found. He might be assuming that it takes forever. Actually the time is insignificant -- I think 10 seconds at most. Should I do what he asks? (One more column will fit, though not 4 more.)

> In Section 5.2, why does using the restricted wrappers or full target classes make a big difference in the results?

This is explained... dunno what more he needs.

> In Section 5.2, saying and citing that 70-80% coverage is a reasonable goal is not convincing to show that the proposed tool is good enough. The data structure benchmarks are not that difficult to cover and the method arguments are mostly primitive types or primitive-type objects.

Incorrect. Spurious. He should give specific data that backs up his claim rather than just gainsaying what I say in my descriptions of the subject software.

> General real world classes are far more complicated with reasonable 70-80% coverage. But 70-80% coverage here is not necessarily good enough.

He's saying that real-world classes are more complex than the Java 1.5.0 java.util map and collection classes? Has he actually looked at them?

> In the experiment, the authors should compare the new approach with the authors' previous approach (simply random).

But there's no way of doing this without encoding specific value reuse policies, fixing parameter ranges and so on, for each and every unit. How can I do that? I think we can safely ignore this request.

> >>> Points in favour or against <<<
>
> Favor:
>
> An important problem is being addressed. The idea of combining genetic algorithm and random testing is very interesting.
>
> Against:
>
> The evaluation is not convincing enough; the experimental results may overfit only data structure subjects, and some other factors (like value reuse policies) may dominate in contributing to the effectiveness of the results instead of the proposed approach.
> =*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=--=*=

The "against" points are just supposition -- supposition that is motivated by unscientific prejudice in favour of Tonella's algorithm, not backed up by hard data, and ignoring the hard data that we give. We have shown that we do very well on the very same package that Tonella studies. The best that the reviewer can do is to dream up some vague reason why Tonella's approach might be more general than ours.

cheers --Jamie.