264,266c264,270
< system options. In order to optimize the system, we used data
< mining techniques to analyze which genes were the most useful.
< We also report on the results of this analysis and optimization.
---
> system options. We show that we can achieve the
> same coverage as previous studies did, and that
> we can achieve high coverage on real-world software units.
> Finally, we describe how we used data
> mining techniques to analyze which genes were the most useful,
> and used this information to optimize the system so that it
> ran more quickly with no loss of coverage.
376c380
< The system Nighthawk described in this paper significantly
---
> The Nighthawk system described in this paper significantly
389a394,405
> Designing a GA means making decisions about what features are
> worthy of modeling and mutating. For example, much of the effort
> on this project was a laborious trial-and-error process of
> trying different chromosomes. To simplify that process, we
> describe experiments here with automatic feature subset
> selection (FSS). First, we try a very large and elaborate set of
> chromosomes. Next, we filter that set using automatic feature
> subset selection. The filtered set ran over 40\% more quickly, with
> no loss of coverage. We therefore propose that automatic
> feature subset selection should be a routine part of the design
> of any large GA system.
>
402,406c418,424
< real-world units (the Java 1.5.0 Collection and Map classes)
< to determine the effects of different option settings on the
< basic algorithm.
< \item We describe how we optimized Nighthawk by systematically
< analyzing which genes have the greatest effect on the fitness
---
> real-world units (the Java 1.5.0 Collection and
> Map classes) to determine the effects of different option
> settings on the basic algorithm. We show that Nighthawk can
> achieve high coverage automatically on these units.
> \item We describe how we optimized Nighthawk by using FSS to
> systematically analyze which genes have the greatest effect on
> the fitness
408c426,429
< little effect.
---
> little effect. We show that the optimized system achieved
> the same good results in significantly less time, demonstrating
> the utility of augmenting GAs with automatic feature subset
> selection.
418c439,440
< of the empirical work in the paper.
---
> of the empirical work in the paper. The procedure and results
> of our optimization are in Section 8.
686,687c708,709
< techniques. For instance, Tonella \cite{tonella-issta04} uses
< a fitness function that specifically takes account of such
---
> techniques. For instance, Michael et al.\ \cite{michael-etal-ga-tcg}
> use fitness functions that specifically take account of such
722c744
< detection capability. The GA can of course be re-run to generate
---
> detection capability. The GA can be re-run to generate
873,874c895,896
< and {\it remove} were needed to create data structures via which
< code in some of the other methods was accessible.
---
> and {\it remove} were needed to create data structures through
> which code in some of the other methods was accessible.
955c977
< methods in $M$ plus the reinitializers of the types of $I_M$.
---
> methods in $M$ plus the reinitializers of the types in $I_M$.
1157,1158c1179
< whether the argument will be drawn from the value
< pools of that type} \\
---
> whether the argument will be of that type\vspace{1mm}} \\
1264c1285
< values are removed, again due to value reuse. A {\tt remove}
---
> keys will be removed, again due to value reuse. A {\tt remove}
1296c1317
< a manner identical to the exploratory study (Section
---
> a manner identical to that of the exploratory study (Section
1519,1520c1540,1541
< For BHeap and FibHeap, Nighthawk runs faster than JPF, but for
< the other two units it runs slower than both JPF and Randoop.
---
> For BHeap and FibHeap, Nighthawk runs more quickly than JPF, but for
> the other two units it runs more slowly than both JPF and Randoop.
1875c1896
< results in less time.
---
> quality of results in less time.
1915c1936
< where, 83\% (on average) of the measures in a domain could be
---
> where 83\% (on average) of the measures in a domain could be
1937c1958
< nearest neighbors for each class that is different from the
---
> nearest neighbors for each class that is different from that of the
1965c1986
< For each of the 16 collection and map classes from {\tt java.util},
---
> For each of the 16 Collection and Map classes from {\tt java.util},
1967c1988,1989
< yielded 800 observations of gene value and score.
---
> yielded 800 observations, each consisting of a gene value vector
> and the chromosome score.
1974c1996
< into three regions.
---
> into three regions:
2001,2004c2023,2026
< The following table shows the number of features that were {\it Selected} in
< our 19 examples, using different values for $\alpha$. Note that as
< $\alpha$ increases, we selected fewer and fewer features.
< \begin{tabular}{rl}
---
> \begin{figure}
>
> \begin{center}
> \begin{tabular}{|c|c|}
2006,2009d2027
< 0.5 & 439\\
< 0.6& 217 \\
< 0.7 & 112 \\
< 0.8 & 62 \\
2011c2029,2050
< \end{tabular}
---
> \hline
> 0.8 & 62 \\
> \hline
> 0.7 & 112 \\
> \hline
> 0.6 & 217 \\
> \hline
> 0.5 & 439\\
> \hline
> \end{tabular} \vspace{2mm} \\
> \end{center}
>
> \caption{
> Numbers of selected features for values of $\alpha$.
> }
> \label{selected-features-fig}
> \end{figure}
>
> Figure \ref{selected-features-fig} shows the number of features
> that were {\it Selected} in
> our 19 examples, using different values for $\alpha$. Note that as
> $\alpha$ increases, we selected fewer and fewer features.
2045a2085
> Merit analysis of Nighthawk gene types.
2130,2131c2170,2171
< in a system that ran substantially faster but achieved the same
< high coverage of the SUT.
---
> in a system that ran substantially more quickly but achieved the
> same high coverage of the SUT.
2160,2161c2200,2207
< Nighthawk is able to achieve high coverage of complex Java units.
< The code is available by writing to the first author.
---
> Nighthawk is able to achieve the same coverage as earlier
> studies, and high coverage of complex, real-world Java units,
> while retaining the most desirable feature of randomized
> testing: the ability to generate many new high-coverage test
> cases quickly.
> We have also shown that we were able to simplify the design
> of the GA system and improve its runtime using automatic
> feature subset selection.
2169c2213,2215
< efficiency.
---
> efficiency. We also wish to integrate a feature subset
> selection learner into the GA level of the Nighthawk algorithm
> for dynamic optimization of the GA.