49,50c49,50 < \documentclass[12pt,journal,draftcls,letterpaper,onecolumn]{IEEEtran} < %\documentclass[9.5pt,journal,final,finalsubmission,twocolumn]{IEEEtran} --- > %\documentclass[12pt,journal,draftcls,letterpaper,onecolumn]{IEEEtran} > \documentclass[9.5pt,journal,final,finalsubmission,twocolumn]{IEEEtran} 198,199c198,208 < < --- > \newcommand{\bi}{\begin{itemize}} > \newcommand{\ei}{\end{itemize}} > \newcommand{\bd}{\begin{description}} > \newcommand{\ed}{\end{description}} > \newcommand{\bbi}{\begin{itemize}} > \newcommand{\eei}{\end{itemize}} > \newcommand{\be}{\begin{enumerate}} > \newcommand{\ee}{\end{enumerate}} > \newcommand{\fig}[1]{Figure~\ref{fig:#1}} > \newcommand{\eq}[1]{Equation~\ref{eq:#1}} > \newcommand{\tion}[1]{\S\ref{sec:#1}} 203c212 < \title{Using a Genetic Algorithm to Control Randomized Unit Testing} --- > \title{Controlling Unit Testing With Genetic Algorithms} 256,257c265,266 < Randomized testing has been shown to be an effective method for < testing software units. However, the thoroughness of randomized --- > Randomized testing is an effective method for > testing software units. The thoroughness of randomized 260,264c269,272 < are called. In this paper, we describe a system which uses a < genetic algorithm to find parameters for randomized unit testing < that optimize test coverage. We compare our coverage results to < previous work, and report on case studies and experiments on < system options. We show that we can achieve the --- > are called. Hence, we describe Nighthawk, a system that uses a > genetic algorithm (GA) to find parameters for randomized unit testing > that optimize test coverage. > We show that Nighthawk achieves the 267,270c275,283 < Finally, we describe how we used data < mining techniques to analyze which genes were the most useful, < and used this information to optimize the system so that it < ran more quickly with no loss of coverage. --- > > Designing GAs is somewhat of a black art. We therefore use > a feature subset selection (FSS) > tool to assess the mutators inside a GA. Using that tool, > we can prune back 90\% of our mutators, while still achieving most > of the coverage found using all the mutators. Our pruned GA achieves very nearly > the same results as the full system, but in only one tenth of the time. > These pruning results suggest that FSS for mutator pruning could significantly > optimize meta-heuristic search-based software engineering tools. 275c288 < data mining. --- > feature subset selection. 308c321 < The thoroughness of randomized unit testing is highly dependent --- > The thoroughness of randomized unit testing is dependent 315c328 < which we refer to as the {\it value reuse policy}, is also --- > which we call the {\it value reuse policy}, is also 319,320c332,333 < In this paper, we describe Nighthawk, a system for generating < unit test data. The system can be viewed as consisting of two --- > This paper describes the Nighthawk unit test data generator. > Nighthawk has two 329d341 < 335a348,355 > This paper also discusses optimization methods for GA tools like > Nighthawk.
Using feature subset selection methods, we show that > we can prune many of Nighthawk's mutator operators without > compromising the coverage generated by the tool. The pruned Nighthawk > tool achieves very nearly the same coverage as full Nighthawk (90\%) > and does so ten times faster. Therefore, we recommend that > meta-heuristic > search-based SE tools should also routinely apply feature subset selection. 414,415c434,435 < \item We compare Nighthawk to other systems described in < previous research, showing that it can achieve the same --- > \item We compare Nighthawk to > prior research, showing that it can achieve the same 422,429c442,449 < \item We describe how we optimized Nighthawk by using FSS to < systematically analyze which genes have the greatest effect on < the fitness < of a chromosome, and eliminating genes that we found to have < little effect. We show that the optimized system achieved < the same good results in significantly less time, demonstrating < the utility of augmenting GAs with automatic feature subset < selection. --- > \item We demonstrate the value of feature subset selection (FSS) > for optimizing genetic algorithms. > Using FSS, > we can prune many of > Nighthawk's mutators while retaining most of the coverage (90\%). > Compared to the runtimes of full Nighthawk, > this coverage is achieved > ten times faster. 435,440c455,470 < could find useful parameter settings. In section 4, we describe < the design and use of Nighthawk. Section 5 contains our < comparison to previous work, and section 6 our case study; < section 7 contains a discussion of the threats to validity < of the empirical work in the paper. The procedure and results < of our optimization are in Section 8. --- > could find useful parameter settings. > %In section 4, we describe > %the design and use of Nighthawk. Section 5 contains our > %comparison to previous work, and section 6 our case study; > %section 7 contains a discussion of the threats to validity > %of the empirical work in the paper. Mutator pruning > %is described in > %Section 8. > % > In section 4, we describe > the design and use of Nighthawk. > Section 5 describes our case study and > section 6 contains a discussion of the threats to validity > of the empirical work in the paper. Mutator pruning > with feature subset selection is described in > Section 7. 525,527c555,559 < Finally, the aforementioned research by Pacheco < et al.\ \cite{pacheco-etal-icse2007} enhances the thoroughness < of randomized testing by doing a partial randomized --- > Also, Pacheco > et al.\ \cite{pacheco-etal-icse2007} show that randomized > testing can be > enhanced via > partial randomized 549,552c581,585 < Approaches to test data generation via symbolic execution have < existed as far back as 1976 < \cite{clarke-76-testdata,king-symbolic-1976}.
< Such approaches typically attempt to generate a thorough set of --- > Approaches to test data generation via symbolic execution > date back > to 1976 > \cite{clarke-76-testdata,king-symbolic-1976}; > such approaches typically attempt to generate a thorough set of 560,561c593,594 < Other source code analysis-based approaches have used such < methods as iterative relaxation of a set of constraints on --- > Other source code analysis tools have applied > iterative relaxation of a set of constraints on 573c606 < Some analysis-based approaches are limited in the range of --- > Some analysis-based approaches limit the range of 576c609 < usefully to conditions involving pointers. In addition, most --- > to conditions involving pointers. In addition, most 586,588c619,623 < \cite{holland-ga-book}. GAs typically represent a solution to a < problem as a ``chromosome'', with various aspects of the < solution to the problem represented as ``genes'' in the --- > \cite{holland-ga-book}. Candidate > solutions > are represented > as a ``chromosome'', with various solution options > represented as ``genes'' in the 595c630 < GAs have been found to be superior to --- > GAs can outperform 688,689c723,724 < %\includegraphics[width=2.5in]{myfigure.eps} < \includegraphics[width=3in]{xyPlot.eps} --- > %\includegraphics[width=2.5in]{myfigure.pdf} > \includegraphics[width=3in]{xyPlot.pdf} 705,706c740,741 < to GA white-box test data generation attempt to address this < problem by proposing fitness functions that detect ``how close'' --- > to GA white-box test data generation address this > problem with fitness functions that detect ``how close'' 719c754 < \includegraphics[width=3in]{lohiPlot.eps} --- > \includegraphics[width=3in]{lohiPlot.pdf} 735,736c770,772 < In all the approaches to GA-based test input generation that < we are aware of, each run of the GA results in a single --- > To the best of our knowledge, each run of > existing GA-based test input generators > results in a single 780c816 < in the source language. These complex tools are not often --- > in the source language. Such tools are not often 788c824 < the fact that the source code of many of the units had been --- > the source code of many of the units had been 791c827 < such as the ability to analyze multithreaded code --- > such as the ability to analyze multi-threaded code 799,921c835,957 < < \section{Exploratory Study} < \label{exploratory-section} < < To find out whether there was any merit in the idea of a < genetic-random system, we conducted an exploratory study. < In this section, we describe the prototype software we developed, < the design of the study and its results. < < \subsection{Software Developed} < < Using code from RUTE-J (see above) and the open-source genetic < algorithm package JDEAL \cite{jdeal}, we constructed a prototype < two-level genetic-random unit testing system that took Java < classes as its testing units. For each unit under test (UUT) with < $n$ methods to call, the GA level constructed a chromosome with < $2+n$ integer genes; these genes represented < the number of method calls to make in each < test case, the number of test cases to generate, and the < relative weights (calling frequencies) of the $n$ methods. All < other randomized testing parameters < were hard-coded in the test wrappers. < < The evaluation of the fitness of each chromosome $c$ proceeded as < follows. We got the random testing level to generate the number < of test cases of the length specified in $c$, using < the method weights specified in $c$.
We then < measured the number of coverage points covered using < Cobertura \cite{cobertura-website}, < which measures line coverage. If we had based the < fitness function {\it only} on coverage, however, then any chromosome < would have benefitted from having a larger number of < method calls and test cases, since every new method call has the < potential of covering more code. < We therefore built in a brake to prevent these values from < getting unfeasibly high. We calculated the fitness function < as: < \begin{center} < (number of coverage points covered) $*$ (coverage factor) \\ < $-$ (number of method calls performed overall) < \end{center} < We set the coverage factor to 1000, meaning that we were willing < to make 1000 more method calls (but not more) if that meant < covering one more coverage point. < < \subsection{Experiment Design} < < We chose as our subject programs three units taken from the Java < 1.4.2 edition of {\tt java.util}: {\tt BitSet}, {\tt HashMap} < and {\tt TreeMap}. These units were clearly in wide use, < and {\tt TreeMap} had been used as the basis of earlier < experiments by other researchers \cite{visser-etal-issta04}. < For each UUT, we wrote a < test wrapper class containing methods that called selected < target methods of the UUT (16 methods for BitSet, 8 for HashMap < and 9 for TreeMap). < Each wrapper contained a simple oracle for checking correctness. < We instrumented each UUT using Cobertura. < < We ran the two-level algorithm 30 times on each of the three < test wrappers, and recorded the amount of time taken and the < parameters in the final chromosome. < To test whether the weights in the chromosomes were < useful given the length and number of method calls, for each < final chromosome $c$ we created a variant chromosome $c'$ < with the same length and number of method calls but with all < weights equal. We then compared the coverage achieved by < $c$ and $c'$ on 30 paired trials. Full results from the < experiment are available in \cite{cli-msc}. < < \subsection{Results} < < We performed two statistical tests to evaluate whether the < system was converging on a reasonable solution. First, we < ordered the average weights discovered for each method in each < class, and performed a $t$ test < with Bonferroni correction between each pair of < adjacent columns. < We found that for the {\tt HashMap} and {\tt TreeMap} units, the < {\tt clear} method (which removes all data from the map) had a < statistically significantly lower weight than the other methods, < indicating that the algorithm was consistently converging on a < solution in which it had a lower weight. < This is because much of the code in these < units can be executed only when there is a large amount of data < in the container objects. Since the {\tt clear} method clears < out all the data, executing it infrequently ensured that < the objects would get large enough. < < We also found that for the {\tt TreeMap} unit, the {\tt remove} < and {\tt put} methods had a statistically significantly higher < weight than the other methods. This is explainable by the large < amount of complex code in these methods and the private methods < that they call; it takes more calls to < cover this code than it does for the simpler code of < the other methods. Another reason is that sequences of {\it put} < and {\it remove} were needed to create data structures through < which code in some of the other methods was accessible. 
< < The second statistical test we performed tested < whether the weights found by the GA were efficient. < For this, we used the 30 trials comparing the discovered < chromosome $c$ and the equal-weight variant $c'$. We found that < for all three units, the equal-weight chromosome covered less < code than the < original, to a statistically significant level (as measured by a < $t$ test with $\alpha=0.05$). This can be interpreted as < meaning that the GA was correctly choosing a good {\it combination} < of parameters. < < In the course of the experiment, we found a bug in the Java < 1.4.2 version of {\tt BitSet}: when a call to {\tt set()} is < performed on a range of bits of length 0, the unit could later < return an incorrect ``length'' for the {\tt BitSet}. < We found that a bug report for this bug < had already been submitted to Sun's bug database. It has been < corrected in the Java 1.5.0 version of the library. < < In summary, the experiment indicated that the two-level < algorithm was potentially useful, and was consistently < converging on similar solutions that were more optimal than < calling all methods equally often. < --- > % > %\section{Exploratory Study} > %\label{exploratory-section} > % > %To test the merit of a > %genetic-random system, we conducted an exploratory study. > %In this section, we describe the prototype software we developed, > %the design of the study and its results. > % > %\subsection{Software Developed} > % > %Using code from RUTE-J (see above) and the open-source genetic > %algorithm package JDEAL \cite{jdeal}, we constructed a prototype > %two-level genetic-random unit testing system that took Java > %classes as its testing units. For each unit under test (UUT) with > %$n$ methods to call, the GA level constructed a chromosome with > %$2+n$ integer genes; these genes represented > %the number of method calls to make in each > %test case, the number of test cases to generate, and the > %relative weights (calling frequencies) of the $n$ methods. All > %other randomized testing parameters > %were hard-coded in the test wrappers. > % > %The evaluation of the fitness of each chromosome $c$ proceeded as > %follows. We got the random testing level to generate the number > %of test cases of the length specified in $c$, using > %the method weights specified in $c$. We then > %measured the number of coverage points covered using > %Cobertura \cite{cobertura-website}, > %which measures line coverage. If we had based the > %fitness function {\it only} on coverage, however, then any chromosome > %would have benefitted from having a larger number of > %method calls and test cases, since every new method call has the > %potential of covering more code. > %We therefore built in a brake to prevent these values from > %getting unfeasibly high. We calculated the fitness function > %as: > %\begin{center} > %(number of coverage points covered) $*$ (coverage factor) \\ > % $-$ (number of method calls performed overall) > %\end{center} > %We set the coverage factor to 1000, meaning that we were willing > %to make 1000 more method calls (but not more) if that meant > %covering one more coverage point. > % > %\subsection{Experiment Design} > % > %For subject programs, we used three units from the Java > %1.4.2 edition of {\tt java.util}: {\tt BitSet}, {\tt HashMap} > %and {\tt TreeMap}. These units are widely used, > %and {\tt TreeMap} had been used in the > %experiments of other researchers \cite{visser-etal-issta04}. 
> %For each UUT, we wrote a > %test wrapper class containing methods that called selected > %target methods of the UUT (16 methods for BitSet, 8 for HashMap > %and 9 for TreeMap). > %Each wrapper contained a simple oracle for checking correctness. > %We instrumented each UUT using Cobertura. > % > %We ran the two-level algorithm 30 times on each of the three > %test wrappers, and recorded the amount of time taken and the > %parameters in the final chromosome. > %To test whether the weights in the chromosomes were > %useful given the length and number of method calls, for each > %final chromosome $c$ we created a variant chromosome $c'$ > %with the same length and number of method calls but with all > %weights equal. We then compared the coverage achieved by > %$c$ and $c'$ on 30 paired trials. Full results from the > %experiment are available in \cite{cli-msc}. > % > %\subsection{Results} > % > %We performed two statistical tests to evaluate whether the > %system was converging on a reasonable solution. First, we > %ordered the average weights discovered for each method in each > %class, and performed a $t$ test > %with Bonferroni correction between each pair of > %adjacent columns. > %We found that for the {\tt HashMap} and {\tt TreeMap} units, the > %{\tt clear} method (which removes all data from the map) had a > %statistically significantly lower weight than the other methods, > %indicating that the algorithm was consistently converging on a > %solution in which it had a lower weight. > %This is because much of the code in these > %units can be executed only when there is a large amount of data > %in the container objects. Since the {\tt clear} method clears > %out all the data, executing it infrequently ensured that > %the objects would get large enough. > % > %We also found that for the {\tt TreeMap} unit, the {\tt remove} > %and {\tt put} methods had a statistically significantly higher > %weight than the other methods. This is explainable by the large > %amount of complex code in these methods and the private methods > %that they call; it takes more calls to > %cover this code than it does for the simpler code of > %the other methods. Another reason is that sequences of {\it put} > %and {\it remove} were needed to create data structures through > %which code in some of the other methods was accessible. > % > %The second statistical test we performed tested > %whether the weights found by the GA were efficient. > %For this, we used the 30 trials comparing the discovered > %chromosome $c$ and the equal-weight variant $c'$. We found that > %for all three units, the equal-weight chromosome covered less > %code than the > %original, to a statistically significant level (as measured by a > %$t$ test with $\alpha=0.05$). This can be interpreted as > %meaning that the GA was correctly choosing a good {\it combination} > %of parameters. > % > %In the course of the experiment, we found a bug in the Java > %1.4.2 version of {\tt BitSet}: when a call to {\tt set()} is > %performed on a range of bits of length 0, the unit could later > %return an incorrect ``length'' for the {\tt BitSet}. > %We found that a bug report for this bug > %had already been submitted to Sun's bug database. It has been > %corrected in the Java 1.5.0 version of the library. > % > %In summary, the experiment indicated that the two-level > %algorithm was potentially useful, and was consistently > %converging on similar solutions that were more optimal than > %calling all methods equally often. 
> % 924,929c960,966 < The results of our exploratory study encouraged us to expand < the scope of the GA to include method parameter ranges, value < reuse policy and other randomized testing parameters. The < result was the Nighthawk system. < < In this section, we first outline the lower, randomized-testing, --- > Our early exploratory studies~\cite{andrews07} suggested > that > GAs for unit test generation should search > method parameter ranges, value > reuse policy and other randomized testing parameters. > This section describes Nighthawk's implementation of that search. > We first outline the lower, randomized-testing, 939,941c976,977 < Here we present a simplified description of the < algorithm that the lower, randomized-testing, level of Nighthawk < uses to construct and run a test case. The algorithm takes two --- > Nighthawk's lower level > constructs and runs one test case. The algorithm takes two 947,950c983,987 < < We refer to $M$ as the set of ``target methods''. We define the < set $I_M$ of {\it types of interest} corresponding to $M$ as the < union of the following sets of types\footnote{ --- > We say that > $M$ is the set of ``target methods'' and > $I_M$ is the set of {\it types of interest} corresponding to $M$. $I_M$ > is > the union of the following types\footnote{ 970,972c1007,1009 < The algorithm first chooses initial values for primitive type < pools, and then moves on to non-primitive type pools. We define < a constructor method to be an {\it initializer} if it has no --- > The algorithm chooses initial values for primitive type > pools, before considering non-primitive type pools. > A constructor method is an {\it initializer} if it has no 974c1011 < We define a constructor to be a {\it reinitializer} if it has no --- > A constructor is a {\it reinitializer} if it has no 977,978c1014,1015 < methods in $M$ plus the reinitializers of the types in $I_M$. < The callable methods are the ones that Nighthawk calls directly. --- > methods in $M$ plus the reinitializers of the types in $I_M$ > (Nighthawk calls these {\it callables} directly). 995a1033 > {\scriptsize 1016,1017c1054,1055 < \item Run algorithm {\sf tryRunMethod}$(m, c)$, and < add the call description returned to $k$. --- > \item Run {\sf tryRunMethod}$(m, c)$. Add > the returned call description to $k$. 1023c1061 < --- > } 1028a1067 > {\scriptsize 1049,1051c1088,1090 < \item If the method call threw an {\tt AssertionError}, < return a call description with a failure indication. < \item Otherwise, if the method call threw some other exception, --- > \item If the call throws {\tt AssertionError}, > return a call description indicating failure. > \item Otherwise, if the call throws another exception, 1053,1054c1092,1093 < \item Otherwise, if the method return type is not {\tt void}, < and the return value $ret$ is non-null: --- > \item Otherwise, if the method return is not {\tt void}, > and the return value $ret$ is non-null: 1056,1057c1095,1096 < \item Choose a type $t \in I_M$ which is a supertype of the < type of the return value. --- > \item Choose type $t \in I_M$ that is a supertype of the > type of the return value. 1060,1062c1099,1101 < type and $ret$ does not violate the bounds on < $p$, then choose an element of $p$ and < replace it by $ret$. --- > type and $ret$ does not violate the $p$ bounds, > then replace an element of $p$ with > $ret$. 1066c1105 < --- > }
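> To make the value reuse step of {\sf tryRunMethod} concrete, here is
> a minimal Java sketch of how a returned value might be recycled into
> an active value pool. The names ({\tt ValuePool}, {\tt maybeRecycle})
> are ours for illustration; they are not Nighthawk's actual API.
> \begin{verbatim}
> import java.util.List;
> import java.util.Random;
>
> // Sketch of an active value pool with bounds genes.
> class ValuePool {
>     final List<Object> values;         // current pool contents
>     final long lowerBound, upperBound; // bounds genes for this pool
>     final Random random = new Random();
>
>     ValuePool(List<Object> values, long lo, long hi) {
>         this.values = values; this.lowerBound = lo; this.upperBound = hi;
>     }
>
>     // Replace a randomly chosen element with the returned value,
>     // unless a numeric return value violates the pool's bounds.
>     void maybeRecycle(Object ret) {
>         if (ret instanceof Number) {
>             long v = ((Number) ret).longValue();
>             if (v < lowerBound || v > upperBound) return;
>         }
>         values.set(random.nextInt(values.size()), ret);
>     }
> }
> \end{verbatim}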
1096c1135 < The receiver of a method call cannot be null, and no parameter --- > The receiver of a method call cannot be null. No parameter 1098,1099c1137,1138 < {\sf tryRunMethod} fails to find a non-null value when it is < looking for one, it reports failure of the {\it attempt} to call the --- > {\sf tryRunMethod} cannot find a non-null value, > it reports failure of the {\it attempt} to call the 1121,1122c1160,1161 < \begin{figure} < --- > \begin{figure*} > {\footnotesize 1164,1165c1203,1204 < & \parbox[t]{2in}{The lower and upper bounds on values in the pool; < initial values are drawn uniformly from this range\vspace{1mm}} \\ --- > & \parbox[t]{2in}{Lower and upper bounds on pool values; > initial values are drawn uniformly from this range} \\ 1178,1179c1217,1218 < & \parbox[t]{2in}{Each bit represents one candidate type, and signifies < whether the argument will be of that type\vspace{1mm}} \\ --- > & \parbox[t]{2in}{Each bit represents one candidate type, signifying > whether the argument will be of that type\vspace{1mm}} \\ 1191c1230 < --- > } 1194c1233 < \end{figure} --- > \end{figure*} 1208c1247 < \item A type is a candidate type for a receiver if it is a --- > \item A type is a {\it receiver candidate type} if it is a 1211c1250 < \item A type is a candidate type for a parameter if it is a --- > \item A type is a {\it parameter candidate type} if it is a 1214c1253 < \item A type is a candidate type for a return value if it is a --- > \item A type is a {\it return value candidate type} if it is a 1219,1220c1258,1259 < {\tt candidateBitSet} and {\tt valuePoolActivityBitSet} < essentially encode a value reuse policy by determining the --- > {\tt candidateBitSet} and {\tt valuePoolActivityBitSet} > encode value reuse policies by determining the 1261c1300 < cause dramatically different behaviour of the algorithm on the --- > cause dramatically different behavior of the algorithm on the 1265,1267c1304,1306 < If the chromosome specifies < that all three parameter values are to be taken from a value < pool of 65536 values in the range -32768 to 32767, then the --- > If the value pool for > all three parameter values contains > 65536 values in the range -32768 to 32767, then the 1273c1312 < dramatically due to reuse of previously-used values. The amount --- > dramatically due to reuse of previously-used values (the amount 1275c1314 < the UUT, but is probably nonzero. --- > the UUT, but is probably $>0$). 1316,1318c1355,1363 < The {\it fitness function} for a chromosome is calculated in < a manner identical to that of the exploratory study (Section < \ref{exploratory-section}). --- > The {\it fitness function} for a chromosome is: > \begin{center} > (number of coverage points covered) $*$ (coverage factor) \\ > $-$ (number of method calls performed overall) > \end{center} > We set the coverage factor to 1000, meaning that we were willing > to make 1000 more method calls (but not more) if that meant > covering one more coverage point. >
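> As a concrete rendering of this formula, consider the following
> minimal Java sketch (the class and method names are ours, for
> illustration only):
> \begin{verbatim}
> // Chromosome fitness: a coverage point is worth COVERAGE_FACTOR
> // method calls, so the GA will spend up to 1000 extra calls to
> // gain one more coverage point, but no more.
> final class Fitness {
>     static final long COVERAGE_FACTOR = 1000;
>
>     static long of(long coveragePointsCovered,
>                    long methodCallsPerformed) {
>         return coveragePointsCovered * COVERAGE_FACTOR
>                - methodCallsPerformed;
>     }
>
>     public static void main(String[] args) {
>         // e.g., 120 points covered using 50,000 calls scores 70,000
>         System.out.println(Fitness.of(120, 50_000));
>     }
> }
> \end{verbatim}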
1354c1399 < behaviour is to consider the command-line class names as a set --- > behavior is to consider the command-line class names as a set 1372,1373c1417,1418 < After finding the most fit chromosome, a test engineer can < perform the randomized testing that it specifies. --- > After finding the most fit chromosome, test engineers > apply the specified randomized testing. 1394c1439 < A test wrapper for class X is a class that contains one private --- > A test wrapper for class X is a class with one private 1397c1442 < class X. Each wrapper method simply passes the call on to the --- > class X. Each wrapper method passes calls to the 1400,1401c1445,1446 < To customize a test wrapper for precondition checking, the user < can insert a check in the wrapper method before the target --- > To improve test wrapper precondition checking, users can add > checks in a wrapper method before the target 1403c1448 < method can simply return. --- > method just returns. 1409c1454 < %that it throws an {\tt AssertionError} (signalling test case failure) --- > %that it throws an {\tt AssertionError} (signaling test case failure) 1411c1456 < We provide switches to the test wrapper generation program that --- > We offer switches to the test wrapper generation program that 1419,1422c1464,1466 < To customize a test wrapper for coverage enhancement, < the user can insert extra methods that cause extra code to be < executed. We provide two switches < for commonly-desired enhancements. The switch \verb/--checkTypedEquals/ --- > To improve test wrapper coverage, > users may add methods to execute extra code. > The switch \verb/--checkTypedEquals/ 1425,1426c1469,1470 < of the wrapped object. This is distinct from the normal wrapper < method that calls {\tt equals}, which has an argument of type --- > of the wrapped object. This differs from the normal wrapper > method that calls {\tt equals}, which has an argument of type 1429,1430c1473,1474 < For classes X that implement their own {\tt equals} method, the < typed-equals method is likely to execute more code. --- > For classes that implement their own {\tt equals} method, the > typed-equals method likely executes more code.
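> As an illustration (a hand-written sketch, not the code that
> Nighthawk actually generates), a customized wrapper for a toy
> class {\tt X} might look like:
> \begin{verbatim}
> // A toy unit under test, standing in for an arbitrary class X.
> class X {
>     private int top = Integer.MIN_VALUE;
>     void push(int i) { top = i; }
>     @Override public boolean equals(Object o) {
>         return (o instanceof X) && ((X) o).top == top;
>     }
>     @Override public int hashCode() { return top; }
> }
>
> class XWrapper {
>     private final X wrapped = new X();  // the wrapped object
>
>     // Wrapper method with a user-added precondition check: if the
>     // precondition fails, the wrapper method just returns.
>     void push(int i) {
>         if (i < 0) return;
>         wrapped.push(i);
>     }
>
>     // Typed-equals method (cf. --checkTypedEquals): the argument
>     // has declared type X, so X's own equals code is exercised.
>     boolean typedEquals(X other) {
>         return other != null && wrapped.equals(other);
>     }
> }
> \end{verbatim}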
1449,1452c1493,1497 < To compare the results of our genetic-random approach with those < of the purely genetic approach of < Michael et al.\ \cite{michael-etal-ga-tcg}, we adapted their < published C code for the Triangle program to Java, transforming --- > To compare our genetic-random method against > the purely genetic approach of > Michael et al.\ \cite{michael-etal-ga-tcg}, > we ported their > C code for the Triangle program to Java, transforming 1455c1500 < Cobertura. We then ran Nighthawk 10 separate times on the --- > Cobertura. Nighthawk was then run 10 times on the 1492c1537 < To compare our results with those of the model-checking approach --- > To compare our method against the model-checking approach 1496,1498c1541,1543 < we downloaded the four data structure units used in those < studies. The units had been hand-instrumented to record < coverage of the deepest basic blocks in the code. --- > we downloaded the data structure units used in those > studies (these had been hand-instrumented to record > coverage of the deepest basic blocks in the code). 1501c1546 < the methods called by the previous researchers. We ran --- > the methods used by previous researchers. We ran 1508c1553 < --- > {\footnotesize 1527c1572 < --- > } 1546,1548c1591,1594 < methods used. It may also be in part due to < the fact that our experiments were run on a different < machine architecture than that of Pacheco et al. --- > methods used (it may also be in part due to > our use of a > different > machine architecture than Pacheco et al.). 1554,1555c1600,1601 < and on the full target classes. When using the full target < classes, Nighthawk was able to cover significantly more lines of --- > and on the full target classes. Using the full target > classes, Nighthawk could cover significantly more lines of 1575a1622 > {\footnotesize 1593c1640 < --- > } 1615c1662 < The results are in Figure \ref{jpf-mcc-results-fig}. We list --- > In Figure \ref{jpf-mcc-results-fig}, we list 1619c1666 < We also list the number of combinations covered by Nighthawk, --- > We also show the combinations covered by Nighthawk, 1629,1630c1676,1677 < In summary, the comparison suggests that < Nighthawk was achieving good coverage with respect to the --- > In summary, our comparisons show > Nighthawk achieving good coverage with respect to the 1662a1710 > {\footnotesize 1706c1754 < --- > } 1714a1763 > {\footnotesize 1756c1805 < --- > } 1765,1766c1814 < Figure \ref{collection-map-cov} shows the results of this study. < The column labelled SLOC shows the total number of source lines --- > The column labeled SLOC in Figure \ref{collection-map-cov} shows the total number of source lines 1786,1788c1834,1836 < in the table < indicate that there are statistically significant differences < between every pair except the (EN, PD) pair. --- > in the table show that there > are statistically significant differences > between every pair except (EN, PD). 1791,1792c1839,1840 < class because < the main constructor to {\tt EnumMap} expects --- > class. > The main constructor to {\tt EnumMap} expects 1794c1842 < Nighthawk had no facility for supplying such a type, and so --- > Nighthawk cannot supply such a type, so 1834,1836d1881 < Here we discuss the threats to validity of the empirical results < in this paper. < 1863c1908 < Time measurement is also a threat to construct validity. --- > Also, time measurement is a construct validity threat. 1865,2045c1910,1911 < gives total wall clock time rather than CPU time. This may give < run time numbers that do not reflect the actual cost to the user < of the testing. < < \section{Data Mining-Based Optimization} < < It is recognized that the design of genetic algorithms is a < ``black art'' \cite{ga-blackart}, and that very little is known < about why GAs work when they do work and why they do not work < when they do not. In the initial design of Nighthawk, we got < the GA to control many aspects of the randomized testing < algorithm, as reflected by the ten gene types listed in Figure < \ref{gene-types-fig}. This was the result of our expectation < that any of them might turn out to be crucial to the success < of the system, and our inability to predict which. < < However, for a genetic algorithm, every gene that is not useful < results in time wasted: time is spent mutating genes that do < not lead to better solutions, and time is also spent extracting < values from genes instead of simply using constant values. < This effect is especially pronounced for Nighthawk, since some < genes control code inside the innermost loops of the randomized < testing algorithm. We were therefore motivated to find whether < we could speed up Nighthawk by eliminating gene types that did < not lead to better performance. < < In this section, we describe how we used feature subset < selection (FSS) to analyze data from Nighthawk runs, in order to < systematically examine the usefulness of each gene type. < We also describe how we optimized the system by eliminating the < least useful genes, resulting in a system that achieved the same < quality of results in less time. < < \subsection{Feature Subset Selection} < < A repeated result in the data mining community is that simpler < models with equivalent or higher performance can be built via < {\em feature subset selection}, algorithms that intelligently < prune useless features \cite{hall03}.
Features may be pruned for < several reasons: < \begin{itemize} < \item They may be noisy, i.e.\ contain spurious signals unrelated < to the target class; < \item They may be uninformative, e.g.\ contain mostly one value, < or no repeating values; < \item They may be correlated to other variables, in which case < they can be pruned since their signal is also present in other < variables. < \end{itemize} < < The reduced feature set has many advantages: < \begin{itemize} < \item Miller has shown that models generally containing fewer < variables have less variance in their outputs \cite{miller02}. < \item The smaller the model, the fewer are the demands on < interfaces (sensors and actuators) to the external environment. < Hence, systems designed around small models are easier to use < (less to do) and cheaper to build. < \item In terms of this article, the most important aspect of < learning from a reduced features set is that it produces < smaller models. Such smaller models are easier to explain (or audit). < \end{itemize} < < The literature lists many feature subset selectors. < In the Wrapper method, a {\em target learner} is augmented with a < pre-processor that used a heuristic search to grow subsets of the < available features. At each step in the growth, the target learner < is called to find the accuracy of the model learned from the < current subset. Subset growth is stopped when the addition of new < features did not improve the accuracy. < Kohavi and John \cite{kohavi97} report experiments with Wrapper < where 83\% (on average) of the measures in a domain could be < ignored with only a minimal loss of accuracy. < < The advantage of the Wrapper approach is that, if some target < learner is already implemented, then the Wrapper is simple to < implement. The disadvantage of the wrapper method is that each < step in the heuristic search requires another call to the target < learner. Since there are many steps in such a search ($N$ < features have $2^N$ subsets), Wrappers may be too slow. < < Another feature subset selector is Relief. < This is an instance based learning scheme \cite{Kir92,Kon94} < that works by randomly sampling one instance within the data. It < then locates the nearest neighbors for that instance from not < only the same class but the opposite class as well. The values < of the nearest neighbor features are then compared to that of < the sampled instance and the feature scores are maintained and < updated based on this. This process is specified for some < user-specified $M$ number of instances. Relief can handle noisy < data and other data anomalies by averaging the values for $K$ < nearest neighbors of the same and opposite class for each < instance \cite{Kon94}. For data sets with multiple classes, the < nearest neighbors for each class that is different from that of the < current sampled instance are selected and the contributions are < determined by using the class probabilities of the class in the dataset. < < The experiments of Hall and Holmes \cite{hall03} reject numerous < feature subset selection methods. Wrapper is their preferred < option, but only for small data sets. For larger data sets, the < stochastic nature of Relief makes it a natural choice. < < Another reason to prefer Relief is that it can take advantage < of Nighthawk's stochastic search. Currently, Relief is a < batch process that is executed after data generation is < completed. 
However, it may not be necessary to wait till the end < of data generation to gain insights into which features are most < relevant. Given the stochastic nature of the algorithm, we can < see feature work where Relief and a genetic algorithm work in < tandem. Consider the random selection process at the core of < Relief: a genetic algorithm exploring mutations of the current < set of parents is an excellent stochastic source of data. In < the future, we plan to apply Relief incrementally and in < parallel to our GAs. < < \subsection{Data Collection and Analysis} < < We instrumented Nighthawk so that every time a chromosome was < evaluated, it printed the current value of every gene and the < final fitness function score. (For the two BitSet gene types, < we printed only the cardinality of the set.) < For each of the 16 Collection and Map classes from {\tt java.util}, < we ran Nighthawk for 40 generations. Each class therefore < yielded 800 observations, each consisting of a gene value vector < and the chromosome score. < < %Relief assumes continuous data and Nighthawk's performance < Relief assumes discrete data and Nighthawk's performance < scores are continuous. To enable feature subset selection, we < first discretized Nighthawk's output. A repeated pattern < across all our experiments is that the Nighthawk scores fall < into three regions: < \begin{itemize} < \item The 65\% majority of the scores are within 30\% of the top < score for any experiment. We call this the {\em high plateau}. < \item A 10\% minority of scores are less than 10\% of the maximum < score. We call this region {\em the hole}. < \item After the high plateau there is a {\em slope} of < increasing gradient that falls into {\em the hole}. This {\em < slope} accounds for 25\% of the results. < \end{itemize} < Accordingly, to select our features, we sorted the results and < divided them into three classes: bottom 10\%, next 25\%, remaining < 65\%. Relief then found the subset of features that distinguished < these three regions. < < Using the above discretization policy, we ran Relief 10 times in a 10-way < cross-validation study. < The data set was divided < into $10$ buckets. Each bucket was (temporarily) removed and Relief was < run on the remaining data. This produced a list of ``merit'' figures for < each feature (this ``merit'' value is an internal heuristic measure < generated from Relief, and reflects the difference ratio of neighboring < instances that have different regions). We took the maximum merit ($Max$) < and generated a {\it Selected} list. A feature was {\it Selected} if its merit < $m$ was $m> \alpha Max$ (e.g. at $\alpha=0.5$ then we selected all < features that score at least half the maximum merit). < < \begin{figure} < < \begin{center} < \begin{tabular}{|c|c|} < $\alpha$ & $| {\it Selected} | $\\\hline < 0.9 & 39 \\ < \hline < 0.8 & 62 \\ < \hline < 0.7 & 112 \\ < \hline < 0.6 & 217 \\ < \hline < 0.5 & 439\\ < \hline < \end{tabular} \vspace{2mm} \\ < \end{center} < < \caption{ < Numbers of selected features for values of $\alpha$. < } < \label{selected-features-fig} < \end{figure} --- > reports total wall clock time, not CPU time. This may show > run times that do not reflect the testing cost to a real user. 2047,2190c1913,1914 < Figure \ref{selected-features-fig} shows the number of features < that were {\it Selected} in < our 19 examples, using different values for $\alpha$. Note that as < $\alpha$ increases, we selected fewer and fewer features. 
< That is, this method allowed us to discover the genes that were most < important in selecting < for the {\em high plateau} and not the {\em slope} or {\em hole}. < < \begin{figure} < < \begin{center} < \begin{tabular}{|l|l|l|l|l|} < Gene type & $\alpha=0.9$ & $\alpha=0.5$ & Best avg ranks & Best merit \\ < \hline < {\tt candidateBitSet} & 18 & 106 & 1, 1, 1 & 0.373 \\ < \hline < {\tt chanceOfNull} & 0 & 3 & 24.1, 24.3, 25.4 & 0.166 \\ < \hline < {\tt chanceOfTrue} & 2 & 7 & 1, 3.9, 5.7 & 0.150 \\ < \hline < {\tt lowerBound} & 1 & 9 & 4.1, 5.6, 7.2 & 0.129 \\ < \hline < {\tt methodWeight} & 0 & 11 & 7.5, 7.8, 10.6 & 0.144 \\ < \hline < {\tt numberOfCalls} & 0 & 1 & 7.2, 8.4, 16.7 & 0.195 \\ < \hline < {\tt numberOfValuePools} & 0 & 6 & 14.1, 17.9, 20 & 0.186 \\ < \hline < {\tt numberOfValues} & 0 & 28 & 5.2, 5.2, 7.4 & 0.160 \\ < \hline < {\tt upperBound} & 8 & 16 & 1, 1, 1 & 0.267 \\ < \hline < {\tt valuePoolActivityBitSet} & 10 & 252 & 1, 1.4, 2 & 0.267 \\ < \hline < \end{tabular} < \end{center} < < \caption{ < Merit analysis of Nighthawk gene types. < } < \label{genes-merit-fig} < \end{figure} < < Since our goal was to identify gene types that were not useful and < that could therefore possibly be eliminated from chromosomes, we < tabulated the properties of each of the different types of genes. < Figure \ref{genes-merit-fig} shows the result. The three most < useful types of genes for the software under test were clearly < {\tt candidateBitSet}, < {\tt valuePoolActivityBitSet}, and < {\tt upperBound}, < since genes of these types sometimes had merit 90\% or more of < the maximum merit, often were ranked first or second in merit, < and were the only gene types such that some genes of that type < had merit greater than $0.2$ on some runs. This result confirms < our intuition that changing the value reuse policy encoded in < the genes of type < {\tt candidateBitSet} and < {\tt valuePoolActivityBitSet} < is an important way of improving the performance of a Nighthawk < chromosome. < The good performance of {\tt upperBound} is attributable to the < fact that for all the hash table-related units, adjusting this < parameter caused a ``load factor'' parameter in some of the < constructors to be able to take on values that led to the table < being reallocated frequently. < < Just as clearly, the two least useful gene types for the < software under test were < {\tt chanceOfNull} and < {\tt numberOfValuePools}, since neither were ranked better than < than 14th in merit for any run. < This does not indicate that null values and multiple value pools < for each type were not important, only that changing the default < values of these genes (3\% chance of choosing null, and two < value pools) did not usually result in improved performance. < < \subsection{Optimization and Performance Comparison} < < In order to see whether our analysis was useful, we optimized < the Nighthawk code. < We decided to retain the four gene types with < the highest maximum merit < ({\tt candidateBitSet}, < {\tt valuePoolActivityBitSet}, < {\tt upperBound}, and {\tt numberOfCalls}), < and also two other gene types: < {\tt lowerBound}, because it was paired with < {\tt upperBound}; and < {\tt numberOfValues}, because it had the highest maximum < merit of the remaining well-ranked gene types. < < Accordingly, we adjusted Nighthawk's code to delete all < references to the other four gene types < ({\tt chanceOfNull}, < {\tt numberOfValuePools}, < {\tt chanceOfTrue}, and < {\tt methodWeight}). 
We changed the code that depended on the < values stored for such genes in the current chromosome, so that < it instead used the default initial values for those gene types. < < On each of the 16 Collection and Map classes, < we performed one run of the original Nighthawk, < and one run of the new, optimized version. < For each version of Nighthawk, we then < performed the same analysis that is reflected in Figures < \ref{collection-map-cov} and < \ref{collection-times}; < that is, we asked RunChromosome to run 10 test cases with the < winning chromosome and measured the coverage, and we calculated < the clock time taken by Nighthawk to achieve the highest < coverage it achieved. We then compared the original and < optimized Nighthawk. < < The chromosomes resulting from the optimized Nighthawk achieved < slightly higher coverage than those from the original, though a < $t$ test concluded that the difference was not statistically < significant ($p=0.058$). However, the optimized Nighthawk was < faster to a statistically significant degree ($p=0.010$). The < time ratio between the original and optimized systems < was 1.46, i.e.\ the < optimized system was 46\% faster. We can therefore say that < the process of analyzing the usefulness of the genes resulted < in a system that ran substantially more quickly but achieved the < same high coverage of the SUT. < < \subsection{Discussion} < < The feature subset selection and optimization procedure worked < well for us when we collected data from a set of classes and < then optimized Nighthawk for those classes. This does not mean < that the optimized Nighthawk would necessarily work < well for other classes. For instance, for some classes there < might be a great deal of code that is accessible only by giving < null pointers as parameters; we could expect our newly optimized < version of Nighthawk to perform poorly on such classes, since < there would be a fixed 3\% chance of choosing null. < < However, the results of the optimization exercise indicate that < FSS-based analysis is a valuable adjunct to the GA in this < application. It is for this reason that we envision in the < future integrating FSS into a genetic-random testing algorithm, < so that the algorithm can improve its performance dynamically, < while GA mutation and recombination is going on. --- > %\input{fssold} > \input{fssnew} 2205,2207c1929,1940 < We have also shown that we were able to simplify the design < of the GA system and improve its runtime using automatic < feature subset selection. --- > > We have also shown that we were able to optimize and > simplify > meta-heuristic search tools. > Meta-heuristic tools (such as genetic algorithms and simulated > annealers) typically mutate some aspect of a candidate solution > and evaluate the results. If the effect of mutating each aspect > is recorded, then each aspect can be considered a feature and > is amenable to the FSS processing described here. > In this way, > FSS can be used to automatically find and remove > superfluous parts of the search control.
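> As a minimal sketch of the recording step this requires (names are
> illustrative, not taken from any particular tool), each evaluated
> candidate can be logged as one feature row for later FSS analysis:
> \begin{verbatim}
> import java.io.FileWriter;
> import java.io.IOException;
>
> // Log each mutated candidate's gene values together with its
> // fitness, so that a feature subset selector (e.g., Relief) can
> // later rank which mutated aspects actually matter.
> final class MutationLog implements AutoCloseable {
>     private final FileWriter out;
>
>     MutationLog(String path) throws IOException {
>         out = new FileWriter(path);
>     }
>
>     // One row per candidate: features = gene values,
>     // class label = the candidate's fitness score.
>     void record(long[] geneValues, long fitness) throws IOException {
>         StringBuilder row = new StringBuilder();
>         for (long g : geneValues) row.append(g).append(',');
>         out.write(row.append(fitness).append('\n').toString());
>     }
>
>     @Override public void close() throws IOException { out.close(); }
> }
> \end{verbatim}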
2215a1949,1956 > Further, we can see a promising line of research where the cost/benefits > of a particular meta-heuristic are tuned to the particulars of a specific problem. > Here, we have shown that if we surrender one tenth of the coverage, we can > run Nighthawk ten times faster. While this is an acceptable trade-off in many domains, > it may be unacceptable in > safety-critical applications. > More work is required to understand how to best match meta-heuristics (with or without FSS) > to particular problem domains. 2316c2057 < % use section* for acknowledgement --- > % use section* for acknowledgment