49,50c49,50 < \documentclass[12pt,journal,draftcls,letterpaper,onecolumn]{IEEEtran} < %\documentclass[9.5pt,journal,final,finalsubmission,twocolumn]{IEEEtran} --- > %\documentclass[12pt,journal,draftcls,letterpaper,onecolumn]{IEEEtran} > \documentclass[9.5pt,journal,final,finalsubmission,twocolumn]{IEEEtran} 198,199c198,208 < < --- > \newcommand{\bi}{\begin{itemize}} > \newcommand{\ei}{\end{itemize}} > \newcommand{\bd}{\begin{description}} > \newcommand{\ed}{\end{description}} > \newcommand{\bbi}{\begin{itemize}} > \newcommand{\eei}{\end{itemize}} > \newcommand{\be}{\begin{enumerate}} > \newcommand{\ee}{\end{enumerate}} > \newcommand{\fig}[1]{Figure~\ref{fig:#1}} > \newcommand{\eq}[1]{Equation~\ref{eq:#1}} > \newcommand{\tion}[1]{\S\ref{sec:#1}} 203c212 < \title{Using a Genetic Algorithm to Control Randomized Unit Testing} --- > \title{Controlling Unit Testing With Genetic Algorithms} 256,257c265,266 < Randomized testing has been shown to be an effective method for < testing software units. However, the thoroughness of randomized --- > Randomized testing is an effective method for > testing software units. The thoroughness of randomized 260,264c269,272 < are called. In this paper, we describe a system which uses a < genetic algorithm to find parameters for randomized unit testing < that optimize test coverage. We compare our coverage results to < previous work, and report on case studies and experiments on < system options. We show that we can achieve the --- > are called. Hence, we describe Nighthawk, a system that uses a > genetic algorithm (GA) to find parameters for randomized unit testing > that optimize test coverage. > We show that Nighthawk achieves the 267,270c275,283 < Finally, we describe how we used data < mining techniques to analyze which genes were the most useful, < and used this information to optimize the system so that it < ran more quickly with no loss of coverage. --- > > Designing GAs is somewhat of a black art. We therefore use > a feature subset selection (FSS) > tool to assess the mutators inside a GA. Using that tool, > we can prune back 90\% of our mutators, while still achieving most > of the coverage found using all the mutators. Our pruned GA achieves very nearly > the same results as the full system, but in only one tenth of the time. > These pruning results suggest that FSS for mutator pruning could significantly > optimize meta-heuristic search-based software engineering tools. 275c288 < data mining. --- > feature subset selection. 308c321 < The thoroughness of randomized unit testing is highly dependent --- > The thoroughness of randomized unit testing is dependent 315c328 < which we refer to as the {\it value reuse policy}, is also --- > which we call the {\it value reuse policy}, is also 319,320c332,333 < In this paper, we describe Nighthawk, a system for generating < unit test data. The system can be viewed as consisting of two --- > This paper describes the Nighthawk unit test data generator. > Nighthawk has two 329d341 < 335a348,355 > This paper also discusses optimization methods for GA tools like > Nighthawk.
Using feature subset selection methods, we show that > we can prune many of Nighthawk's mutator operators without > compromising the coverage generated by the tool. The pruned Nighthawk > tool achieves very nearly the same coverage as full Nighthawk (90\%) > and does so ten times faster. Therefore, we recommend that > meta-heuristic > search-based SE tools should also routinely apply feature subset selection. 414,415c434,435 < \item We compare Nighthawk to other systems described in < previous research, showing that it can achieve the same --- > \item We compare Nighthawk to > prior research, showing that it can achieve the same 422,429c442,449 < \item We describe how we optimized Nighthawk by using FSS to < systematically analyze which genes have the greatest effect on < the fitness < of a chromosome, and eliminating genes that we found to have < little effect. We show that the optimized system achieved < the same good results in significantly less time, demonstrating < the utility of augmenting GAs with automatic feature subset < selection. --- > \item We demonstrate the value of feature subset selection (FSS) > for optimizing genetic algorithms. > Using FSS, > we can prune many of > Nighthawk's mutators while retaining most of the coverage (90\%). > Compared to the runtimes of full Nighthawk, > this coverage is achieved > ten times faster. 435,440c455,470 < could find useful parameter settings. In section 4, we describe < the design and use of Nighthawk. Section 5 contains our < comparison to previous work, and section 6 our case study; < section 7 contains a discussion of the threats to validity < of the empirical work in the paper. The procedure and results < of our optimization are in Section 8. --- > could find useful parameter settings. > %In section 4, we describe > %the design and use of Nighthawk. Section 5 contains our > %comparison to previous work, and section 6 our case study; > %section 7 contains a discussion of the threats to validity > %of the empirical work in the paper. Mutator pruning > %is described in > %Section 8. > % > In section 4, we describe > the design and use of Nighthawk. > Section 5 describes our case study and > section 6 contains a discussion of the threats to validity > of the empirical work in the paper. Mutator pruning > with feature subset selection is described in > Section 7. 525,527c555,559 < Finally, the aforementioned research by Pacheco < et al.\ \cite{pacheco-etal-icse2007} enhances the thoroughness < of randomized testing by doing a partial randomized --- > Also, Pacheco > et al.\ \cite{pacheco-etal-icse2007} show that randomized > testing can be > enhanced via > partial randomized 549,552c581,585 < Approaches to test data generation via symbolic execution have < existed as far back as 1976 < \cite{clarke-76-testdata,king-symbolic-1976}.
< Such approaches typically attempt to generate a thorough set of --- > Approaches to test data generation via symbolic execution > date back > to 1976 > \cite{clarke-76-testdata,king-symbolic-1976}; > such approaches typically attempt to generate a thorough set of 560,561c593,594 < Other source code analysis-based approaches have used such < methods as iterative relaxation of a set of constraints on --- > Other source code analysis tools have applied > iterative relaxation of a set of constraints on 573c606 < Some analysis-based approaches are limited in the range of --- > Some analysis-based approaches limit the range of 576c609 < usefully to conditions involving pointers. In addition, most --- > to conditions involving pointers. In addition, most 586,588c619,623 < \cite{holland-ga-book}. GAs typically represent a solution to a < problem as a ``chromosome'', with various aspects of the < solution to the problem represented as ``genes'' in the --- > \cite{holland-ga-book}. Candidate > solutions > are represented > as a ``chromosome'', with various solution options > represented as ``genes'' in the 595c630 < GAs have been found to be superior to --- > GAs can outperform 688,689c723,724 < %\includegraphics[width=2.5in]{myfigure.eps} < \includegraphics[width=3in]{xyPlot.eps} --- > %\includegraphics[width=2.5in]{myfigure.pdf} > \includegraphics[width=3in]{xyPlot.pdf} 705,706c740,741 < to GA white-box test data generation attempt to address this < problem by proposing fitness functions that detect ``how close'' --- > to GA white-box test data generation address this > problem with fitness functions that detect ``how close'' 719c754 < \includegraphics[width=3in]{lohiPlot.eps} --- > \includegraphics[width=3in]{lohiPlot.pdf} 735,736c770,772 < In all the approaches to GA-based test input generation that < we are aware of, each run of the GA results in a single --- > To the best of our knowledge, each run of > existing GA-based test input generators > results in a single 780c816 < in the source language. These complex tools are not often --- > in the source language. Such tools are not often 788c824 < the fact that the source code of many of the units had been --- > the source code of many of the units had been 791c827 < such as the ability to analyze multithreaded code --- > such as the ability to analyze multi-threaded code 799,921c835,957 < < \section{Exploratory Study} < \label{exploratory-section} < < To find out whether there was any merit in the idea of a < genetic-random system, we conducted an exploratory study. < In this section, we describe the prototype software we developed, < the design of the study and its results. < < \subsection{Software Developed} < < Using code from RUTE-J (see above) and the open-source genetic < algorithm package JDEAL \cite{jdeal}, we constructed a prototype < two-level genetic-random unit testing system that took Java < classes as its testing units. For each unit under test (UUT) with < $n$ methods to call, the GA level constructed a chromosome with < $2+n$ integer genes; these genes represented < the number of method calls to make in each < test case, the number of test cases to generate, and the < relative weights (calling frequencies) of the $n$ methods. All < other randomized testing parameters < were hard-coded in the test wrappers. < < The evaluation of the fitness of each chromosome $c$ proceeded as < follows. We got the random testing level to generate the number < of test cases of the length specified in $c$, using < the method weights specified in $c$.
We then < measured the number of coverage points covered using < Cobertura \cite{cobertura-website}, < which measures line coverage. If we had based the < fitness function {\it only} on coverage, however, then any chromosome < would have benefitted from having a larger number of < method calls and test cases, since every new method call has the < potential of covering more code. < We therefore built in a brake to prevent these values from < getting unfeasibly high. We calculated the fitness function < as: < \begin{center} < (number of coverage points covered) $*$ (coverage factor) \\ < $-$ (number of method calls performed overall) < \end{center} < We set the coverage factor to 1000, meaning that we were willing < to make 1000 more method calls (but not more) if that meant < covering one more coverage point. < < \subsection{Experiment Design} < < We chose as our subject programs three units taken from the Java < 1.4.2 edition of {\tt java.util}: {\tt BitSet}, {\tt HashMap} < and {\tt TreeMap}. These units were clearly in wide use, < and {\tt TreeMap} had been used as the basis of earlier < experiments by other researchers \cite{visser-etal-issta04}. < For each UUT, we wrote a < test wrapper class containing methods that called selected < target methods of the UUT (16 methods for BitSet, 8 for HashMap < and 9 for TreeMap). < Each wrapper contained a simple oracle for checking correctness. < We instrumented each UUT using Cobertura. < < We ran the two-level algorithm 30 times on each of the three < test wrappers, and recorded the amount of time taken and the < parameters in the final chromosome. < To test whether the weights in the chromosomes were < useful given the length and number of method calls, for each < final chromosome $c$ we created a variant chromosome $c'$ < with the same length and number of method calls but with all < weights equal. We then compared the coverage achieved by < $c$ and $c'$ on 30 paired trials. Full results from the < experiment are available in \cite{cli-msc}. < < \subsection{Results} < < We performed two statistical tests to evaluate whether the < system was converging on a reasonable solution. First, we < ordered the average weights discovered for each method in each < class, and performed a $t$ test < with Bonferroni correction between each pair of < adjacent columns. < We found that for the {\tt HashMap} and {\tt TreeMap} units, the < {\tt clear} method (which removes all data from the map) had a < statistically significantly lower weight than the other methods, < indicating that the algorithm was consistently converging on a < solution in which it had a lower weight. < This is because much of the code in these < units can be executed only when there is a large amount of data < in the container objects. Since the {\tt clear} method clears < out all the data, executing it infrequently ensured that < the objects would get large enough. < < We also found that for the {\tt TreeMap} unit, the {\tt remove} < and {\tt put} methods had a statistically significantly higher < weight than the other methods. This is explainable by the large < amount of complex code in these methods and the private methods < that they call; it takes more calls to < cover this code than it does for the simpler code of < the other methods. Another reason is that sequences of {\it put} < and {\it remove} were needed to create data structures through < which code in some of the other methods was accessible. 
< < The second statistical test we performed tested < whether the weights found by the GA were efficient. < For this, we used the 30 trials comparing the discovered < chromosome $c$ and the equal-weight variant $c'$. We found that < for all three units, the equal-weight chromosome covered less < code than the < original, to a statistically significant level (as measured by a < $t$ test with $\alpha=0.05$). This can be interpreted as < meaning that the GA was correctly choosing a good {\it combination} < of parameters. < < In the course of the experiment, we found a bug in the Java < 1.4.2 version of {\tt BitSet}: when a call to {\tt set()} is < performed on a range of bits of length 0, the unit could later < return an incorrect ``length'' for the {\tt BitSet}. < We found that a bug report for this bug < had already been submitted to Sun's bug database. It has been < corrected in the Java 1.5.0 version of the library. < < In summary, the experiment indicated that the two-level < algorithm was potentially useful, and was consistently < converging on similar solutions that were more optimal than < calling all methods equally often. < --- > % > %\section{Exploratory Study} > %\label{exploratory-section} > % > %To test the merit of a > %genetic-random system, we conducted an exploratory study. > %In this section, we describe the prototype software we developed, > %the design of the study and its results. > % > %\subsection{Software Developed} > % > %Using code from RUTE-J (see above) and the open-source genetic > %algorithm package JDEAL \cite{jdeal}, we constructed a prototype > %two-level genetic-random unit testing system that took Java > %classes as its testing units. For each unit under test (UUT) with > %$n$ methods to call, the GA level constructed a chromosome with > %$2+n$ integer genes; these genes represented > %the number of method calls to make in each > %test case, the number of test cases to generate, and the > %relative weights (calling frequencies) of the $n$ methods. All > %other randomized testing parameters > %were hard-coded in the test wrappers. > % > %The evaluation of the fitness of each chromosome $c$ proceeded as > %follows. We got the random testing level to generate the number > %of test cases of the length specified in $c$, using > %the method weights specified in $c$. We then > %measured the number of coverage points covered using > %Cobertura \cite{cobertura-website}, > %which measures line coverage. If we had based the > %fitness function {\it only} on coverage, however, then any chromosome > %would have benefitted from having a larger number of > %method calls and test cases, since every new method call has the > %potential of covering more code. > %We therefore built in a brake to prevent these values from > %getting unfeasibly high. We calculated the fitness function > %as: > %\begin{center} > %(number of coverage points covered) $*$ (coverage factor) \\ > % $-$ (number of method calls performed overall) > %\end{center} > %We set the coverage factor to 1000, meaning that we were willing > %to make 1000 more method calls (but not more) if that meant > %covering one more coverage point. > % > %\subsection{Experiment Design} > % > %For subject programs, we used three units from the Java > %1.4.2 edition of {\tt java.util}: {\tt BitSet}, {\tt HashMap} > %and {\tt TreeMap}. These units are widely used, > %and {\tt TreeMap} had been used in the > %experiments of other researchers \cite{visser-etal-issta04}. 
> %For each UUT, we wrote a > %test wrapper class containing methods that called selected > %target methods of the UUT (16 methods for BitSet, 8 for HashMap > %and 9 for TreeMap). > %Each wrapper contained a simple oracle for checking correctness. > %We instrumented each UUT using Cobertura. > % > %We ran the two-level algorithm 30 times on each of the three > %test wrappers, and recorded the amount of time taken and the > %parameters in the final chromosome. > %To test whether the weights in the chromosomes were > %useful given the length and number of method calls, for each > %final chromosome $c$ we created a variant chromosome $c'$ > %with the same length and number of method calls but with all > %weights equal. We then compared the coverage achieved by > %$c$ and $c'$ on 30 paired trials. Full results from the > %experiment are available in \cite{cli-msc}. > % > %\subsection{Results} > % > %We performed two statistical tests to evaluate whether the > %system was converging on a reasonable solution. First, we > %ordered the average weights discovered for each method in each > %class, and performed a $t$ test > %with Bonferroni correction between each pair of > %adjacent columns. > %We found that for the {\tt HashMap} and {\tt TreeMap} units, the > %{\tt clear} method (which removes all data from the map) had a > %statistically significantly lower weight than the other methods, > %indicating that the algorithm was consistently converging on a > %solution in which it had a lower weight. > %This is because much of the code in these > %units can be executed only when there is a large amount of data > %in the container objects. Since the {\tt clear} method clears > %out all the data, executing it infrequently ensured that > %the objects would get large enough. > % > %We also found that for the {\tt TreeMap} unit, the {\tt remove} > %and {\tt put} methods had a statistically significantly higher > %weight than the other methods. This is explainable by the large > %amount of complex code in these methods and the private methods > %that they call; it takes more calls to > %cover this code than it does for the simpler code of > %the other methods. Another reason is that sequences of {\it put} > %and {\it remove} were needed to create data structures through > %which code in some of the other methods was accessible. > % > %The second statistical test we performed tested > %whether the weights found by the GA were efficient. > %For this, we used the 30 trials comparing the discovered > %chromosome $c$ and the equal-weight variant $c'$. We found that > %for all three units, the equal-weight chromosome covered less > %code than the > %original, to a statistically significant level (as measured by a > %$t$ test with $\alpha=0.05$). This can be interpreted as > %meaning that the GA was correctly choosing a good {\it combination} > %of parameters. > % > %In the course of the experiment, we found a bug in the Java > %1.4.2 version of {\tt BitSet}: when a call to {\tt set()} is > %performed on a range of bits of length 0, the unit could later > %return an incorrect ``length'' for the {\tt BitSet}. > %We found that a bug report for this bug > %had already been submitted to Sun's bug database. It has been > %corrected in the Java 1.5.0 version of the library. > % > %In summary, the experiment indicated that the two-level > %algorithm was potentially useful, and was consistently > %converging on similar solutions that were more optimal than > %calling all methods equally often. 
> % 924,929c960,966 < The results of our exploratory study encouraged us to expand < the scope of the GA to include method parameter ranges, value < reuse policy and other randomized testing parameters. The < result was the Nighthawk system. < < In this section, we first outline the lower, randomized-testing, --- > Our early exploratory studies~\cite{andrews07} suggested > that > GAs for unit test generation should search > method parameter ranges, value > reuse policy and other randomized testing parameters. > This section describes Nighthawk's implementation of that search. > We first outline the lower, randomized-testing, 939,941c976,977 < Here we present a simplified description of the < algorithm that the lower, randomized-testing, level of Nighthawk < uses to construct and run a test case. The algorithm takes two --- > Nighthawk's lower level > constructs and runs one test case. The algorithm takes two 947,950c983,987 < < We refer to $M$ as the set of ``target methods''. We define the < set $I_M$ of {\it types of interest} corresponding to $M$ as the < union of the following sets of types\footnote{ --- > We say that > $M$ is the set of ``target methods'' and > $I_M$ is the set of {\it types of interest} corresponding to $M$. $I_M$ > is > the union of the following types\footnote{ 970,972c1007,1009 < The algorithm first chooses initial values for primitive type < pools, and then moves on to non-primitive type pools. We define < a constructor method to be an {\it initializer} if it has no --- > The algorithm chooses initial values for primitive type > pools, before considering non-primitive type pools. > A constructor method is an {\it initializer} if it has no 974c1011 < We define a constructor to be a {\it reinitializer} if it has no --- > A constructor is a {\it reinitializer} if it has no 977,978c1014,1015 < methods in $M$ plus the reinitializers of the types in $I_M$. < The callable methods are the ones that Nighthawk calls directly. --- > methods in $M$ plus the reinitializers of the types in $I_M$ > (Nighthawk calls these {\it callables} directly). 995a1033 > {\scriptsize 1016,1017c1054,1055 < \item Run algorithm {\sf tryRunMethod}$(m, c)$, and < add the call description returned to $k$. --- > \item Run {\sf tryRunMethod}$(m, c)$. Add > the returned call description to $k$. 1023c1061 < --- > } 1028a1067 > {\scriptsize 1049,1051c1088,1090 < \item If the method call threw an {\tt AssertionError}, < return a call description with a failure indication. < \item Otherwise, if the method call threw some other exception, --- > \item If the call throws {\tt AssertionError}, > return a call description indicating failure. > \item Otherwise, if the call throws another exception, 1053,1054c1092,1093 < \item Otherwise, if the method return type is not {\tt void}, < and the return value $ret$ is non-null: --- > \item Otherwise, if the method return is not {\tt void}, > and the return value $ret$ is non-null: 1056,1057c1095,1096 < \item Choose a type $t \in I_M$ which is a supertype of the < type of the return value. --- > \item Choose type $t \in I_M$ that is a supertype of the > type of the return value. 1060,1062c1099,1101 < type and $ret$ does not violate the bounds on < $p$, then choose an element of $p$ and < replace it by $ret$. --- > type and $ret$ does not violate the $p$ bounds, > then replace an element of $p$ with > $ret$. 1066c1105 < --- > }
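> To make the value reuse step of {\sf tryRunMethod} concrete, here is
> a minimal Java sketch of how a returned value might be recycled into
> an active value pool. The names ({\tt ValuePool}, {\tt maybeRecycle})
> are ours for illustration; they are not Nighthawk's actual API.
> \begin{verbatim}
> import java.util.List;
> import java.util.Random;
>
> // Sketch of an active value pool with bounds genes.
> class ValuePool {
>     final List<Object> values;         // current pool contents
>     final long lowerBound, upperBound; // bounds genes for this pool
>     final Random random = new Random();
>
>     ValuePool(List<Object> values, long lo, long hi) {
>         this.values = values; this.lowerBound = lo; this.upperBound = hi;
>     }
>
>     // Replace a randomly chosen element with the returned value,
>     // unless a numeric return value violates the pool's bounds.
>     void maybeRecycle(Object ret) {
>         if (ret instanceof Number) {
>             long v = ((Number) ret).longValue();
>             if (v < lowerBound || v > upperBound) return;
>         }
>         values.set(random.nextInt(values.size()), ret);
>     }
> }
> \end{verbatim}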
1096c1135 < The receiver of a method call cannot be null, and no parameter --- > The receiver of a method call cannot be null. No parameter 1098,1099c1137,1138 < {\sf tryRunMethod} fails to find a non-null value when it is < looking for one, it reports failure of the {\it attempt} to call the --- > {\sf tryRunMethod} cannot find a non-null value, > it reports failure of the {\it attempt} to call the 1121,1122c1160,1161 < \begin{figure} < --- > \begin{figure*} > {\footnotesize 1164,1165c1203,1204 < & \parbox[t]{2in}{The lower and upper bounds on values in the pool; < initial values are drawn uniformly from this range\vspace{1mm}} \\ --- > & \parbox[t]{2in}{Lower and upper bounds on pool values; > initial values are drawn uniformly from this range} \\ 1178,1179c1217,1218 < & \parbox[t]{2in}{Each bit represents one candidate type, and signifies < whether the argument will be of that type\vspace{1mm}} \\ --- > & \parbox[t]{2in}{Each bit represents one candidate type, signifying > whether the argument will be of that type\vspace{1mm}} \\ 1191c1230 < --- > } 1194c1233 < \end{figure} --- > \end{figure*} 1208c1247 < \item A type is a candidate type for a receiver if it is a --- > \item A type is a {\it receiver candidate type} if it is a 1211c1250 < \item A type is a candidate type for a parameter if it is a --- > \item A type is a {\it parameter candidate type} if it is a 1214c1253 < \item A type is a candidate type for a return value if it is a --- > \item A type is a {\it return value candidate type} if it is a 1219,1220c1258,1259 < {\tt candidateBitSet} and {\tt valuePoolActivityBitSet} < essentially encode a value reuse policy by determining the --- > {\tt candidateBitSet} and {\tt valuePoolActivityBitSet} > encode value reuse policies by determining the 1261c1300 < cause dramatically different behaviour of the algorithm on the --- > cause dramatically different behavior of the algorithm on the 1265,1267c1304,1306 < If the chromosome specifies < that all three parameter values are to be taken from a value < pool of 65536 values in the range -32768 to 32767, then the --- > If the value pool for > all three parameter values contains > 65536 values in the range -32768 to 32767, then the 1273c1312 < dramatically due to reuse of previously-used values. The amount --- > dramatically due to reuse of previously-used values (the amount 1275c1314 < the UUT, but is probably nonzero. --- > the UUT, but is probably $>0$). 1316,1318c1355,1363 < The {\it fitness function} for a chromosome is calculated in < a manner identical to that of the exploratory study (Section < \ref{exploratory-section}). --- > The {\it fitness function} for a chromosome is: > \begin{center} > (number of coverage points covered) $*$ (coverage factor) \\ > $-$ (number of method calls performed overall) > \end{center} > We set the coverage factor to 1000, meaning that we were willing > to make 1000 more method calls (but not more) if that meant > covering one more coverage point. >
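> As a concrete rendering of this formula, consider the following
> minimal Java sketch (the class and method names are ours, for
> illustration only):
> \begin{verbatim}
> // Chromosome fitness: a coverage point is worth COVERAGE_FACTOR
> // method calls, so the GA will spend up to 1000 extra calls to
> // gain one more coverage point, but no more.
> final class Fitness {
>     static final long COVERAGE_FACTOR = 1000;
>
>     static long of(long coveragePointsCovered,
>                    long methodCallsPerformed) {
>         return coveragePointsCovered * COVERAGE_FACTOR
>                - methodCallsPerformed;
>     }
>
>     public static void main(String[] args) {
>         // e.g., 120 points covered using 50,000 calls scores 70,000
>         System.out.println(Fitness.of(120, 50_000));
>     }
> }
> \end{verbatim}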
1354c1399 < behaviour is to consider the command-line class names as a set --- > behavior is to consider the command-line class names as a set 1372,1373c1417,1418 < After finding the most fit chromosome, a test engineer can < perform the randomized testing that it specifies. --- > After finding the most fit chromosome, test engineers > apply the specified randomized testing. 1394c1439 < A test wrapper for class X is a class that contains one private --- > A test wrapper for class X is a class with one private 1397c1442 < class X. Each wrapper method simply passes the call on to the --- > class X. Each wrapper method passes calls to the 1400,1401c1445,1446 < To customize a test wrapper for precondition checking, the user < can insert a check in the wrapper method before the target --- > To improve test wrapper precondition checking, users can add > checks in a wrapper method before the target 1403c1448 < method can simply return. --- > method just returns. 1409c1454 < %that it throws an {\tt AssertionError} (signalling test case failure) --- > %that it throws an {\tt AssertionError} (signaling test case failure) 1411c1456 < We provide switches to the test wrapper generation program that --- > We offer switches to the test wrapper generation program that 1419,1422c1464,1466 < To customize a test wrapper for coverage enhancement, < the user can insert extra methods that cause extra code to be < executed. We provide two switches < for commonly-desired enhancements. The switch \verb/--checkTypedEquals/ --- > To improve test wrapper coverage, > users may add methods to execute extra code. > The switch \verb/--checkTypedEquals/ 1425,1426c1469,1470 < of the wrapped object. This is distinct from the normal wrapper < method that calls {\tt equals}, which has an argument of type --- > of the wrapped object. This differs from the normal wrapper > method that calls {\tt equals}, which has an argument of type 1429,1430c1473,1474 < For classes X that implement their own {\tt equals} method, the < typed-equals method is likely to execute more code. --- > For classes that implement their own {\tt equals} method, the > typed-equals method likely executes more code.
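> As an illustration (a hand-written sketch, not the code that
> Nighthawk actually generates), a customized wrapper for a toy
> class {\tt X} might look like:
> \begin{verbatim}
> // A toy unit under test, standing in for an arbitrary class X.
> class X {
>     private int top = Integer.MIN_VALUE;
>     void push(int i) { top = i; }
>     @Override public boolean equals(Object o) {
>         return (o instanceof X) && ((X) o).top == top;
>     }
>     @Override public int hashCode() { return top; }
> }
>
> class XWrapper {
>     private final X wrapped = new X();  // the wrapped object
>
>     // Wrapper method with a user-added precondition check: if the
>     // precondition fails, the wrapper method just returns.
>     void push(int i) {
>         if (i < 0) return;
>         wrapped.push(i);
>     }
>
>     // Typed-equals method (cf. --checkTypedEquals): the argument
>     // has declared type X, so X's own equals code is exercised.
>     boolean typedEquals(X other) {
>         return other != null && wrapped.equals(other);
>     }
> }
> \end{verbatim}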
1449,1452c1493,1497 < To compare the results of our genetic-random approach with those < of the purely genetic approach of < Michael et al.\ \cite{michael-etal-ga-tcg}, we adapted their < published C code for the Triangle program to Java, transforming --- > To compare our genetic-random method against > the purely genetic approach of > Michael et al.\ \cite{michael-etal-ga-tcg}, > we ported their > C code for the Triangle program to Java, transforming 1455c1500 < Cobertura. We then ran Nighthawk 10 separate times on the --- > Cobertura. Nighthawk was then run 10 times on the 1492c1537 < To compare our results with those of the model-checking approach --- > To compare our method against the model-checking approach 1496,1498c1541,1543 < we downloaded the four data structure units used in those < studies. The units had been hand-instrumented to record < coverage of the deepest basic blocks in the code. --- > we downloaded the data structure units used in those > studies (these had been hand-instrumented to record > coverage of the deepest basic blocks in the code). 1501c1546 < the methods called by the previous researchers. We ran --- > the methods used by previous researchers. We ran 1508c1553 < --- > {\footnotesize 1527c1572 < --- > } 1546,1548c1591,1594 < methods used. It may also be in part due to < the fact that our experiments were run on a different < machine architecture than that of Pacheco et al. --- > methods used (it may also be in part due to > our use of a > different > machine architecture than Pacheco et al.). 1554,1555c1600,1601 < and on the full target classes. When using the full target < classes, Nighthawk was able to cover significantly more lines of --- > and on the full target classes. Using the full target > classes, Nighthawk could cover significantly more lines of 1575a1622 > {\footnotesize 1593c1640 < --- > } 1615c1662 < The results are in Figure \ref{jpf-mcc-results-fig}. We list --- > In Figure \ref{jpf-mcc-results-fig}, we list 1619c1666 < We also list the number of combinations covered by Nighthawk, --- > We also show the combinations covered by Nighthawk, 1629,1630c1676,1677 < In summary, the comparison suggests that < Nighthawk was achieving good coverage with respect to the --- > In summary, our comparisons show > Nighthawk achieving good coverage with respect to the 1662a1710 > {\footnotesize 1706c1754 < --- > } 1714a1763 > {\footnotesize 1756c1805 < --- > } 1765,1766c1814 < Figure \ref{collection-map-cov} shows the results of this study. < The column labelled SLOC shows the total number of source lines --- > The column labeled SLOC in Figure \ref{collection-map-cov} shows the total number of source lines 1786,1788c1834,1836 < in the table < indicate that there are statistically significant differences < between every pair except the (EN, PD) pair. --- > in the table show that there > are statistically significant differences > between every pair except (EN, PD). 1791,1792c1839,1840 < class because < the main constructor to {\tt EnumMap} expects --- > class. > The main constructor to {\tt EnumMap} expects 1794c1842 < Nighthawk had no facility for supplying such a type, and so --- > Nighthawk cannot supply such a type, so 1834,1836d1881 < Here we discuss the threats to validity of the empirical results < in this paper. < 1863c1908 < Time measurement is also a threat to construct validity. --- > Also, time measurement is a construct validity threat. 1865,2045c1910,1911 < gives total wall clock time rather than CPU time. This may give < run time numbers that do not reflect the actual cost to the user < of the testing. < < \section{Data Mining-Based Optimization} < < It is recognized that the design of genetic algorithms is a < ``black art'' \cite{ga-blackart}, and that very little is known < about why GAs work when they do work and why they do not work < when they do not. In the initial design of Nighthawk, we got < the GA to control many aspects of the randomized testing < algorithm, as reflected by the ten gene types listed in Figure < \ref{gene-types-fig}. This was the result of our expectation < that any of them might turn out to be crucial to the success < of the system, and our inability to predict which. < < However, for a genetic algorithm, every gene that is not useful < results in time wasted: time is spent mutating genes that do < not lead to better solutions, and time is also spent extracting < values from genes instead of simply using constant values. < This effect is especially pronounced for Nighthawk, since some < genes control code inside the innermost loops of the randomized < testing algorithm. We were therefore motivated to find whether < we could speed up Nighthawk by eliminating gene types that did < not lead to better performance. < < In this section, we describe how we used feature subset < selection (FSS) to analyze data from Nighthawk runs, in order to < systematically examine the usefulness of each gene type. < We also describe how we optimized the system by eliminating the < least useful genes, resulting in a system that achieved the same < quality of results in less time. < < \subsection{Feature Subset Selection} < < A repeated result in the data mining community is that simpler < models with equivalent or higher performance can be built via < {\em feature subset selection}, algorithms that intelligently < prune useless features \cite{hall03}.
Features may be pruned for < several reasons: < \begin{itemize} < \item They may be noisy, i.e.\ contain spurious signals unrelated < to the target class; < \item They may be uninformative, e.g.\ contain mostly one value, < or no repeating values; < \item They may be correlated to other variables, in which case < they can be pruned since their signal is also present in other < variables. < \end{itemize} < < The reduced feature set has many advantages: < \begin{itemize} < \item Miller has shown that models generally containing fewer < variables have less variance in their outputs \cite{miller02}. < \item The smaller the model, the fewer are the demands on < interfaces (sensors and actuators) to the external environment. < Hence, systems designed around small models are easier to use < (less to do) and cheaper to build. < \item In terms of this article, the most important aspect of < learning from a reduced features set is that it produces < smaller models. Such smaller models are easier to explain (or audit). < \end{itemize} < < The literature lists many feature subset selectors. < In the Wrapper method, a {\em target learner} is augmented with a < pre-processor that used a heuristic search to grow subsets of the < available features. At each step in the growth, the target learner < is called to find the accuracy of the model learned from the < current subset. Subset growth is stopped when the addition of new < features did not improve the accuracy. < Kohavi and John \cite{kohavi97} report experiments with Wrapper < where 83\% (on average) of the measures in a domain could be < ignored with only a minimal loss of accuracy. < < The advantage of the Wrapper approach is that, if some target < learner is already implemented, then the Wrapper is simple to < implement. The disadvantage of the wrapper method is that each < step in the heuristic search requires another call to the target < learner. Since there are many steps in such a search ($N$ < features have $2^N$ subsets), Wrappers may be too slow. < < Another feature subset selector is Relief. < This is an instance based learning scheme \cite{Kir92,Kon94} < that works by randomly sampling one instance within the data. It < then locates the nearest neighbors for that instance from not < only the same class but the opposite class as well. The values < of the nearest neighbor features are then compared to that of < the sampled instance and the feature scores are maintained and < updated based on this. This process is specified for some < user-specified $M$ number of instances. Relief can handle noisy < data and other data anomalies by averaging the values for $K$ < nearest neighbors of the same and opposite class for each < instance \cite{Kon94}. For data sets with multiple classes, the < nearest neighbors for each class that is different from that of the < current sampled instance are selected and the contributions are < determined by using the class probabilities of the class in the dataset. < < The experiments of Hall and Holmes \cite{hall03} reject numerous < feature subset selection methods. Wrapper is their preferred < option, but only for small data sets. For larger data sets, the < stochastic nature of Relief makes it a natural choice. < < Another reason to prefer Relief is that it can take advantage < of Nighthawk's stochastic search. Currently, Relief is a < batch process that is executed after data generation is < completed. 
However, it may not be necessary to wait till the end < of data generation to gain insights into which features are most < relevant. Given the stochastic nature of the algorithm, we can < see feature work where Relief and a genetic algorithm work in < tandem. Consider the random selection process at the core of < Relief: a genetic algorithm exploring mutations of the current < set of parents is an excellent stochastic source of data. In < the future, we plan to apply Relief incrementally and in < parallel to our GAs. < < \subsection{Data Collection and Analysis} < < We instrumented Nighthawk so that every time a chromosome was < evaluated, it printed the current value of every gene and the < final fitness function score. (For the two BitSet gene types, < we printed only the cardinality of the set.) < For each of the 16 Collection and Map classes from {\tt java.util}, < we ran Nighthawk for 40 generations. Each class therefore < yielded 800 observations, each consisting of a gene value vector < and the chromosome score. < < %Relief assumes continuous data and Nighthawk's performance < Relief assumes discrete data and Nighthawk's performance < scores are continuous. To enable feature subset selection, we < first discretized Nighthawk's output. A repeated pattern < across all our experiments is that the Nighthawk scores fall < into three regions: < \begin{itemize} < \item The 65\% majority of the scores are within 30\% of the top < score for any experiment. We call this the {\em high plateau}. < \item A 10\% minority of scores are less than 10\% of the maximum < score. We call this region {\em the hole}. < \item After the high plateau there is a {\em slope} of < increasing gradient that falls into {\em the hole}. This {\em < slope} accounds for 25\% of the results. < \end{itemize} < Accordingly, to select our features, we sorted the results and < divided them into three classes: bottom 10\%, next 25\%, remaining < 65\%. Relief then found the subset of features that distinguished < these three regions. < < Using the above discretization policy, we ran Relief 10 times in a 10-way < cross-validation study. < The data set was divided < into $10$ buckets. Each bucket was (temporarily) removed and Relief was < run on the remaining data. This produced a list of ``merit'' figures for < each feature (this ``merit'' value is an internal heuristic measure < generated from Relief, and reflects the difference ratio of neighboring < instances that have different regions). We took the maximum merit ($Max$) < and generated a {\it Selected} list. A feature was {\it Selected} if its merit < $m$ was $m> \alpha Max$ (e.g. at $\alpha=0.5$ then we selected all < features that score at least half the maximum merit). < < \begin{figure} < < \begin{center} < \begin{tabular}{|c|c|} < $\alpha$ & $| {\it Selected} | $\\\hline < 0.9 & 39 \\ < \hline < 0.8 & 62 \\ < \hline < 0.7 & 112 \\ < \hline < 0.6 & 217 \\ < \hline < 0.5 & 439\\ < \hline < \end{tabular} \vspace{2mm} \\ < \end{center} < < \caption{ < Numbers of selected features for values of $\alpha$. < } < \label{selected-features-fig} < \end{figure} --- > reports total wall clock time, not CPU time. This may show > run times that do not reflect the testing cost to a real user. 2047,2190c1913,1914 < Figure \ref{selected-features-fig} shows the number of features < that were {\it Selected} in < our 19 examples, using different values for $\alpha$. Note that as < $\alpha$ increases, we selected fewer and fewer features. 
< That is, this method allowed us to discover the genes that were most < important in selecting < for the {\em high plateau} and not the {\em slope} or {\em hole}. < < \begin{figure} < < \begin{center} < \begin{tabular}{|l|l|l|l|l|} < Gene type & $\alpha=0.9$ & $\alpha=0.5$ & Best avg ranks & Best merit \\ < \hline < {\tt candidateBitSet} & 18 & 106 & 1, 1, 1 & 0.373 \\ < \hline < {\tt chanceOfNull} & 0 & 3 & 24.1, 24.3, 25.4 & 0.166 \\ < \hline < {\tt chanceOfTrue} & 2 & 7 & 1, 3.9, 5.7 & 0.150 \\ < \hline < {\tt lowerBound} & 1 & 9 & 4.1, 5.6, 7.2 & 0.129 \\ < \hline < {\tt methodWeight} & 0 & 11 & 7.5, 7.8, 10.6 & 0.144 \\ < \hline < {\tt numberOfCalls} & 0 & 1 & 7.2, 8.4, 16.7 & 0.195 \\ < \hline < {\tt numberOfValuePools} & 0 & 6 & 14.1, 17.9, 20 & 0.186 \\ < \hline < {\tt numberOfValues} & 0 & 28 & 5.2, 5.2, 7.4 & 0.160 \\ < \hline < {\tt upperBound} & 8 & 16 & 1, 1, 1 & 0.267 \\ < \hline < {\tt valuePoolActivityBitSet} & 10 & 252 & 1, 1.4, 2 & 0.267 \\ < \hline < \end{tabular} < \end{center} < < \caption{ < Merit analysis of Nighthawk gene types. < } < \label{genes-merit-fig} < \end{figure} < < Since our goal was to identify gene types that were not useful and < that could therefore possibly be eliminated from chromosomes, we < tabulated the properties of each of the different types of genes. < Figure \ref{genes-merit-fig} shows the result. The three most < useful types of genes for the software under test were clearly < {\tt candidateBitSet}, < {\tt valuePoolActivityBitSet}, and < {\tt upperBound}, < since genes of these types sometimes had merit 90\% or more of < the maximum merit, often were ranked first or second in merit, < and were the only gene types such that some genes of that type < had merit greater than $0.2$ on some runs. This result confirms < our intuition that changing the value reuse policy encoded in < the genes of type < {\tt candidateBitSet} and < {\tt valuePoolActivityBitSet} < is an important way of improving the performance of a Nighthawk < chromosome. < The good performance of {\tt upperBound} is attributable to the < fact that for all the hash table-related units, adjusting this < parameter caused a ``load factor'' parameter in some of the < constructors to be able to take on values that led to the table < being reallocated frequently. < < Just as clearly, the two least useful gene types for the < software under test were < {\tt chanceOfNull} and < {\tt numberOfValuePools}, since neither were ranked better than < than 14th in merit for any run. < This does not indicate that null values and multiple value pools < for each type were not important, only that changing the default < values of these genes (3\% chance of choosing null, and two < value pools) did not usually result in improved performance. < < \subsection{Optimization and Performance Comparison} < < In order to see whether our analysis was useful, we optimized < the Nighthawk code. < We decided to retain the four gene types with < the highest maximum merit < ({\tt candidateBitSet}, < {\tt valuePoolActivityBitSet}, < {\tt upperBound}, and {\tt numberOfCalls}), < and also two other gene types: < {\tt lowerBound}, because it was paired with < {\tt upperBound}; and < {\tt numberOfValues}, because it had the highest maximum < merit of the remaining well-ranked gene types. < < Accordingly, we adjusted Nighthawk's code to delete all < references to the other four gene types < ({\tt chanceOfNull}, < {\tt numberOfValuePools}, < {\tt chanceOfTrue}, and < {\tt methodWeight}). 
We changed the code that depended on the < values stored for such genes in the current chromosome, so that < it instead used the default initial values for those gene types. < < On each of the 16 Collection and Map classes, < we performed one run of the original Nighthawk, < and one run of the new, optimized version. < For each version of Nighthawk, we then < performed the same analysis that is reflected in Figures < \ref{collection-map-cov} and < \ref{collection-times}; < that is, we asked RunChromosome to run 10 test cases with the < winning chromosome and measured the coverage, and we calculated < the clock time taken by Nighthawk to achieve the highest < coverage it achieved. We then compared the original and < optimized Nighthawk. < < The chromosomes resulting from the optimized Nighthawk achieved < slightly higher coverage than those from the original, though a < $t$ test concluded that the difference was not statistically < significant ($p=0.058$). However, the optimized Nighthawk was < faster to a statistically significant degree ($p=0.010$). The < time ratio between the original and optimized systems < was 1.46, i.e.\ the < optimized system was 46\% faster. We can therefore say that < the process of analyzing the usefulness of the genes resulted < in a system that ran substantially more quickly but achieved the < same high coverage of the SUT. < < \subsection{Discussion} < < The feature subset selection and optimization procedure worked < well for us when we collected data from a set of classes and < then optimized Nighthawk for those classes. This does not mean < that the optimized Nighthawk would necessarily work < well for other classes. For instance, for some classes there < might be a great deal of code that is accessible only by giving < null pointers as parameters; we could expect our newly optimized < version of Nighthawk to perform poorly on such classes, since < there would be a fixed 3\% chance of choosing null. < < However, the results of the optimization exercise indicate that < FSS-based analysis is a valuable adjunct to the GA in this < application. It is for this reason that we envision in the < future integrating FSS into a genetic-random testing algorithm, < so that the algorithm can improve its performance dynamically, < while GA mutation and recombination is going on. --- > %\input{fssold} > \input{fssnew} 2205,2207c1929,1940 < We have also shown that we were able to simplify the design < of the GA system and improve its runtime using automatic < feature subset selection. --- > > We have also shown that we were able to optimize and > simplify > meta-heuristic search tools. > Meta-heuristic tools (such as genetic algorithms and simulated > annealers) typically mutate some aspect of a candidate solution > and evaluate the results. If the effect of mutating each aspect > is recorded, then each aspect can be considered a feature and > is amenable to the FSS processing described here. > In this way, > FSS can be used to automatically find and remove > superfluous parts of the search control.
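> As a minimal sketch of the recording step this requires (names are
> illustrative, not taken from any particular tool), each evaluated
> candidate can be logged as one feature row for later FSS analysis:
> \begin{verbatim}
> import java.io.FileWriter;
> import java.io.IOException;
>
> // Log each mutated candidate's gene values together with its
> // fitness, so that a feature subset selector (e.g., Relief) can
> // later rank which mutated aspects actually matter.
> final class MutationLog implements AutoCloseable {
>     private final FileWriter out;
>
>     MutationLog(String path) throws IOException {
>         out = new FileWriter(path);
>     }
>
>     // One row per candidate: features = gene values,
>     // class label = the candidate's fitness score.
>     void record(long[] geneValues, long fitness) throws IOException {
>         StringBuilder row = new StringBuilder();
>         for (long g : geneValues) row.append(g).append(',');
>         out.write(row.append(fitness).append('\n').toString());
>     }
>
>     @Override public void close() throws IOException { out.close(); }
> }
> \end{verbatim}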
2215a1949,1956 > Further, we can see a promising line of research where the cost/benefits > of a particular meta-heuristic are tuned to the particulars of a specific problem. > Here, we have shown that if we surrender one tenth of the coverage, we can > run Nighthawk ten times faster. While this is an acceptable trade-off in many domains, > it may be unacceptable in > safety-critical applications. > More work is required to understand how to best match meta-heuristics (with or without FSS) > to particular problem domains. 2316c2057 < % use section* for acknowledgement --- > % use section* for acknowledgment