\section{Evaluation Metric: ``The Koru Diagram''} \label{korudiagram}
\begin{figure}[!ht] \begin{center} \resizebox{100mm}{!}{\includegraphics{figures/chapter4/4.1/koru.png}} \caption{Illustration of Main Components of Koru Diagram.} \label{fig:koruexample} \end{center} \end{figure}
\fig{koruexample} shows the basic structure of the Koru Diagram. The Koru Diagram is a method of evaluating the performance of different detectors generated by machine learners or by other means, some of which are discussed below in \sect{defectlearners}. The three lines in \fig{koruexample} correspond to the three detectors described in \sect{defectlearners}, and the way a detector is plotted on this diagram is the key to understanding its evaluation. The x-axis of the diagram is a normalized version of effort. Effort here is the total number of lines of code explored so far by the detector over the total number of lines of code in the entire project; this is described further in \sect{defectdata}. The y-axis is the probability of detection of the detector being plotted. The curve of each detector on the Koru Diagram is thus a measure of how much effort is spent as defective modules are identified.

There are a few things one can learn from this graph that are not explicitly defined in the plots. For instance, if a curve in the Koru Diagram has many plateaus in which effort is growing but the probability of detection is not, one can infer that the detector is reporting non-faulty modules as faulty. Another implicit observation is explained in \sect{defectoracle}. A third observation, easily inferred from the diagram and illustrated in \sect{defect}, is that most detectors do not find all defective modules; machine learners are not perfect and must tolerate some misclassifications to avoid over-fitting. The way this is reflected in the evaluation of a detector can be found in \sect{auc}. A final observation concerns the points (0,100) and (100,100). No detector will ever pass through or stop at (0,100), because that point implies that all of the defective modules were found while no code was searched, an impossibility since every module must have some source code associated with it. The point (100,100) is special because any detector that finishes there has looked at every single module in the project; the reason the manual and launam curves in \fig{koruexample} end there is explained in \sect{defectmanup} and \sect{defectmandown}.

To generate the information needed to plot a curve on the Koru Diagram, take all of the modules that the detector classifies as defective and sort them by their lines of code in ascending order. All curves start at the point (0,0). From there, for each module in the sorted list, increment the effort by that module's share of the project's total lines of code and, if the module is actually defective, increment the probability of detection by the fraction that one defective module is worth. A point on the Koru Diagram can therefore be interpreted as the amount of effort the detector has incurred to detect the number of defective modules it has found so far, and a curve is the path the detector takes through this space toward its goal.
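To make this procedure concrete, \fig{korucurvesketch} sketches the construction of one curve. The function name and data layout are illustrative assumptions, not the actual tooling used in these experiments.
\begin{figure}[!ht] \ssp \lstset{language=Python} \lstset{backgroundcolor=\color{listinggray}} \lstset{frameround=tttt} \lstset{basicstyle=\scriptsize} \lstset{showstringspaces=false, showspaces=false, showtabs=false} \begin{center}
\begin{lstlisting}[frame=trBL]{}
def koru_curve(flagged, total_loc, total_defects):
    # flagged: (loc, is_defective) pairs for every module
    # the detector reported as defective.
    flagged = sorted(flagged, key=lambda m: m[0])  # smallest modules first
    effort, pd = 0.0, 0.0
    points = [(0.0, 0.0)]                # all curves start at (0,0)
    for loc, is_defective in flagged:
        effort += 100.0 * loc / total_loc    # every flagged module costs effort
        if is_defective:
            pd += 100.0 / total_defects      # only true defects raise pd
        points.append((effort, pd))
    return points
\end{lstlisting}
\end{center} \caption{Sketch of Koru Curve Construction.} \label{fig:korucurvesketch} \dbsp \end{figure}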
The reason the modules classified as defective are sorted in ascending order is that faults are distributed disproportionately in smaller modules \cite{quantitative}. This observation can be used by an expert applying the detector to their system to find a reasonable stopping point. Essentially, if the distribution of faults in a system follows the Pareto Distribution described in \sect{pareto}, as was observed in \cite{NEEDS CITATION} and repeated in the data sets used in \sect{defect}, then sorting the modules classified as defective in ascending order by lines of code should produce a curve most similar to the oracle curve illustrated in \fig{koruexample}. If the opposite is true, that is, if the largest modules contain the most faults and smaller modules have a far lower fault density \cite{NEEDS CITATION}, then the oracle curve will have a shape similar to the launam: it will still stop at the same location in the diagram, but it will grow very slowly at first and then very quickly towards its final position, the exact opposite of what is illustrated in \fig{koruexample}.
\subsection{The Pareto Distribution} \label{pareto} Reserved for my explanation of the Pareto Distribution, or the ``80-20'' observation. A description of the Pareto Distribution is found in \cite{lorenz}. Sources that also report a Pareto distribution of faults are \cite{paretosrc1}, \cite{paretosrc2}, and \cite{paretosrc3}.
\subsection{Special Detectors} \label{defectlearners}
\subsubsection{Oracle} \label{defectoracle} The oracle curve described in this section is illustrated in \fig{koruexample}. The oracle is the perfect detector. It is a purely hypothetical creation and exists to illustrate the best possible detector that could be generated by a learner. The oracle has one rule for detecting a defective module, and that rule is illustrated in \fig{oraclerule}.
\begin{figure}[!ht] \ssp \lstset{language=Python} \lstset{backgroundcolor=\color{listinggray}} \lstset{frameround=tttt} \lstset{basicstyle=\scriptsize} \lstset{showstringspaces=false, showspaces=false, showtabs=false} \begin{center}
\begin{lstlisting}[frame=trBL]{}
if [ defect = true ]
then report defective module
endif
\end{lstlisting}
\end{center} \caption{Oracle Detection Rule} \label{fig:oraclerule} \dbsp \end{figure}
The main purpose of the oracle is to show the minimum possible effort required to detect all of the defective modules in a project. As was observed in most of the project data used in the experiments of \sect{defect}, the oracle grows very rapidly at first and then slows down after most of the defects have been detected. There will be no horizontal line in the oracle's curve because the detector that generated it has a probability of false alarm of zero. The reason for the slower growth towards the end is that the last remaining defects are located in modules that are very large with respect to the median module size. Statistics about some of the projects used in \sect{defect} can be found in \sect{defectdatastats}.
\subsubsection{Manual\-Up} \label{defectmanup} The manual\-up detector described in this section is illustrated as the line titled manual in \fig{koruexample}. A manual\-up detector does not use a machine learner to generate a rule for each individual module. Instead, manual\-up works by sorting all of the modules by lines of code in ascending order and exploring every one of the modules in that order; a sketch of both manual orderings is given below.
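Assuming each module record carries a \texttt{loc} attribute (an illustrative assumption), both manual orderings reduce to a single sort, as sketched in \fig{manualordersketch}.
\begin{figure}[!ht] \ssp \lstset{language=Python} \lstset{backgroundcolor=\color{listinggray}} \lstset{frameround=tttt} \lstset{basicstyle=\scriptsize} \lstset{showstringspaces=false, showspaces=false, showtabs=false} \begin{center}
\begin{lstlisting}[frame=trBL]{}
def manual_order(modules, ascending=True):
    # ascending=True  -> manual-up:   smallest modules first
    # ascending=False -> manual-down: largest modules first
    # Both inspect every module, so both curves end at (100,100).
    return sorted(modules, key=lambda m: m["loc"],
                  reverse=not ascending)
\end{lstlisting}
\end{center} \caption{Sketch of the Manual\-Up and Manual\-Down Orderings.} \label{fig:manualordersketch} \dbsp \end{figure}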
Both this detector and the manual\-down detector described in \sect{defectmandown} are meant to serve as baselines for the Koru Diagram: if a detector cannot achieve a higher area than one of these two methods, it is not worth using. An interesting thing to note about the manual\-up detector is that in most of the projects used in the experiments of \sect{defect}, it beats manual\-down. This is due to the distribution of faults in the modules that is described in \sect{pareto} and \sect{defectdatastats}. One article discusses a manual inspection method and its limits \cite{korusupt1}; the type of inspection it discusses is very similar to our manual\-up inspection.
\subsubsection{Manual\-Down} \label{defectmandown} The manual\-down detector described in this section is illustrated as the line titled launam in \fig{koruexample}. A manual\-down detector works very much like a manual\-up detector, with one slight difference that makes a large difference in how well it performs: its modules are sorted in descending order by lines of code, unlike all of the other detectors in these experiments. This causes the curve to grow slowly at first and then quickly towards the end, rather than following the rapid early growth of the oracle.
\subsection{Area Under the Curve} \label{auc} The area under any curve plotted on the Koru Diagram of \sect{korudiagram} is a metric for evaluating a given defect detector. As was stated in a previous section, not all detectors end at the point (100,100); in fact, a detector ending there is an indicator of a poor detector, for the reasons described in \sect{korudiagram}. The maximum area for any possible detector is the area of the oracle described in \sect{defectoracle}. Upon inspection of the Koru Diagram, it is easy to see that the oracle curve has very little area under it if it is taken to end at the point where it actually ends. To ensure that the oracle does in fact have the most area, the area under the line from the point where the oracle ends, a point that will always be of the form (x,100), to (100,100) is added to the oracle's area. The same goes for all curves in this space: if a curve ends at (x,y), then the area under the horizontal line that begins at (x,y) and ends at (100,y) is added to the detector's total area. If a detector does not detect all defective modules in the project, this is therefore reflected in the area metric. The reason for using a straight horizontal line is that it assumes the worst possible case for how the detector behaves in the region where it never fires: it misses every defective module in that region. Finally, in order to effectively rank the areas under each detector's curve, each area is divided by the oracle's area, which implies that the perfect detector has a score of one.
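\fig{aucsketch} gives a minimal sketch of this normalized area computation over the curve points built as in \sect{korudiagram}. Trapezoidal integration is an assumption of the sketch; the horizontal extension to $x=100$ follows the rule above.
\begin{figure}[!ht] \ssp \lstset{language=Python} \lstset{backgroundcolor=\color{listinggray}} \lstset{frameround=tttt} \lstset{basicstyle=\scriptsize} \lstset{showstringspaces=false, showspaces=false, showtabs=false} \begin{center}
\begin{lstlisting}[frame=trBL]{}
def normalized_auc(points, oracle_points):
    # Area under a Koru curve, normalized by the oracle's area.
    def area(pts):
        x_end, y_end = pts[-1]
        pts = pts + [(100.0, y_end)]   # worst-case horizontal extension
        total = 0.0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            total += (x1 - x0) * (y0 + y1) / 2.0   # trapezoidal rule
        return total
    return area(points) / area(oracle_points)  # oracle scores 1.0
\end{lstlisting}
\end{center} \caption{Sketch of the Normalized Area Under the Curve Calculation.} \label{fig:aucsketch} \dbsp \end{figure}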
\section{Experiments with Which's Parameters} \label{which-exp} Which has some configurable parameters that are mostly meant to change the run time of the application rather than the way it reports and evaluates its rules. Appropriate values for these parameters can vary widely depending on the data Which is meant to explore. The following sections describe the different parameters and how Which uses them to potentially change its run time. These parameters were discussed in the implementation portion of \sect{stoppingcond} earlier in \chap{which}.
\subsection{Changing the Maximum Selection Count} \label{expmaxselectcount} One of the main controls over the running time of Which, and over the potential quality of the rules it produces, is the maximum number of combinations that can be made before Which completes execution. Which in its current state is an offline machine learner: it takes as input all of the data that it will use to create a final rule. Since a result is required in some reasonable time frame, a simple solution is to only allow Which to combine so many times before stopping and reporting the rule. This value should be set high enough to allow a reasonable search of the search space. \sect{expcheckparam} provides a solution, with some experimental results in its favor, showing that there is no real need to trade off a value high enough to reasonably search the space against one low enough not to require too much time.
\subsection{Changing the Check Parameters} \label{expcheckparam} Which removes the burden of finding the optimal value for the maximum selection count described in \sect{expmaxselectcount} by allowing two parameters to be configured for every run of the application. These parameters are known as ``check every'' and ``improvement.'' \sect{checkevery} and \sect{imp} below further describe what these two parameters do and how they can be used to gain a faster running time without sacrificing any of the quality of the rules.
\subsubsection{Check Every} \label{checkevery} Check every is a parameter that allows the user to avoid a common ceiling effect, shown in \sect{checkparamres} below. This effect can be attributed to what Menzies et al.\ said in \cite{micro} about the ceiling effects of machine learners on static data. The parameter controls how often Which checks for a potential ceiling effect occurring in its list. A ceiling effect can be defined as the combinations Which creates to search the solution space either never achieving a better score or always achieving very similar scores; the result is that Which never finds a better rule regardless of the number of combinations it makes. This ceiling effect can occur very rapidly and can waste much of the application's run time, since Which never finds a better rule than the one found just before the ceiling happened. This early stopping condition can be turned off by setting the check every parameter to a value greater than or equal to the number of picks Which will perform.
\subsubsection{Improvement} \label{imp} Improvement is a parameter that allows the user to set how flat a plateau has to be for it to count as a ceiling. In other words, improvement is what allows Which to determine whether two rules are effectively the same in terms of their scores. This number ranges from 0 to 1 and stands for how much better the most recently generated rule must be than one generated some time ago. The closer this number is to 0, the stricter Which is about what it considers flat; in fact, a value of 0 for improvement disables this early stopping condition altogether. Improvement is tied to check every in that the improvement of the best rule in the sorted list is queried every ``check every'' times a new rule is created via combination.
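\fig{earlystopsketch} sketches how these two parameters might interact in the main combination loop. The helpers \texttt{combine()} and \texttt{best\_score()} are hypothetical stand-ins for Which's internals, not the actual source.
\begin{figure}[!ht] \ssp \lstset{language=Python} \lstset{backgroundcolor=\color{listinggray}} \lstset{frameround=tttt} \lstset{basicstyle=\scriptsize} \lstset{showstringspaces=false, showspaces=false, showtabs=false} \begin{center}
\begin{lstlisting}[frame=trBL]{}
def run_which(max_picks, check_every, improvement, combine, best_score):
    # combine() creates one new rule via combination; best_score()
    # reads the score at the top of the sorted rule list.
    last = best_score()
    for pick in range(1, max_picks + 1):
        combine()
        if pick % check_every == 0 and pick < max_picks:
            # Ceiling check.  With improvement = 0 this never fires,
            # and check_every >= max_picks skips the check entirely.
            if best_score() - last < improvement:
                break                 # ceiling reached: stop early
            last = best_score()
\end{lstlisting}
\end{center} \caption{Sketch of the Check Every and Improvement Early Stopping Loop.} \label{fig:earlystopsketch} \dbsp \end{figure}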
\begin{figure}[!ht] \begin{center} \begin{tabular}{cc} \resizebox{60mm}{!}{\includegraphics{figures/chapter4/ExperimentsWithWhichParams/CheckEvery/Plots/vowel.png}} & \resizebox{60mm}{!}{\includegraphics{figures/chapter4/ExperimentsWithWhichParams/CheckEvery/Plots/heart-c.png}} \\ \resizebox{60mm}{!}{\includegraphics{figures/chapter4/ExperimentsWithWhichParams/CheckEvery/Plots/soybean.png}} & \resizebox{60mm}{!}{\includegraphics{figures/chapter4/ExperimentsWithWhichParams/CheckEvery/Plots/audiology.png}} \\ \resizebox{60mm}{!}{\includegraphics{figures/chapter4/ExperimentsWithWhichParams/CheckEvery/Plots/autos.png}} & \resizebox{60mm}{!}{\includegraphics{figures/chapter4/ExperimentsWithWhichParams/CheckEvery/Plots/segment.png}} \\ \end{tabular} \caption{Graphs Displaying the Early Maxima Phenomenon} \label{fig:check} \end{center} \end{figure}
\subsection{Changing the List Size} \label{listsize} \subsubsection{Motivation for Experiments} \label{listsizemotivation} As stated in \sect{finitelist} in \chap{which}, the basic Which algorithm uses an infinite list: every rule, or detector, that was ever generated remains in the sorted linked list. This poses two problems. The first is that if the number of picks is very large, the number of rules generated will be very large, which could make the placement of each new rule a severe bottleneck in some situations. The second problem lies in the random selection algorithm. The algorithm was designed so that a low scoring rule has less of a chance of being selected than a high scoring rule, under the assumption that combining high scoring rules will lead to a desired optimal solution faster than combining rules chosen entirely at random. It does, however, still give a low scoring rule some chance of being selected for combination, on the assumption that a low scoring rule, in conjunction with another, probably high scoring, rule may produce a desirable result that would otherwise be overlooked. The problem with having the list contain every rule ever generated is that the chance that some low scoring rule is selected grows once many combinations have already been made. This can lead to Which missing better combinations because it becomes bogged down selecting bad combinations towards the end of the run.

A solution to both problems is to limit the size of the sorted linked list that Which uses to keep track of all of the combinations generated so far, keeping only the top N best scoring rules. This would make it possible to calculate a worst case analysis of the Which algorithm, and depending on the list size and the overall number of combinations to be made, it could greatly improve the run time of Which without losing performance in terms of proper rule generation. That last statement only holds if our assumption is correct that combinations of two high scoring rules lead to better scores than combinations of low scoring rules with either high or low scoring rules; if it does not hold, then we should see a significant performance decrease in the Which algorithm. A sketch of such a bounded list is given below, and \sect{listsizeexperiments} explains the details of the experiments run on this topic.
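\fig{boundedlistsketch} is a minimal sketch of such a bounded list; the class and method names are illustrative assumptions, not Which's actual data structure.
\begin{figure}[!ht] \ssp \lstset{language=Python} \lstset{backgroundcolor=\color{listinggray}} \lstset{frameround=tttt} \lstset{basicstyle=\scriptsize} \lstset{showstringspaces=false, showspaces=false, showtabs=false} \begin{center}
\begin{lstlisting}[frame=trBL]{}
import bisect

class BoundedRuleList:
    # Keep only the top-N rules, sorted ascending by score.
    def __init__(self, max_size):
        self.max_size = max_size
        self.scores, self.rules = [], []

    def insert(self, score, rule):
        i = bisect.bisect(self.scores, score)  # placement cost is bounded
        self.scores.insert(i, score)
        self.rules.insert(i, rule)
        if len(self.scores) > self.max_size:
            del self.scores[0]                 # cull the worst scorer
            del self.rules[0]
\end{lstlisting}
\end{center} \caption{Sketch of a Bounded Sorted Rule List.} \label{fig:boundedlistsketch} \dbsp \end{figure}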
\subsubsection{Experiments} \label{listsizeexperiments} This experiment was done using six different sorted list sizes, descriptions of which can be seen in \tab{finitelistdesc}. The method of training and testing on the data is explained in \sect{xval}. The type of evaluation used for this experiment was the Koru Diagram described in \sect{korudiagram}. In this experiment, Which was compared only to itself in its native form with an infinite list size, because the experiment was meant solely to judge the change in performance, in terms of rule score, that may or may not occur when the infinite list is made finite. \sect{listsizeres} below shows the results of this experiment.
\begin{table}[!ht] \begin{center} \begin{tabular}{| r | l |} \hline \textbf{Learner} & \textbf{Description} \\ \hline \hline oracle & The perfect detector as described in \sect{defectoracle}. \\ \hline which2 & The standard infinite list version of Which. \\ \hline which10 & Which running with a maximum list size of 10. \\ \hline which20 & Which running with a maximum list size of 20. \\ \hline which50 & Which running with a maximum list size of 50. \\ \hline which100 & Which running with a maximum list size of 100. \\ \hline which1000 & Which running with a maximum list size of 1000. \\ \hline \end{tabular} \caption{Description of Different Learners Used in Experiment of \sect{listsize}.} \label{tb:finitelistdesc} \end{center} \end{table}
\subsubsection{Results} \label{listsizeres} \fig{listsizegraphs} shows the results for a sample of the data sets used in this experiment.
\begin{figure}[!ht] \begin{center} \begin{tabular}{cc} \resizebox{60mm}{!}{\includegraphics{figures/chapter4/ExperimentsWithWhichParams/FiniteList/ar4_3.png}} & \resizebox{60mm}{!}{\includegraphics{figures/chapter4/ExperimentsWithWhichParams/FiniteList/ar4_1.png}} \\ \resizebox{60mm}{!}{\includegraphics{figures/chapter4/ExperimentsWithWhichParams/FiniteList/ar4_2.png}} & \resizebox{60mm}{!}{\includegraphics{figures/chapter4/ExperimentsWithWhichParams/FiniteList/ar5_2.png}} \\ \end{tabular} \caption{Graphs Displaying the Effects of Changing the Sorted List Maximum Size.} \label{fig:listsizegraphs} \end{center} \end{figure}
\section{TAR3} \label{t3} \subsection{Comparison of TAR3 and Which} \label{t3vwhich} \subsection{Design of Experiments} \label{t3design} \subsection{Which's Heuristic} \label{t3whichh} \subsubsection{Data Used in Experiments} \label{t3data}
\begin{table}[!ht] \begin{center} \begin{tabular}{|r|c|c|c|} \hline \textbf{Name} & \textbf{\# Classes} & \textbf{``Best Class'' \%} & \textbf{\# Instances} \\ \hline \hline anneal & 6 & 4.43951 & 1058 \\ \hline audiology & 24 & 0.884956 & 298 \\ \hline autos & 7 & 12.9808 & 336 \\ \hline breast-cancer & 2 & 29.4118 & 394 \\ \hline colic & 2 & 36.9565 & 726 \\ \hline credit-g & 2 & 30 & 1301 \\ \hline diabetes & 2 & 34.8958 & 863 \\ \hline hypothyroid & 4 & 0.0529801 & 3887 \\ \hline ionosphere & 2 & 63.3803 & 458 \\ \hline kr-vs-kp & 2 & 47.7337 & 3351 \\ \hline letter & 26 & 3.67 & 20191 \\ \hline mushroom & 2 & 48.1851 & 8301 \\ \hline primary-tumor & 22 & 7.07965 & 596 \\ \hline segment & 7 & 14.2857 & 2406 \\ \hline sick & 2 & 6.11921 & 3885 \\ \hline sonar & 2 & 53.3654 & 408 \\ \hline splice & 3 & 51.8321 & 3379 \\ \hline vehicle & 4 & 23.4393 & 1058 \\ \hline vote & 2 & 38.3562 & 651 \\ \hline vowel & 11 & 9.09091 & 1201 \\ \hline waveform-5000 & 3 & 33.0735 & 5086 \\ \hline weather & 2 & 35.7143 & 23 \\ \hline weather.nominal & 2 & 35.7143 & 23 \\ \hline \end{tabular} \caption{Description of the UCI \cite{uci} data sets used in the experiment of \sect{t3}.} \label{tb:ucidesctable} \end{center} \end{table}
\subsubsection{Evaluation Criteria} \label{t3eval}
\subsection{Results} \label{t3results}
\section{Multi Class Classification} \label{multiclass} The purpose of this experiment was to show how Which performs when more than two classes are present in the data sets. The data sets used in this experiment come from the UCI \cite{uci} data sets listed in \tab{ucidesctable}, and were selected solely on the criterion that they have more than two classes. A data set with only two classes can produce a high accuracy by interpreting the rule that Which generates as having an added else clause. For instance, Which may generate a rule like the one in \fig{multiclasswhichrule}. In the case of only two classes, this can be understood as, ``if that rule, then class X, else class Y.'' However, with more than two classes, the rule can only be interpreted as, ``if that rule, then class X, else no answer,'' which leads to serious problems in the prediction quality of the rule. As described in \chap{which}, Which produces rules that best predict the ``best class'' in a data set. A best class is meaningful when the goal is to predict one particular class of interest that new instances in a data set may fall into. However, in most of these UCI data sets there is no best class; the classes are all equally valued, so selecting one of them as the best is completely artificial.
\begin{figure}[!ht] \ssp \lstset{language=Python} \lstset{backgroundcolor=\color{listinggray}} \lstset{frameround=tttt} \lstset{basicstyle=\scriptsize} \lstset{showstringspaces=false, showspaces=false, showtabs=false} \begin{center}
\begin{lstlisting}[frame=trBL]{}
if [ outlook = overcast ] AND temp = [ high OR low ]
\end{lstlisting}
\end{center} \caption{Example of a Rule Generated by Which.} \label{fig:multiclasswhichrule} \dbsp \end{figure}
This experiment was created to show an area in which the current version of Which fails. It was meant to illustrate a counter to the experiments of \sect{defectmdp} and \sect{turkeydata} discussed in this chapter; essentially, Which fits into a niche portion of the data mining world. A possible solution for this problem in Which is discussed in \chap{futurework}.
\subsection{Which's Heuristic} \label{muilticlasswhichh} A common metric for evaluating a decision making process created by a machine learner is \textit{accuracy}. The equation for accuracy can be seen in \eq{accuracy}. In this equation, $TP_{i}$ stands for the number of true positives of the $i^{th}$ class, that is, the number of instances of the $i^{th}$ class that were accurately classified as the $i^{th}$ class. The $n$ in the summation is the number of classes. Finally, the $class_{i}$ term in the denominator is the number of instances of the $i^{th}$ class, so the denominator is the total number of instances in the data set. Accuracy effectively depicts the fraction of instances that were classified correctly.
\begin{center} \begin{equation} \displaystyle h_{accuracy} = \frac{\sum_{i=1}^n TP_{i}}{\sum_{i=1}^n class_{i}} \label{eq:accuracy} \end{equation} \end{center}
There are some problems associated with accuracy that are described in \cite{blindspot}. For example, if a data file has one class that is very infrequent and a learner never correctly classifies that class, the accuracy may still be very high. Take the following example in \fig{accexample}: in this two class example, one class makes up 99\% of the total distribution.
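\fig{accsketch} makes this blind spot concrete with a small standalone sketch of \eq{accuracy}; the code is illustrative only and is not drawn from any of the learners used here.
\begin{figure}[!ht] \ssp \lstset{language=Python} \lstset{backgroundcolor=\color{listinggray}} \lstset{frameround=tttt} \lstset{basicstyle=\scriptsize} \lstset{showstringspaces=false, showspaces=false, showtabs=false} \begin{center}
\begin{lstlisting}[frame=trBL]{}
def accuracy(actual, predicted):
    # h_accuracy: percentage of instances classified correctly.
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * hits / len(actual)

# 99 instances of class X and 1 of class Y; a rule that always says X.
actual    = ["X"] * 99 + ["Y"]
predicted = ["X"] * 100
print(accuracy(actual, predicted))  # 99.0, yet class Y is never detected
\end{lstlisting}
\end{center} \caption{Sketch of the Accuracy Blind Spot.} \label{fig:accsketch} \dbsp \end{figure}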
If a machine learner produces a rule that classifies everything as class X, it will have an accuracy of 99\%. The major problem with this is that if the class representing 1\% of the population is the class we are trying to predict, that information is entirely lost even though the detector is 99\% accurate. This type of detection, isolating one class and developing a rule for that class, is called lift learning and is described in \sect{t3}. This is also the type of learning that Which uses in its algorithm: as stated in \chap{which}, only one rule is produced, and that is the rule that best predicts a ``best class.'' Which therefore should, and does, fail in an accuracy test over a multi class data set.
\subsection{Design of Experiments} \label{multiclassdesign} In order to perform this experiment, only a few of the data sets from \tab{ucidesctable} were chosen; these data sets all had one thing in common, namely more than two classes. Once the data was chosen, a discretization method known as FayyadIrani \cite{fayyadirani}, discussed in \chap{2} and implemented in Weka \cite{weka}, was used as a preprocessor to put each of the learners on the same playing field. Each of the learners has its own discretization method built in; using FayyadIrani as a preprocessor ensured that each of the learners was actually learning from the same data. After discretization was complete, the data was broken into three train and three test sets with the process described in \sect{xval}. The learners were run using train $n$ and test $n$, and the results were compiled into MannWhitney U tests: one test for each of the individual files and an overall test on all of the scores.
\subsection{Results} \label{multiclassresults} \tab{mwumulticlass} contains the overall MannWhitney U test results for these learners on the data sets. It can be seen from these results that j48, \nbc, and ripper all perform the same on all of the data; they all tie with each other. It can also be seen that Which is completely outperformed in this environment.
\begin{table} \begin{center} \begin{tabular}{|r|c|c|c|c|} \hline \textbf{key} & \textbf{ties} & \textbf{win} & \textbf{loss} & \textbf{win-loss-at-99\%} \\ \hline \hline nBayes & 2 & 1 & 0 & 1 \\ \hline jRip & 2 & 1 & 0 & 1 \\ \hline j48 & 2 & 1 & 0 & 1 \\ \hline which & 0 & 0 & 3 & -3 \\ \hline \end{tabular} \caption{Overall MannWhitney U Test on Selected UCI Data Sets.} \label{tb:mwumulticlass} \end{center} \end{table}
\fig{quartilemulticlass} further illustrates the results of this experiment. Quartile charts are discussed in detail and used to report results in \cite{staticlearn}. Quartile charts have certain properties that make them valuable for reporting the results of cross validation experiments. First of all, so called standout results are clearly visible in these charts. Looking at the MannWhitney U test shows that Which lost to all three other learners; however, the magnitude of how poorly it performed compared to the other learners is not captured in that table. \fig{quartilemulticlass} illustrates this very well: the maximum accuracy score for Which over all the tests is 57.1. Another visual cue in the quartile chart is the median value, the black circle in the figure. It can be seen clearly in this chart that Which's median is zero, since no circle appears on its row.
The other thing that can be gained from looking at the medians is that the other three learners have similar medians, and that all of these medians are higher than the maximum performance of Which. Another property of the quartile chart is that it makes no parametric assumption about the underlying distribution of the data, unlike t tests \cite{ttest}. Further explanation of the U test, t test, and quartile charts can be found in \chap{related}.
\ssp \begin{figure}[!ht] \begin{center} \begin{tabular}{ccc} \textbf{Learner} & \textbf{Mean} & \textbf{Description} \\ j48 & 79.7 & \boxplot{33.6}{40.7}{79.7}{93.0}{6.4} \\ jRip & 79.7 & \boxplot{31.9}{35.2}{79.7}{93.4}{5.7} \\ nBayes & 82.9 & \boxplot{40.7}{26}{82.9}{91.3}{7.3} \\ which & 0.0 & \boxplot{0.0}{0.0}{0.0}{13.1}{44.0} \\ \end{tabular} \caption{Quartile Chart Illustrating the Results of \tab{mwumulticlass}.} \label{fig:quartilemulticlass} \end{center} \end{figure} \dbsp
These results were expected before the experiment; in actuality, the experiment was designed to illustrate this point about the Which design. The reason Which does so poorly with a multi class data file is that, by design, it is unable to generate rules for each of the classes; it can only generate rules for one class. While it may be possible to create a tool around Which that runs the application once for each class in the data file, that is beyond the scope of this thesis. Further improvements to Which can be found in \sect{futurework}.
\section{Defect Detection} \label{defect} One of the most impressive applications of the Which learner is defect detection in large software projects. Previous studies have shown that \nbc, described further below in \sect{defectlearnernbayes}, is superior at defect detection compared to the other classical learners \cite{staticlearn}.
\subsection{Data} \label{defectdata} \subsubsection{Basic Statistics Taken About Data} \label{defectdatastats}
\begin{table}[!ht] \begin{center} \begin{tabular}{| l || l | l | l | l | l |} \hline \textbf{Data} & cm1 & kc1 & pc1 & ar3 & ar5 \\ \hline \textbf{Family} & mdp & mdp & mdp & Turkey & Turkey \\ \hline \hline \textbf{Modules} & 498 & 2109 & 1109 & 63 & 36 \\ \hline \textbf{Median} & 17 & 9 & 13 & 42.5 & 42 \\ \hline \textbf{Mean} & 29.6448 & 20.3723 & 23.3761 & 89.2698 & 75.889 \\ \hline \textbf{$\sigma$} & 42.7106 & 29.7474 & 35.2681 & 115.802 & 88.8691 \\ \hline \textbf{Min} & 1 & 1 & 0 & 3 & 5 \\ \hline \textbf{Max} & 423 & 288 & 602 & 670 & 477 \\ \hline \textbf{Defect Dist} & 9.8\% & 15.5\% & 6.9\% & 12.7\% & 22.2\% \\ \hline \end{tabular} \caption{Statistics Taken from a Few of the Data Sets Used.} \label{tb:datastattable} \end{center} \end{table}
The data represented in \tab{datastattable} was taken from five of the twelve data sets used in the defect detection tests. The data sets for two of the three defect detection experiments were taken from the PROMISE website \cite{promise}. Cm1, kc1, and pc1 are part of the mdp family and come from NASA software projects. The Turkey data comes from a software company that creates control software for household appliances. The third experiment's data comes from \att. This data was used under a non disclosure agreement, so not much is allowed to be said about it; the results of the experiment and a description of some of the attributes used are all that can be reported.
\subsection{Design of Experiments} \label{defectdesign} \subsubsection{Cross Validation} \label{xval} Cross validation is a method of testing a machine learner when only one flat data set exists \cite{c45}. Its purpose is to allow a learner to be tested on data it has not trained on when two distinct sets do not exist; it is a purely test based approach to evaluating machine learners. Cross validation is a two phase process that allows for a strong evaluation of a learner, with an inner and an outer phase that each consist of a variable number of steps. The inner phase breaks the data set into two completely disjoint sets. A typical approach is to put 90\% of the data into the \textit{train} set and the remaining 10\% into the \textit{test} set; this is repeated 10 times, so once it is complete there are 10 train and 10 test sets where each test set contains completely unique data instances. This count is typically referred to as \textbf{N}. The outer phase randomizes the data a set number of times, usually 10, and repeats the inner phase for each randomization; this count is typically referred to as \textbf{M}. The reason for randomizing the data on every pass is to remove the potential for patterns in the data file itself. Overall, this method runs the learner $M \times N$ times, and this large number of runs allows different evaluation methods to determine how well the learner works overall. For the mdp and Turkey data sets referred to in \sect{defectmdp} and \sect{turkeydata} respectively, an M value of 10 and an N value of 3 were used. The reason 3 was used for N is that some of the data files have very few defects in them, and breaking the data into 10 train\-test sets left some test sets with no defects at all, which caused erroneous reporting of the probability of detection in some sets. Using this setup, each learner was run and tested 30 times for each data file.
\subsection{Which's Heuristic} \label{defectwhichh} As discussed in \sect{scoring} earlier in this writing, Which allows any scoring heuristic to determine how a generated rule scores. Several different scoring heuristics were tried in the tests; the one discussed first below is the heuristic that was used for the defect detection experiments.
\subsubsection{Balance} \label{balance} \eq{balanceeqn} illustrates the heuristic that Which uses to calculate the score of each rule. For our purposes, the best rule is one with a 100\% probability of detection and a 0\% probability of false alarm; this corresponds to the oracle discussed in \sect{defectoracle}. The balance score is the weighted Euclidean distance from the opposite, worst point in ROC space, $(pd, pf) = (0, 1)$, so rules closer to the ideal point score higher. Scalar terms $\alpha$ and $\beta$ are added to the equation to allow different levels of importance for each of the terms. By default, these values are 1 and 1000 respectively, which makes a very low pf three orders of magnitude more important than the pd term.
\begin{center} \begin{equation} h_{balance} = \frac{\sqrt{pd^2*\alpha + (1-pf)^2*\beta}}{\alpha+\beta} \label{eq:balanceeqn} \end{equation} \end{center}
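\fig{balancesketch} is a direct transcription of \eq{balanceeqn} as a standalone sketch; it is not the actual Which source.
\begin{figure}[!ht] \ssp \lstset{language=Python} \lstset{backgroundcolor=\color{listinggray}} \lstset{frameround=tttt} \lstset{basicstyle=\scriptsize} \lstset{showstringspaces=false, showspaces=false, showtabs=false} \begin{center}
\begin{lstlisting}[frame=trBL]{}
import math

def h_balance(pd, pf, alpha=1.0, beta=1000.0):
    # Weighted distance from the worst ROC point (pd=0, pf=1);
    # larger scores are better.  beta >> alpha rewards a low pf.
    return math.sqrt(pd ** 2 * alpha
                     + (1.0 - pf) ** 2 * beta) / (alpha + beta)
\end{lstlisting}
\end{center} \caption{Sketch of the Balance Heuristic of \eq{balanceeqn}.} \label{fig:balancesketch} \dbsp \end{figure}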
It should be noted that an equal contributor to how well a detector scores is how much effort is required to detect a certain percentage of the defects in a project. This metric was added to the balance equation, yielding \eq{balancewithefforteqn}, but the results for Which were not ideal. As before, scalar terms are added to the equation so that the importance of each part can be specified.
\begin{center} \begin{equation} h_{balancewitheffort} = \frac{\sqrt{pd^2*\alpha + (1-pf)^2*\beta + (1-effort)^2*\gamma}}{\alpha+\beta+\gamma} \label{eq:balancewithefforteqn} \end{equation} \end{center}
The typical rule selected using the heuristic of \eq{balancewithefforteqn} was a middle ground: the best rule Which could report had a pd of 0.5, a pf of 0.5, and an effort of 0.5. This is because of the way effort and pd are correlated. While it is possible for a rule to sit at the (0,1) mark in ROC space, it is impossible for a rule to sit at the (0,1) mark in Koru space, because detecting a module as defective always adds some amount of effort to the rule. This causes the phenomenon of finding the middle ground. Also, even though effort is not evaluated during the growing phase, pf is given a very high priority, which induces minimal effort: as described in \sect{korudiagram}, a detector with a high pf incurs very high overall effort, whereas a detector with a very low pf more closely follows the oracle curve.
\subsubsection{Probability of Detection Divided by Lines of Code} \label{pdoverloc} \eq{pdoverloceqn} illustrates how a rule is scored using this heuristic. It was decided before these experiments took place that, since it was easy to change the scoring method of Which, we would also try a different scoring heuristic.
\begin{center} \begin{equation} h_{\frac{PD}{LOC}} = \frac{PD}{\sum_{i=1}^n LOC_{i}} \label{eq:pdoverloceqn} \end{equation} \end{center}
\eq{pdoverloceqn} was meant to create rules that have a high probability of detection and a low effort total. Instead of using effort, which is the normalized number of lines of code inspected over the total lines of code, this equation uses the raw lines of code measures. In \eq{pdoverloceqn}, PD stands for the probability of detection, the summation in the denominator runs over the modules labeled as defective by a rule, and $n$ is the total number of modules so labeled. $h_{\frac{PD}{LOC}}$ was used because it was slightly different from $h_{balancewitheffort}$, and we hoped it would yield the kind of results we wanted: a high probability of detection with a low amount of required effort. The results of this heuristic are discussed in \sect{defectresults}.
\subsection{Evaluation Criteria} \label{defecteval} The evaluation criteria for all of the detectors used in these experiments are those described in \sect{auc}: each detector is plotted in Koru space and its normalized area under the curve is used for the MannWhitney U\-test. With the exception of the \att data used in \sect{atntdata}, the experiments all used cross validation; the reasons the \att data did not are discussed in the section on that experiment.
\subsection{Probability of Detection vs. \%Lines of Code} \label{defectpdvseffort} Probability of Detection vs. \%Lines of Code is a general name for the Koru Diagram. For each of the experiments in \sect{defectmdp}, \sect{turkeydata}, \sect{atntdata}, and \sect{micro}, this is both the method of displaying the results and the method of scoring them. Several of the figures in this thesis use this type of graph.
\section{MDP Data} \label{defectmdp}
\begin{table}[!ht] \begin{center} \begin{tabular}{|l | l | l|} \hline \textbf{Metric} & \textbf{Metric} & \textbf{Definition} \\ \textbf{Type} & & \\ \hline \hline McCabe & v(G) & Cyclomatic Complexity \\ & ev(G) & Essential Complexity \\ & iv(G) & Design Complexity \\ & LOC & Lines of Code \\ \hline Derived & N & Length \\ Halstead & V & Volume \\ & L & Level \\ & D & Difficulty \\ & I & Intelligent Count \\ & E & Effort \\ & B & Error Estimate \\ & T & Programming Time \\ \hline Line & LOCode & Lines of Code \\ Count & LOComment & Lines of Comment \\ & LOBlank & Lines of Blank \\ & LOCCodeAnd\- & Lines of Code \\ & Comment & and Comment \\ \hline Basic & UniqOp & Unique Operators \\ Halstead & UniqOpnd & Unique Operands \\ & TotalOp & Total Operators \\ & TotalOpnd & Total Operands \\ \hline Branch & BranchCount & Total Branch Count \\ \hline \end{tabular} \caption{Halstead and McCabe Descriptions of MDP Data Sets.} \label{tb:mdpdatadesctable} \end{center} \end{table}
\cite{randomforest} contains a table describing all of the different attributes used in both the MDP and the Turkey data sets; \tab{mdpdatadesctable} reproduces that table. All of these statistics are taken on a per module basis. Here, a module is the smallest whole unit in the language: in C this would be a function, and in Java a method of a class. This is not to be confused with the broader sense of module discussed in \cite{ostrand1}, where a file is said to be far more fine grained than a module, and a module is something like a collection of Java classes forming a strongly cohesive, strongly coupled system. Here, we use the term module for something more fine grained than a file. Three types of statistics are taken about each module: the McCabe \cite{mccabe, testless,randomforest}, Halstead \cite{halstead,testless,randomforest}, and line of code readings. McCabe's idea was that complexity in the pathways of the source code leads to more faults; the three attributes associated with his idea can be seen in \tab{mdpdatadesctable}. Cyclomatic complexity can be described as the number of linearly independent paths that can be taken through the source code; a simple module containing one if statement, or conditional, has a cyclomatic complexity measure of 2. Essential complexity refers to the cyclomatic complexity of a module after all well structured paths have been reduced to statements, for instance, unrolling a for or while loop into its statements. Design complexity of a module is the cyclomatic complexity of its simplified form.
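For instance, the illustrative module in \fig{vgsketch} (not taken from the MDP data) contains exactly one conditional, so it has $v(G) = 2$.
\begin{figure}[!ht] \ssp \lstset{language=Python} \lstset{backgroundcolor=\color{listinggray}} \lstset{frameround=tttt} \lstset{basicstyle=\scriptsize} \lstset{showstringspaces=false, showspaces=false, showtabs=false} \begin{center}
\begin{lstlisting}[frame=trBL]{}
def clamp(x):
    # One conditional creates two linearly independent paths
    # through this module, so its cyclomatic complexity v(G) is 2.
    if x < 0:
        return 0
    return x
\end{lstlisting}
\end{center} \caption{Illustrative Module with Cyclomatic Complexity 2.} \label{fig:vgsketch} \dbsp \end{figure}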
Halstead's idea was that code that is hard to read will have more faults. Instead of measuring the flow of the program, Halstead took measures of the code itself; these attributes can also be seen in \tab{mdpdatadesctable}. The final measurements taken about the modules are basic line of code readings, which give specific line counts about the code and its executable.
\subsection{Experiment} \label{defectexp} This experiment was done on 8 different MDP data sets using a 10x3 cross validation on nine different learners plus the oracle, manualUp, and manualDown detectors, for a total of $8 \times 10 \times 3 \times 12 = 2880$ train\-test runs. The statistics taken on each generated detector were the probability of detection and the total effort used once all of the modules the detector flagged as defective had been inspected. In order to get the specific data for the Koru Diagram and for the area under the curve calculations, a step by step record of each detected defective module was kept as well; this process was explained in \sect{korudiagram}.
\subsection{Results} \label{defectresults} As a quick note concerning the graphs in this section: several variants of Which were used, as described in \sect{defectexp}, but most of them did not prove fruitful at all. While their results are reported in the MannWhitney and quartile chart reports, they were omitted from the graphs for the sake of clarity. We have divided these results into three separate categories to illustrate different points: Which2 $>$ manualUp $>$ classic learners; Which2 $>$ everything else, with manualUp not coming in second; and everything $>$ Which.
\subsubsection{Category One} \label{defectrescat1} The first category is the one where Which2 has the best overall score, followed by manualUp, followed by the rest of the learners. This category can be seen graphically in \fig{mdpreswhichmanup}. We feel this category is significant because it illustrates two important points. The first is that Which2 was the overall winner, as can be seen in \fig{mdpreswhichmanup}. The second is that the manual detector manualUp, described in \sect{defectmanup}, performs better than any of the classic learners; the detectors generated by those learners give us nothing better than manually checking all of the source code in the manual\-up way, that is, simply inspecting the modules in ascending order by lines of code. Five of the eight data files used in this experiment fit into this category, a little over half of the total.
\begin{figure}[!ht] \begin{center} \begin{tabular}{cc} \resizebox{60mm}{!}{\includegraphics{figures/chapter4/MDP/plots/cm1.png}} & \resizebox{60mm}{!}{ \begin{tabular}{ccc} \textbf{learner} & \textbf{mean} & \textbf{Description} \\ which2 & 68.1 & \boxplot{0.0}{57.4}{68.1}{71.5}{10} \\ manualUp & 59.8 & \boxplot{48.3}{9.1}{59.8}{65.3}{6.2} \\ nBayes & 52.1 & \boxplot{36.2}{9.8}{52.1}{59.1}{10.1} \\ manualDown & 47.6 & \boxplot{33.6}{6.7}{47.6}{49.3}{10.9} \\ which8loc & 0.0 & \boxplot{0.0}{0}{0.0}{0.0}{16.1} \\ which8 & 11.4 & \boxplot{0.0}{0}{11.4}{26.2}{9.4} \\ which4loc & 0.0 & \boxplot{0.0}{0}{0.0}{0.0}{10.4} \\ which4 & 0.0 & \boxplot{0.0}{0}{0.0}{41.2}{27.8} \\ which2loc & 0.0 & \boxplot{0.0}{0}{0.0}{0.0}{40.7} \\ jRip & 5.8 & \boxplot{0.0}{0}{5.8}{11.5}{12.6} \\ j48 & 0.1 & \boxplot{0.0}{0}{0.1}{12.9}{20.4} \\ \end{tabular} } \\ \resizebox{60mm}{!}{\includegraphics{figures/chapter4/MDP/plots/kc1.png}} & \resizebox{60mm}{!}{ \begin{tabular}{ccc} \textbf{learner} & \textbf{mean} & \textbf{Description} \\ which2 & 76.0 & \boxplot{71.4}{2.4}{76.0}{78.0}{3.8} \\ manualUp & 67.6 & \boxplot{64.5}{1.3}{67.6}{68.9}{1.1} \\ nBayes & 61.9 & \boxplot{54.9}{5.3}{61.9}{63.0}{4.7} \\ which4 & 52.9 & \boxplot{0.0}{49.8}{52.9}{55.2}{5.3} \\ manualDown & 43.3 & \boxplot{39.7}{2.5}{43.3}{45.2}{2.5} \\ j48 & 27.8 & \boxplot{11.6}{8.9}{27.8}{31.7}{8.4} \\ jRip & 21.3 & \boxplot{10.2}{7.1}{21.3}{25.2}{7.2} \\ which8loc & 0.0 & \boxplot{0.0}{0}{0.0}{1.0}{1.2} \\ which8 & 0.0 & \boxplot{0.0}{0}{0.0}{2.0}{31.9} \\ which4loc & 0.0 & \boxplot{0.0}{0}{0.0}{0.0}{1.1} \\ which2loc & 0.0 & \boxplot{0.0}{0}{0.0}{0.0}{2.1} \\ \end{tabular} } \\ \resizebox{60mm}{!}{\includegraphics{figures/chapter4/MDP/plots/kc2.png}} & \resizebox{60mm}{!}{ \begin{tabular}{ccc} \textbf{learner} & \textbf{mean} & \textbf{Description} \\ which2 & 81.6 & \boxplot{65.6}{10.4}{81.6}{84.6}{3.9} \\ manualUp & 69.3 & \boxplot{57.9}{7.5}{69.3}{71.0}{5.6} \\ nBayes & 58.7 & \boxplot{47.0}{7.8}{58.7}{61.0}{8.4} \\ which4 & 59.4 & \boxplot{43.1}{9.4}{59.4}{66.8}{12.8} \\ manualDown & 46.1 & \boxplot{37.9}{5.2}{46.1}{52.3}{10.1} \\ which8 & 41.2 & \boxplot{26.3}{10.2}{41.2}{47.6}{8.9} \\ j48 & 41.2 & \boxplot{26.0}{10.1}{41.2}{45.9}{13.9} \\ jRip & 42.2 & \boxplot{22.2}{13.8}{42.2}{49.5}{15.7} \\ which8loc & 0.0 & \boxplot{0.0}{0}{0.0}{0.0}{5.9} \\ which4loc & 0.0 & \boxplot{0.0}{0}{0.0}{0.0}{2.9} \\ which2loc & 0.0 & \boxplot{0.0}{0}{0.0}{0.0}{3.1} \\ \end{tabular} } \\ \resizebox{60mm}{!}{\includegraphics{figures/chapter4/MDP/plots/mw1_mod.png}} & \resizebox{60mm}{!}{ \begin{tabular}{ccc} \textbf{learner} & \textbf{mean} & \textbf{Description} \\ which2 & 62.4 & \boxplot{35.8}{21.6}{62.4}{70.8}{12.5} \\ manualDown & 60.2 & \boxplot{42.8}{9.3}{60.2}{63.7}{8.1} \\ manualUp & 47.8 & \boxplot{37.1}{6.9}{47.8}{51.9}{10.6} \\ which8 & 39.3 & \boxplot{0.1}{35.5}{39.3}{47.6}{12.8} \\ nBayes & 41.7 & \boxplot{19.5}{13.6}{41.7}{47.7}{14.4} \\ which4 & 42.7 & \boxplot{0.0}{25.8}{42.7}{49.8}{10.8} \\ j48 & 20.0 & \boxplot{0.0}{10}{20.0}{24.3}{18.6} \\ jRip & 15.8 & \boxplot{0.0}{7.9}{15.8}{31.2}{18.2} \\ which8loc & 0.0 & \boxplot{0.0}{0}{0.0}{0.0}{10.5} \\ which4loc & 0.0 & \boxplot{0.0}{0}{0.0}{0.0}{10.4} \\ which2loc & 0.0 & \boxplot{0.0}{0}{0.0}{0.0}{25.6} \\ \end{tabular} } \\ \resizebox{60mm}{!}{\includegraphics{figures/chapter4/MDP/plots/pc1.png}} & \resizebox{60mm}{!}{ \begin{tabular}{ccc} \textbf{learner} & \textbf{mean} & \textbf{Description} \\ manualUp & 60.6 & \boxplot{52.1}{6.3}{60.6}{63.4}{8.2} \\ nBayes & 51.5 & 
\boxplot{36.4}{9.7}{51.5}{53.4}{7.5} \\ manualDown & 44.6 & \boxplot{32.3}{9.6}{44.6}{46.2}{9.1} \\ j48 & 19.2 & \boxplot{3.1}{9.4}{19.2}{24.6}{16.9} \\ jRip & 15.1 & \boxplot{0.0}{11}{15.1}{23.2}{7.6} \\ which8 & 22.6 & \boxplot{0.0}{9.1}{22.6}{30.7}{17} \\ which8loc & 7.4 & \boxplot{0.0}{0}{7.4}{12.7}{9.4} \\ which4loc & 3.8 & \boxplot{0.0}{0}{3.8}{14.8}{15.5} \\ which4 & 0.0 & \boxplot{0.0}{0}{0.0}{50.3}{8.7} \\ which2loc & 0.0 & \boxplot{0.0}{0}{0.0}{9.7}{16.6} \\ which2 & 65.0 & \boxplot{0.0}{0}{65.0}{71.1}{10.7} \\ \end{tabular} } \\ \end{tabular} \caption{MDP Data Files where Which2 $>$ manualUp $>$ rest. From top to bottom, left to right, these are cm1, kc1, kc2, mw1\_mod, and pc1.} \label{fig:mdpreswhichmanup} \end{center} \end{figure}
\subsubsection{Category Two} \label{defectrescat2} The second category is the one in which Which2 $>$ everything else, but manualUp is beaten by some of the classic learners. This category illustrates two points that we found important: first, that Which2 again wins overall, and second, that some of the detectors generated in this category perform better than the manual inspection, which shows that manualUp should not be treated as the de facto standard. This category contains two of the eight data sets used, bringing the total number of files on which Which2 had the best detector to seven out of eight.
\begin{figure}[!ht] \begin{center} \begin{tabular}{cc}
%\resizebox{60mm}{!}{\includegraphics{figures/chapter4/MDP/plots/kc3_mod.png}} &
%\resizebox{60mm}{!}{\includegraphics{figures/chapter4/MDP/plots/pc3_mod.png}} \\
\end{tabular} \caption{MDP Data Files where Which2 $>$ rest. From left to right, these are kc3\_mod and pc3\_mod.} \label{fig:mdpreswhichall} \end{center} \end{figure}
\subsubsection{Category Three} \label{defectrescat3} The third and final category is the one in which manualUp wins overall and the classic learners perform better than Which2. This category contains only one file, and it represents the one time out of eight that Which2 does not perform well.
\begin{figure}[!ht] \begin{center} \resizebox{100mm}{!}{\includegraphics{figures/chapter4/MDP/plots/mc2_mod.png}} \caption{MDP Data File where all $>$ Which. This is the data file mc2\_mod.} \label{fig:mdpreswhichlose} \end{center} \end{figure}
\section{Turkey Data} \label{turkeydata} \subsection{Experiment} \label{turkeyexp} \subsection{Results} \label{turkeyresults} \section{\att Data} \label{atntdata} \subsection{Experiment} \label{atntexp} \subsection{Results} \label{atntresults} \section{Micro Sampling} \label{micro} \subsection{What is Microsampling?} \label{microwhatis} \subsection{Data Description} \label{microdata} The MDP data of \sect{defectmdp} was used for this experiment. \subsection{Experiment} \label{microexp} \subsection{Results} \label{microresults}