5d4
< \usepackage{amssymb}
53,57c52,54
< Our goal is intelligent adaptive controllers of running
< programs that are monitored by
< temporal logic invariants. We want to find bugs, or ways
< to dodge them. Hence, we want
< data miners to watch
---
> Our goal is intelligent adaptive runtime monitoring of procedural
> programs augmented with temporal logic invariants. We want
> data miners to monitor
59c56
< constraints to input ranges which increase/decrease the number of logic
---
> constraints to input ranges which increase the number of logic
148,149c145
< experiments on 20 UCI data sets~\cite{Blake+Merz:1998}: {\em
< \{A:heart-c,
---
> experiments on 20 UCI data sets~\cite{Blake+Merz:1998}. A:heart-c,
153,160c149,150
< T:housing\}}. Data sets A..J have discrete classes and are scored
< via the {\em accuracy} of the learned theory; i.e.
< \% successful classifications.
< Data sets K..T
< have continuous classes and are scored by the {\em PRED(30)} of the learned
< theory; i.e. what \% of the estimated values are within 30\% of the actual
< value.
< Data sets are sorted according to how many
---
> T:housing. Data sets A..J have discrete classes while data sets K..T
> have continuous classes. Data sets are sorted according to how many
277c267
< number of $era$s of size $E$; i.e. $W=nE$ (default: {\em E=150} instances).
---
> number of $era$s of size $E$; i.e. $W=nE$ (default: {\em E=100} instances).
291c281
< \frac{S_j - (\frac{\sum_i S_x}{\sum_i E}*E)}{
---
> \frac{\frac{S_j}{E} - \frac{\sum_i S_x}{\sum_i E}}{
377c367
< numeric attribute using a single gaussian.
---
> numeric attribute using a single gaussian.
387,391c377,379
< John and Langley comment that their method
< must access all the individual numeric values to build their
< kernel estimator and this is impractical for large data sets.
< Many discretization methods violate the {\em one scan}
< requirement of a data miner: i.e. execute using only
---
> Many kernel estimation and discretization methods violate the {\em one scan}
> requirement of a data miner;
> i.e. learning needs only
406c394,396
<
---
> Similarly, John and Langley's kernel estimation method
> can't build its distribution
> until {\em after} seeing all the data.
410d399
< \section{Handling Numeric Attributes with SPADE}
411a401
> \section{Handling Numeric Attributes with SPADE}
421,435c411,413
< SPADE only scans the input data once and,
< at any time during the processing
< of $X$ instances, SPADE's bins are available.
< Further, if it ever adjusts bins (e.g. when merging bins with very small tallies),
< the information used for that merging comes
< from the bins themselves, and not some second scan of the instances.
< Hence, it can be used for the incremental processing of very
< large data sets.
<
< N\aive Bayes classifiers usually make some gaussian assumption about the underlying
< numeric distributions.
< SPADE makes no such assumptions.
< It is similar to {\em 10-bins} but the MIN and MAX change incrementally.
< The first value $N$ creates one bin and sets \{MIN=N,MAX=N\}.
< If a subsequent new value arrives inside the current
---
> SPADE is like {\em 10-bins} but the MIN and MAX change incrementally. The first
> value $N$ creates one bin and sets \{MIN=N,MAX=N\}. If a subsequent
> new value arrives inside the current
467,489d444
< Preventing the creation of very few bins with big tallies is essential
< for a practical
< incremental discretizer.
< Hence, SPADE checks for merges only occasionally (at the end of each era),
< thus allowing for the generation of multiple bins before they are merged.
<
< SPADE runs as a pre-processor to the N\aive Bayes {\tt update}. Newly arrived
< numerics get placed into bins and it is this bin number that is passed
< as the $value$ to {\tt update} or \fig{class}. Also, when SPADE
< merges bins, this causes a similar merging in frequency table entries
< (the $F$ variable of \fig{class}).
<
< The opposite of merging would be to {\em split} bins with unusually
< large tallies. SPADE has no split operator since we did not know how
< to best divide up a bin {\em without} keeping per-bin kernel
< estimation data (which would be memory-expensive).
< Our early experiments suggested that adding
< $SubBins=5$ new bins between old ranges and newly arrived out-of-range
< values was enough to adequately divide the range. Our subsequent
< experiments (see below) were so encouraging that we are not motivated
< to add a split operator.
<
<
511a467,475
> SPADE only scans the input data once and,
> at any time during the processing
> of $X$ instances, SPADE's bins are available.
> Further, if it ever adjusts bins (e.g. when merging),
> then the information used for that merging comes
> from the bins themselves, and not some second scan of the instances.
> Hence, it can be used for the incremental processing of very
> large data sets.
>
512a477,535
> \begin{figure}[!b]
> {\scriptsize
> \begin{center}
> \includegraphics[width=1.4in]{data/kdd.pdf}
> \includegraphics[width=1.4in]{data/kddbands.pdf}
> \end{center}}
> \caption{SAWTOOTH and the KDD'99 data}\label{fig:kdd99}
> \end{figure}
>
> \begin{figure*}
> \begin{center}
> \begin{minipage}{.6\linewidth}
> \begin{center}
> {\scriptsize
> \begin{tabular}{r@{~}r|@{~}l|l@{}c@{~}l@{~}c@{~}l@{}c@{}lc@{}r|}\cline{3-9}
> \multicolumn{2}{c}{}& \multicolumn{1}{|c|}{incremental} &\multicolumn{6}{c|}{batch}\\\cline{3-9}
> data set & \#instances & SAWTOOTH & NB & & nbk&&j48\\\hline
> \input{data/10way.dat}
> \end{tabular}}
> \end{center}
> \end{minipage}\hfill
> \begin{minipage}{.35\linewidth}
> \begin{center}
> \includegraphics[width=2in]{data/10way.pdf}
> ~\\
> ~\\~\\~\\
> {\scriptsize
> \begin{tabular}{|c|c|c|c|c|c|c|} \hline
> learner & win - loss & win & loss & ties \\ \hline
> J48 & 10 & 25 & 15 & 20 \\
> nbk & 8 & 17 & 9 & 34\\
> SAWTOOTH & -6 & 12 & 18 & 30\\
> NB & -12 & 9 & 21 & 30 \\
> \hline
> \end{tabular}
> %~\newline~\newline
> }\end{center}
> \end{minipage}
> \end{center}
> \caption{$mean \pm standard\; deviations$
> seen in 10*10-way cross validation
> experiments on UCI Irvine data sets.
> ``NB'' and ``nbk'' denote N\aive Bayes
> classifiers that use gaussians
> to model continuous attributes. ``NB'' uses a single
> gaussian while ``nbk'' uses a sum of gaussians in the method
> recommended by John and Langley~\cite{john95}.
> The plot top-right sorts the differences in the accuracies
> found by SAWTOOTH and all the other learners.
> Some of those differences aren't statistically significant:
> the ``+'' or ``-'' in the left-hand-side table denote
> mean differences that are significantly different
> from SAWTOOTH
> at the $\alpha=0.05$ level.
> The significant differences between
> all the learners are shown in
> the win-loss statistics of the bottom-right table.
> }\label{fig:uci}
> \end{figure*}
514c537
< \fig{spade} compares results from SPADE and John and Langley's kernel estimation method
---
> \fig{spade} compares results from SPADE and John and Langley's kernel estimation method
536,537d558
<
<
549,552c570,573
< were still within 3\% of kernel estimation.
< In our view, the advantages
< of SPADE (incremental, one scan processing, distribution independent) compensate
< for its occasionally performing worse
---
> were still within 3\% of kernel estimation.
> In our view,
> the advantages
> of SPADE (incremental, one scan processing) compensate for its occasionally performing worse
558a580
>
574a597
>
577,587c600,618
< In order to stress test our system, we ran it on the 5,300,000
< instances used in the 1999 KDD
< cup\footnote{\url{http://www.ai.univie.ac.at/~bernhard/kddcup99.html}}.
< KDD'99 dealt with network intrusion detection and was divided
< into a training set of about five million instances and a {\em test
< set} of 311,029 instances. The data comprised 6 discrete attributes,
< 34 continuous attributes, and 38 classes which fell into four main
< categories: {\em normal} (no attack); {\em probe} (surveillance and
< other probing); {\em DOS} (denial-of-service); {\em U2R} (unauthorized
< access to local super-user privileges); and {\em R2L} (unauthorized
< access from a remote machine).
---
> In order to stress test our system,
> we ran it on the 5,300,000 instances
> used in the 1999 KDD cup\footnote{\url{http://www.ai.univie.ac.at/~bernhard/kddcup99.html}}.
> The KDD data dealt with
> network intrusion
> detection and was
> divided into
> a training set
> of about five million instances and a {\em
> test set} of 311,029 instances.
> The data comprised
> 6 discrete attributes, 34 continuous attributes, and
> 38 classes which
> fell into four main
> categories: {\em normal} (no attack);
> {\em probe} (surveillance and other probing);
> {\em DOS} (denial-of-service);
> {\em U2R} (unauthorized access to local super-user privileges);
> and {\em R2L} (unauthorized access from a remote machine).
596,697c627,651
< \fig{kdd99} shows all the sorted {\em mean M*C scores}
< from the KDD'99 entrants.
< Also shown in that figure is SAWTOOTH's mean \mbox{$M*C$} result.
< SAWTOOTH's results
< were close to the winning score of entrant \#1; very similar to
< entrants 10,11,12,13,14,15,16; and much better than entrants
< 18,19,20,21,22,23,24. These results are encouraging since SAWTOOTH is
< a much simpler tool than many of the other
< entries. For example, the winning entrant
< took several runs to divide the data into smaller
< subsets and build an ensemble of 50x10 C5 decision trees using an
< intricate cost-sensitive bagged boosting technique. This took more
< than a day to terminate on a dual-processor 2x300MHz Ultra-Sparc2
< machine with 512MB of RAM using the commercially available
< implementation of C5, written in ``C''. In contrast, our toolkit
< written in interpreted scripting languages (gawk/bash),
< processed
< all 5,300,000 instances in one scan of the data using less than 3.5
< Megabytes of memory. This took 11.5 hours on a 2GHz Pentium 4, with 500MB of
< RAM, running Windows/Cygwin
< and we conjecture that this
< runtime could be greatly reduced by porting our toolkit to ``C''.
<
< Another encouraging result is the {\em \# bins with tally=X} plot of
< \fig{kdd99}. One concern with SPADE is that several of its internal
< parameters are linked to the number of processed instances; e.g. {\em
< MaxBins} is the square root of the number of instances. The 5,300,000
< instances of KDD'99 could therefore generate an impractically large number
< of bins for each numeric attribute. This worst-case scenario
< would occur if each consecutive group of $SubBins$ number of numeric
< values has different values from the previously seen groups {\em and}
< they are sorted in ascending or descending order. If this
< unlikely combination of events does {\em not} occur then the resulting
< bins would have tallies less than
< $MinInst$, encouraging it to merge with the next bin.
< In all our experiments,
< we have never seen this worst-case behavior.
< In KDD'99, for example,
< SPADE only ever generated 2 bins for 20 of the 40 attributes. Also,
< for only two of the attributes, did SPADE generate more than 50
< bins. Further, SPADE never generated more than 100 bins for any attribute.
< \begin{figure}[!t]
< {\scriptsize
< \begin{center}
< \includegraphics[width=1.4in]{data/kdd.pdf}~~~~
< \includegraphics[width=1.4in]{data/kddbands.pdf}
< \end{center}}
< \caption{SAWTOOTH and the KDD'99 data}\label{fig:kdd99}
< \end{figure}
< \newcommand{\upp}{$\blacktriangle$}
< \newcommand{\downn}{$\blacktriangledown$}
<
< \begin{figure*}
< \begin{center}
< \begin{minipage}{.6\linewidth}
< \begin{center}
< {\scriptsize
< \begin{tabular}{l@{~}r|@{~}l|l@{}c@{~}l@{~}c@{~}l@{}c@{}lc@{}r|}\cline{3-9}
< \multicolumn{2}{c}{}& \multicolumn{1}{|c|}{incremental} &\multicolumn{6}{c|}{batch}\\\cline{3-9}
< data set & \#instances & SAWTOOTH & NB & & nbk&&j48\\\hline
< \input{data/10way.dat}
< \end{tabular}}
< \end{center}
< \end{minipage}\hfill
< \begin{minipage}{.35\linewidth}
< \begin{center}
< \includegraphics[width=2in]{data/10way.pdf}
< ~\\
< ~\\~\\~\\
< {\scriptsize
< \begin{tabular}{|c|c|c|c|c|c|c|} \hline
< learner & win - loss & win & loss & ties \\ \hline
< J48 & 10 & 25 & 15 & 20 \\
< nbk & 8 & 17 & 9 & 34\\
< SAWTOOTH & -6 & 12 & 18 & 30\\
< NB & -12 & 9 & 21 & 30 \\
< \hline
< \end{tabular}
< %~\newline~\newline
< }\end{center}
< \end{minipage}
< \end{center}
< \caption{$mean \pm standard\; deviations$
< seen in 10*10-way cross validation
< experiments on UCI Irvine data sets.
< ``NB'' and ``nbk'' denote N\aive Bayes
< classifiers that use gaussians
< to model continuous attributes. ``NB'' uses a single
< gaussian while ``nbk'' uses a sum of gaussians in the method
< recommended by John and Langley~\cite{john95}.
< The plot top-right sorts the differences in the accuracies
< found by SAWTOOTH and all the other learners.
< Some of those differences aren't statistically significant:
< the {\upp} or {\downn} symbols in the left-hand-side table denote
< mean differences that are significantly more or less (respectively)
< than SAWTOOTH
< at the $\alpha=0.05$ level.
< The significant differences between
< all the learners are shown in
< the win-loss statistics of the bottom-right table.
< }\label{fig:uci}
< \end{figure*}
---
> \fig{kdd99} shows the {\em mean M*C scores}
> for SAWTOOTH and the KDD'99 entrants.
> SAWTOOTH's mean $M*C$
> results were
> close to the winning
> score of entrant \#1;
> very similar to entrants 10,11,12,13,14,15,16; and
> better than
> entrants 18,19,20,21,22,23,24.
> These results are encouraging since SAWTOOTH is a much simpler tool than the winning
> entry (which classified the security incidents using
> an ensemble of decision trees built from a 50*10 cross-val).
>
> Another encouraging result is the {\em \# bins with tally=X}
> plot of \fig{kdd99}. One concern with
> SPADE is that several of its internal
> parameters are linked to the number of processed
> instances; e.g. {\em MaxBins} is the square root of the number of instances.
> The 5,300,000 instances of KDD'99 could therefore, in the worst case,
> generate thousands of bins for each numeric attribute. In all our experiments,
> we have never seen this worst-case behavior. In KDD'99, for example,
> SPADE only ever generated
> 2 bins for 20 of the 40 attributes. Also, for only two
> of the attributes, did SPADE generate more than 50 bins. Lastly,
> SPADE never generated more than 100 bins.
707,717d660
< Attempts to test our system using other KDD cup data were not successful, for a variety
< of reasons\footnote{The KDD'04 evaluation portal was off-line
< during the period when SAWTOOTH was being developed.
< The KDD'03 problem required feature
< extraction from free text, something that is
< beyond the scope of this research.
< The data for KDD'02 is no longer on-line.
< The KDD'01 had data with 130,000 attributes
< and we don't yet know how to extend our technique to
< such a large attribute space.
< We had trouble following the KDD'00 documentation.}.
735c678
< usually performs worse than learning from the total
---
> usually performs worse than learning from the total
905,906c848
< %\bibliography{../../refs}
< \bibliography{refs}
---
> \bibliography{../../refs}
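
The hunks above describe SPADE's binning policy only in fragments: one bin for the first value, $SubBins=5$ new bins when a value lands outside the current MIN..MAX, merges deferred to era ends, and a bin budget tied to the square root of the instances seen so far. The following is a minimal Python sketch assembled from those fragments. The class name, the merge order, the even splitting of the new range, and the $MinInst=5$ default are our assumptions for illustration; this is not the paper's gawk/bash implementation.

import math
import random

class Spade:
    """Sketch of an incremental, one-scan discretizer in the style of SPADE."""
    def __init__(self, sub_bins=5, min_inst=5):   # MinInst=5 is an assumed default
        self.sub_bins = sub_bins
        self.min_inst = min_inst
        self.bins = []     # sorted list of [lo, hi, tally]
        self.seen = 0      # instances processed so far

    def place(self, x):
        """Bin a new value, extending MIN/MAX with SubBins new bins if needed."""
        self.seen += 1
        if not self.bins:                      # first value: one bin, MIN=MAX=x
            self.bins.append([x, x, 1])
            return 0
        if x < self.bins[0][0]:                # below MIN: prepend SubBins bins
            self._grow(x, self.bins[0][0], prepend=True)
        elif x > self.bins[-1][1]:             # above MAX: append SubBins bins
            self._grow(self.bins[-1][1], x, prepend=False)
        for i, (lo, hi, _) in enumerate(self.bins):
            if lo <= x <= hi:
                self.bins[i][2] += 1
                return i
        return len(self.bins) - 1              # fall-through guard

    def _grow(self, a, b, prepend):
        """Cover the new range [a,b] with SubBins equal-width empty bins."""
        width = (b - a) / self.sub_bins
        new = [[a + k * width, a + (k + 1) * width, 0] for k in range(self.sub_bins)]
        self.bins = new + self.bins if prepend else self.bins + new

    def end_of_era(self):
        """Merge low-tally bins, but only at era boundaries (never mid-era)."""
        max_bins = max(1, int(math.sqrt(self.seen)))
        merged = True
        while merged and len(self.bins) > 1:
            merged = False
            for i in range(len(self.bins) - 1):
                if self.bins[i][2] < self.min_inst or len(self.bins) > max_bins:
                    lo, _, t1 = self.bins[i]          # fold bin i+1 into bin i
                    _, hi, t2 = self.bins[i + 1]
                    self.bins[i:i + 2] = [[lo, hi, t1 + t2]]
                    merged = True
                    break

# usage: feed values one era at a time; merges only happen between eras
spade = Spade()
for era in range(3):
    for _ in range(100):                       # E=100 instances per era
        spade.place(random.gauss(0, 1))
    spade.end_of_era()
print(len(spade.bins), "bins after 300 instances")

The exact merge policy would matter more in a deployment than it does here; the point of the sketch is only that both growing and merging read nothing but the bins themselves, so the data really is scanned once.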
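
The same hunks also say that the bin number, not the raw numeric, is what reaches the N\aive Bayes {\tt update}, and that a bin merge must trigger a matching merge of frequency-table entries (the $F$ variable of \fig{class}). Below is a small sketch of that coupling, again using our own table layout and function names rather than the paper's routines, and assuming bins are identified by index.

from collections import defaultdict

# F[klass][attr][value]: frequency counts used by the Naive Bayes classifier
F = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

def update(klass, attr, value):
    """Record one (class, attribute, value) observation."""
    F[klass][attr][value] += 1

def merge_bins(attr, i, j):
    """When SPADE merges bin j into bin i, fold the matching counts as well."""
    for klass in F:
        counts = F[klass][attr]
        counts[i] += counts.pop(j, 0)

# Numeric attributes are discretized first, so the 'value' passed to update()
# is a bin index, e.g.:  update("normal", "duration", spade.place(x)).
# A full implementation would also renumber the bins to the right of a merge.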