%\documentclass[onecolumn,draftcls,journal,compsoc]{IEEEtran}
%\documentclass[technote]{IEEEtran}
%\documentclass{sig-alternate}
\documentclass[a4paper,12pt]{article}
\usepackage{graphicx}
\usepackage{wrapfig,url}
\newcounter{over}
\newcounter{max}
\newcommand{\fred}[2]{\setcounter{over}{#2}\addtocounter{over}{#1}}
\newcommand{\boxplot}[5]{%
  \scalebox{0.5}{\textcolor{white}{\setcounter{max}{#4}\addtocounter{max}{#5}\fred{#4}{-#2}}%
  \begin{picture}(100,10)%
    \put(0,0){\line(0,1){8}}%
    \put(100,0){\line(0,1){8}}%
    %\put(#1,4){\circle*{1}}%
    %\put(\themax,4){\circle*{1}}%
    \put(#2,4){\line(1,0){\theover}}%
    \put(#3,4){\circle*{8}}%
    %\put(#4,4){\line(1,0){#5}}%
    \put(50,0){\line(0,1){8}}%
  \end{picture}}%
}
\usepackage{eso-pic}
\usepackage{color}
\usepackage{type1cm}
%\makeatletter
%\AddToShipoutPicture{%
%  \setlength{\@tempdimb}{.5\paperwidth}%
%  \setlength{\@tempdimc}{.5\paperheight}%
%  \setlength{\unitlength}{1pt}%
%  \makebox(600,900){\rotatebox{45}{\textcolor[gray]{0.80}{\fontsize{5cm}{5cm}\selectfont{Draft v5}}}}}
%\makeatother
\newenvironment{smallitem}
 {\setlength{\topsep}{0pt}
  \setlength{\partopsep}{0pt}
  \setlength{\parskip}{0pt}
  \begin{itemize}
  \setlength{\leftmargin}{.2in}
  \setlength{\parsep}{0pt}
  \setlength{\parskip}{0pt}
  \setlength{\itemsep}{0pt}}
 {\end{itemize}}
\newenvironment{smallenum}
 {\setlength{\topsep}{0pt}
  \setlength{\partopsep}{0pt}
  \setlength{\parskip}{0pt}
  \begin{enumerate}
  \setlength{\leftmargin}{.2in}
  \setlength{\parsep}{0pt}
  \setlength{\parskip}{0pt}
  \setlength{\itemsep}{0pt}}
 {\end{enumerate}}
\newcommand{\bd}{\begin{description}}
\newcommand{\ed}{\end{description}}
\newcommand{\bi}{\begin{smallitem}}
\newcommand{\ei}{\end{smallitem}}
\newcommand{\be}{\begin{smallenum}}
\newcommand{\ee}{\end{smallenum}}
\newcommand{\fig}[1]{Figure~\ref{fig:#1}}
\newcommand{\eq}[1]{Equation~\ref{eq:#1}}
\newcommand{\tion}[1]{\S\ref{sec:#1}}
%\usepackage{times}
\usepackage{alltt}
\usepackage{url}
% IEEE Computer Society needs nocompress option
% requires cite.sty v4.0
% or later (November 2003)
\usepackage{cite} % normal IEEE
%\usepackage{cite}
% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}
\pdfpagewidth=8.5in
\pdfpageheight=11in
\begin{document}
%\conferenceinfo{PROMISE'09,} {May 18--19, 2009, Vancouver, Canada}
%\conferenceinfo{~} {~}
%\CopyrightYear{2009}
%\crdata{978-1-60558-036-4/08/05}
%\CopyrightYear{2008} % Allows default copyright year (200X) to be over-ridden - IF NEED BE.
%\crdata{0-12345-67-8/90/01} % Allows default copyright data (0-89791-88-6/97/05) to be over-ridden - IF NEED BE.
\title{Open Source Data Mining with OURMINE}
\maketitle
%\titlenote{Submitted to the PROMISE 2008 workshop on repeatable SE experiments.
%%A draft version of this paper may be downloaded from
%\url{http://menzies.us/pdf/08keys.pdf}
%}
\begin{center}
\author{Adam Nelson, \and Tim Menzies, \and Gregory Gay\\
CSEE, WVU, Morgantown, WV\\
\texttt{anelson8@mix.wvu.edu, tim@menzies.us, gregoryg@csee.wvu.edu}}
\end{center}
\begin{abstract}
In this paper we discuss OURMINE, a data mining environment for the development and deployment of experiments, and its application in both data mining instruction and acquisition. We then introduce two recent experiments conducted using OURMINE. The first experiment determines how defect predictors learned from one site apply to another, using PROMISE data. This is important because it shows that such data can be used not only for academic purposes but also in industry: companies can use PROMISE data to build their own defect predictors when local data is not available. The second experiment explores the benefits of faster clustering and reduction algorithms over more rigorous methods, using cluster inter/intra similarities and purity measures.
\end{abstract}

\section{Introduction}

Open source environments for data mining are abundantly available.
Tools such as ``R''\footnote{\url{http://www.r-project.org/}}, ORANGE\footnote{\url{http://magix.fri.uni-lj.si/orange/}} (see \fig{orange}), MATLAB\footnote{\url{http://www.mathworks.com}}, and WEKA\footnote{\url{http://www.cs.waikato.ac.nz/ml/weka/}}~\cite{witten05} from the University of Waikato's Machine Learning Group (recipient of the 2005 SIGKDD Data Mining and Knowledge Discovery Service Award) are used by hundreds of data miners for both teaching and experimentation. While these tools are noteworthy, our concern with them is the same as that raised by Ritthoff et al.~\cite{ritthoff01}: real-world data mining is complex, and requires combining a multitude of tools (data pre-processors, data miners, report generators, and so on) rather than simply running a single algorithm. We agree with Ritthoff et al. that the interface offered by a tool such as WEKA does not support the rapid generation of these required combinations.

The demand for connectivity between data mining tools has yielded several responses. For instance, WEKA introduced a visual programming environment in which nodes denote data pre-processors, data miners, and so on, and arcs between the nodes represent data flows. Similarly, ORANGE provides a visual representation, as seen in \fig{orange}. But while these visual environments are important, they are not always beneficial. According to Gay et al.~\cite{gay09}, data mining students find such environments discouraging or distracting, owing to tool complexity or the time wasted constructing dataflows instead of actually {\em data mining}.
\begin{figure}[!t]
\begin{center}
\includegraphics[width=2.25in]{weka.jpg}
\end{center}
\caption{WEKA.}\label{fig:weka}
\end{figure}

\begin{figure}[!t]
\begin{center}
\includegraphics[width=2.25in]{orange.jpg}
\end{center}
\caption{Orange.}\label{fig:orange}
\end{figure}

As we discuss in this paper, in our experience OURMINE provides a preferable environment for building and sharing publishable experiments over existing data mining toolkits. To support this claim, we present two experiments conducted using OURMINE whose results could be inserted into a research paper or thesis.

\section{OURMINE}

OURMINE is a toolkit used at West Virginia University both to teach data mining through small examples and to provide an environment for conducting large, time-consuming experiments. The toolkit operates on UNIX-based operating systems and uses shell scripting at its core. This allows any tool that can be executed from a command prompt to be seamlessly ``linked'' with other tools. For example, \fig{clean} shows a simple BASH function used in OURMINE to clean text data before conducting experiments on it. Line 6 passes text from a file through tokenization, removal of capitals, and removal of unimportant words found in a stop list; the next line applies Porter's stemming algorithm to the result. The modules shown are written in BASH, awk and perl. OURMINE therefore allows connectivity between tools written in various languages, as long as a command-line API is available for each tool.
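The stages of this cleaning pipeline can be mimicked in a few lines of Python. This is only an illustrative sketch: OURMINE implements the stages as separate BASH/awk/perl filters ({\em tokes}, {\em caps}, {\em stops}, {\em stems}), and the crude suffix-stripper below merely stands in for Porter's stemming algorithm.

```python
# Illustrative sketch of the clean() pipeline stages: tokenize, drop
# capitals, remove stop words, stem. Names and stop list are our own.
import re

STOPS = {"the", "is", "are", "at", "of", "on", "and", "a"}

def tokes(text):
    # split the raw text into alphabetic tokens
    return re.findall(r"[a-zA-Z]+", text)

def caps(tokens):
    # remove capitalization
    return [t.lower() for t in tokens]

def stops(tokens, stop_list=STOPS):
    # remove unimportant words found in a stop list
    return [t for t in tokens if t not in stop_list]

def stem(tokens):
    # crude stand-in for Porter stemming: strip a few common suffixes
    out = []
    for t in tokens:
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out

def clean(text):
    # the whole pipeline, mirroring line 6 of the clean() figure
    return stem(stops(caps(tokes(text))))

print(clean("The miners are Mining the data"))   # ['miner', 'min', 'data']
```

In OURMINE the same chain is expressed as a shell pipeline, which is the point of the toolkit: each stage is an independent command-line filter.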
\begin{figure}
~\hrule~
{\scriptsize\begin{verbatim}
 1 clean(){
 2     local docdir=$1
 3     local out=$2
 4
 5     for file in $docdir/*; do
 6         cat $file | tokes | caps | stops $Lists/stops.txt > tmp
 7         stems tmp >> $out
 8         rm tmp
 9     done
10 }
\end{verbatim}}
~\hrule~
\caption{An OURMINE function to clean text documents and collect the results.}\label{fig:clean}
\end{figure}

The following sections describe OURMINE's functions and applications.

\subsection{Built-in Data and Functions}

OURMINE comes with the following public domain datasets, so experiments can be conducted right after installation (see the appendix):
\bi
\item Text, including STEP datasets (numeric): ap203, ap214, bbc, bbcsport, law, 20 Newsgroup subsets [sb-3-2, sb-8-2, ss-3-2, sl-8-2]\footnote{\url{http://mlg.ucd.ie/datasets}}
\item UCI (discrete): anneal, colic, hepatitis, kr-vs-kp, mushroom, sick, waveform-5000, audiology, credit-a, glass, hypothyroid, labor, pcolic, sonar, vehicle, weather, autos, credit-g, heart-c, ionosphere, letter, primary-tumor, soybean, vote,\newline weather.nominal, breast-cancer, diabetes, heart-h, iris, lymph, segment, splice, vowel
\item UCI (numeric): auto93, baskball, cholesterol, detroit, fruitfly, longley, pbc, quake, sleep, autoHorse, bodyfat, cleveland, echoMonths, gascons, lowbwt, pharynx, schlvote, strike, autoMpg, bolts, cloud, elusage, housing, mbagrade, pollution, sensory, veteran, autoPrice, breastTumor, cpu, fishcatch, hungarian, meta, pwLinear, servo, vineyard
\item PROMISE (discrete): CM1, KC1, KC2, KC3, MC2, MW1, PC1
\ei
OURMINE also comes with a variety of built-in functions for data mining and text mining tasks; a few of these functions are shown in \fig{functions}.

\subsection{Adding/Modifying Code}

Custom scripts are added to OURMINE either by supplying a new executable BASH file or by adding to or modifying functions in the existing code.
Either method can be done quickly, without deep knowledge of BASH or of how OURMINE's file structure works. For example, if a user wishes to add a {\em myFunctions.sh} file to the environment:
\begin{itemize}
\item{Create an executable {\em myFunctions.sh} containing the custom scripts}
\item{Include this file in {\em \$HOME/opt/ourmine/our/lib/sh/minerc.sh}}
\end{itemize}
To add to or modify code in existing files, an edit and save is all that is required. Once this basic structure of OURMINE is learned, more advanced additions can be made, such as including multiple directories of BASH files, some of whose functions call code written in other files and other languages located elsewhere in the codebase. This gives nearly limitless opportunity for customization.

\subsection{Learning and Teaching with OURMINE}

Standard data mining concepts can appear overly complex when implemented in an intricate, highly specific system such as WEKA. For this reason, OURMINE uses scripting in BASH~\cite{ramey94} and GAWK~\cite{awkbook} to better convey the inner workings of these concepts. For instance, \fig{tfidf} shows a GAWK implementation used by OURMINE to determine the TF-IDF~\cite{ramos03} values of each term in a document. This script is simple and concise, while a C++ or Java implementation would be large and overly complex. GAWK is used in OURMINE for its simplicity and power. R. Loui, an associate professor of Computer Science at Washington University in St. Louis, encourages the use of GAWK in his artificial intelligence classes, and writes:
\begin{quote}{\em There is no issue of user-interface. This forces the programmer to return to the question of what the program does, not how it looks.
There is no time spent programming a binsort when the data can be shipped to /bin/sort in no time.} ~\cite{gawkai}
\end{quote}

\begin{figure}
~\hrule~
{\scriptsize\begin{verbatim}
#update counters for all words in the record
function train() {
    Documents++;
    for(I=1;I<=NF;I++)
        Word[$I]++
}
\end{verbatim}}
~\hrule~
\caption{Part of the GAWK implementation used by OURMINE to compute TF-IDF values: the {\em train} function updates word counters for each incoming document.}\label{fig:tfidf}
\end{figure}

\begin{figure}
~\hrule~
{\scriptsize\begin{verbatim}
...
21          ...              > produced.dat
22        for goal in $goals; do
23          cat produced.dat |
24          abcd --prefix "`basename $dat`,$run,$bin,$learner,$goal" \
25               --goal "$goal" \
26               --decimals 1
27        done
28       done
29      done
30      blabln
31    done
32  done | sort -t, -r -n -k 11,11 | malign > $out
33  less $out
34 }
\end{verbatim}}
~\hrule~
\caption{A demo OURMINE experiment (abridged). The worker function cycles specified learners over a data set and analyzes the results to find {\em a,b,c,d,accuracy,pd,pf,precision} and {\em balance} values.}
\label{fig:demo004}
\end{figure}

Function documentation provides a way for newcomers to OURMINE not only to learn the workings of each function, but also to add to and modify the current documentation. Instead of asking the user to navigate a more complicated ``man page'' system, OURMINE uses a very simple scheme consisting of keywords such as {\em name, args, eg} to represent a function's name, its arguments, and an example of how to use it. Using this documentation is simple. Entering {\em funs} at the OURMINE prompt prints a sorted list of all available functions in OURMINE. Then, by typing {\em help X}, where {\em X} is the name of a function, information about that function is printed to the screen. See \fig{help} for an example of viewing the help document for the function {\em j4810}. Documentation for a function is added by supplying a text file, named after the function, to the helpdocs directory in OURMINE.

From the teaching perspective, demonstrating a particular data mining concept on-the-fly helps not only to solidify the concept, but also gets the student accustomed to using OURMINE as a tool in the course.
For example, if a Naive Bayes classifier is introduced as a topic in class, an instructor can show the workings of the classifier by hand with a calculator, and then immediately afterwards complement this by running Naive Bayes on a small dataset in OURMINE. Also, since most of OURMINE does not use pre-compiled code, an instructor can make live changes to the scripts and quickly show the results. \fig{demo004} shows a data mining experiment to be used as a demo. Once a student learns the basic mechanics of this demo experiment, its framework can be copied to another function within OURMINE to conduct a custom experiment using any number of tools.

\begin{figure}[!t]
~\hrule~
{\scriptsize\begin{verbatim}
Function: j4810
Arguments:
Example(s): j4810 weather.arff
Description:
    Uses a j48 decision tree learner on the input data
Function Code:
==============
j4810 () {
    local learner=weka.classifiers.trees.J48
    $Weka $learner -C 0.25 -M 2 -i -t $1
}
\end{verbatim}}
~\hrule~
\caption{Function help in OURMINE.}\label{fig:help}
\end{figure}

\begin{figure}[!t]
\small
~\hrule~
\begin{description}
\item[abcd] provides analysis of experiments, such as {\em pd,pf,balance} and {\em precision} values;
\item[clean] cleans text for further processing, removing tokens, capitalizations, stop words, etc.;
\item[docsToSparff] constructs a sparse arff file from a directory of documents;
\item[docsToTfidfSparff] generates a sparse arff file of TF-IDF values from a directory of documents;
\item[funs] shows a sorted list of all available functions in OURMINE;
\item[logArff] logs all numeric values in a data set;
\item[malign] neatly aligns text into columns;
\item[nb] runs Naive Bayes on the given data;
\item[rankViaInfoGain] ranks attributes by InfoGain values;
\item[makeTrainAndTest] splits a dataset into a test set and a training set, written as {\em train.arff} and {\em test.arff}, as well as {\em train.lisp} and {\em test.lisp}.
\end{description}
~\hrule~
\caption{OURMINE functions give the user something with which to start running demos and experiments.}\label{fig:functions}
\end{figure}

\section{Using OURMINE for Research}

To demonstrate that OURMINE is not merely a tool for data mining instruction or for small experiments conducted in a classroom setting, the first author of this paper has reproduced variations of two recent, publishable experiments using OURMINE. The second experiment uses code written entirely by the first author, while the first uses scripts written by all authors of this paper.

The first experiment, based on those conducted by Turhan et al.~\cite{turhan08} as well as Gay et al., focuses on binary defect prediction. In~\cite{turhan08}, Turhan et al. conducted three experiments to decide between cross-company (CC) data obtained from other sites and within-company (WC) data gathered locally. Those experiments showed that CC data, when applied using {\em relevancy filtering} (explained below), can lead to defect predictors almost as effective as those built from WC data. Thus, as stated by Gay et al., ``...while local data is the preferred option, it is feasible to use imported data provided that it is selected by a relevancy filter.''

The second experiment is a small-scale reproduction of one conducted by a graduate computer science student at West Virginia University for a Master's thesis. Its purpose is to show that faster, heuristic means of clustering and dimensionality reduction yield results comparable to slower, more rigorous methods, judged by the methods' runtimes, cluster similarities, and cluster purities.

In the following sections we describe the experiments and show some of the scripts used to conduct them. Finally, we analyze the results.
\subsection{Experiment I}

OURMINE was used to reproduce Turhan et al.'s experiment, using a Naive Bayes classifier in conjunction with a k-Nearest Neighbor (k-NN) relevancy filter. Relevancy filtering groups similar instances together in order to obtain a training set that is homogeneous with the testing set. Thus, by using a training set that shares similar characteristics with the testing set, it is assumed that a bias will be introduced into the model. The k-NN filter works as follows: for each instance in the test set, the k nearest neighbors in the training set are chosen; then duplicates are removed, and the remaining instances are used as the new training set.

Recently, Gay et al.~\cite{gay09} confirmed that using a relevancy filter on CC data is nearly as good as using WC data; in that experiment, however, a locally weighted scheme ({\em lwl}~\cite{hallLWL}) was used as the filter.

\subsubsection{Building the Experiment}

A sample of the script used to conduct this experiment is shown in \fig{promiseExp}; the complete script can be obtained as described in the appendix. To begin, the data in this study were the same as used by Gay et al.: seven PROMISE defect data sets (CM1, KC1, KC2, KC3, MC2, MW1, PC1) were used to build seven combined data sets, each containing $\frac{6}{7}$ of the data. For instance, the file {\em combined\_PC1.arff} contains the six data sets {\em other than} PC1, and serves as the training set for the cross-company (CC) study of PC1. Next, as can be seen in line 15 of \fig{promiseExp}, a 10-way cross validation was conducted by calling {\em makeTrainAndTest}, a built-in OURMINE function that randomly shuffles the data and constructs both a test set containing 10\% of the data and a training set containing the remaining 90\%. This was repeated ten times, and the resulting data was used in the subsequent studies. For instance, in lines 18-24 the original test and training sets are used for the first WC study.
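The k-NN relevancy filter described above can be sketched as follows. This is an illustrative Python sketch only: the experiment itself performs this step with the {\em \$Clusterers -knn} call shown in \fig{promiseExp}, and the instance representation (numeric tuples, Euclidean distance) is our own.

```python
# Sketch of the k-NN relevancy filter: for each test instance, pick its
# k nearest training instances, then discard duplicates.
from math import dist  # Euclidean distance, Python 3.8+

def knn_filter(train, test, k=10):
    selected = []
    for t in test:
        # the k nearest neighbors of this test instance in the training set
        selected.extend(sorted(train, key=lambda r: dist(r, t))[:k])
    # remove duplicates, preserving order, to form the new training set
    seen, filtered = set(), []
    for r in selected:
        if r not in seen:
            seen.add(r)
            filtered.append(r)
    return filtered

train = [(0, 0), (1, 1), (9, 9), (10, 10)]
test = [(0.5, 0.5), (1.5, 1.5)]
print(knn_filter(train, test, k=2))   # [(0, 0), (1, 1)]
```

Note that the filtered training set can be no larger than $k$ times the test set size, which is how the filter discards cross-company instances irrelevant to the local test data.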
However, in the WC experiment using relevancy filtering (lines 25-33), the same test set is used with the newly created training set. Lines 34-41 show the first CC study. This study is identical to the WC study except that, as noted above, we use {\em combined\_X.arff} files instead of {\em shared\_X.arff} files. We chose a Naive Bayes classifier for this study because it was used in the original experiment by Turhan et al.~\cite{turhan08}, and because Naive Bayes has been shown to perform better on PROMISE defect data sets than other learners~\cite{lessmann08}.

\begin{figure}[!t]
~\hrule~
{\scriptsize\begin{verbatim}
 1 promiseDefectFilterExp(){
 2  local learners="nb"
 3  local datanames="CM1 KC1 KC2 KC3 MC2 MW1 PC1"
 4  local bins=10
 5  local runs=10
 6  local out=$Save/defects.csv
 7  for((run=1;run<=$runs;run++)); do
 8   for dat in $datanames; do
 9    combined=$Data/promise/combined_$dat.arff
10    shared=$Data/promise/shared_$dat.arff
11    blabln "data=$dat run=$run"
12    for((bin=1;bin<=$bins;bin++)); do rm -rf test.lisp test.arff train.lisp train.arff
13     cat $shared |
14      logArff 0.0001 "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19" > logged.arff
15     makeTrainAndTest logged.arff $bins $bin
16     goals=`cat $shared | getClasses --brief`
17     for learner in $learners; do
18      blabln "WC"
19      $learner train_shared.arff test_shared.arff | gotwant > produced.dat
20      for goal in $goals; do
21       cat produced.dat | abcd --prefix "$dat,$run,$bin,WC,$learner,$goal" \
22                               --goal "$goal" \
23                               --decimals 1
24      done
25      blabln "WCkNN"
26      rm -rf knn.arff
27      $Clusterers -knn 10 test_shared.arff train_shared.arff knn.arff
28      $learner knn.arff test_shared.arff | gotwant > produced.dat
29      for goal in $goals; do
30       cat produced.dat | abcd --prefix "$dat,$run,$bin,WkNN,$learner,$goal" \
31                               --goal "$goal" \
32                               --decimals 1
33      done
34      blabln "CC"
35      makeTrainCombined $combined > com.arff
36      cat com.arff | logArff 0.0001 "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19" > logged.arff
37      $learner logged.arff test_shared.arff | gotwant > produced.dat
38      for goal in $goals; do
39       cat produced.dat | abcd --prefix "$dat,$run,$bin,CC,$learner,$goal" \
40                               --goal "$goal" \
41                               --decimals 1
...
\end{verbatim}}
~\hrule~
\caption{A sample of the script used to conduct the WC vs. CC experiment.}\label{fig:promiseExp}
\end{figure}

\subsubsection{Results}

Our results for this experiment can be found in \fig{pds} and \fig{pfs}. \fig{pds} shows $pd$ (probability of detection) values sorted in decreasing order, while \fig{pfs} shows $pf$ (probability of false alarm) values sorted in increasing order. Note that a higher $pd$ is better, while a lower $pf$ is better. The last column of each figure shows quartile charts of the methods' $pd$ and $pf$ values. The black dot in the center of each plot represents the median value, and the lines extending left and right from this dot show the second and third quartiles respectively.

Column one of each figure gives each method a rank based on the results of a Mann-Whitney test at 95\% confidence. A rank is determined by how many times a learner or learner/filter combination loses to another; the method that lost the fewest times is given the highest rank.
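The win-loss ranking procedure can be sketched as follows. This is illustrative Python only, not OURMINE's actual {\em abcd} analysis code; it uses the normal approximation to the Mann-Whitney U statistic (reasonable for samples of ten or more), and the $pd$ values shown are invented for the example.

```python
# Rank methods by number of losses in pairwise Mann-Whitney tests at
# 95% confidence (z > 1.96). Illustrative sketch; data values invented.
from itertools import combinations
from math import sqrt

def avg_ranks(values):
    """Map each value to its average rank (1-based, ties averaged)."""
    svals = sorted(values)
    rank_of, i = {}, 0
    while i < len(svals):
        j = i
        while j < len(svals) and svals[j] == svals[i]:
            j += 1
        rank_of[svals[i]] = (i + 1 + j) / 2.0   # average of ranks i+1..j
        i = j
    return rank_of

def loses_to(xs, ys, z_crit=1.96):
    """True if xs is significantly lower than ys (95% confidence)."""
    r = avg_ranks(xs + ys)
    n1, n2 = len(xs), len(ys)
    u1 = sum(r[v] for v in xs) - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (u1 - mu) / sigma < -z_crit

def rank_methods(results):
    """Order methods by number of pairwise losses (fewest losses first)."""
    losses = {m: 0 for m in results}
    for a, b in combinations(results, 2):
        if loses_to(results[a], results[b]): losses[a] += 1
        if loses_to(results[b], results[a]): losses[b] += 1
    return sorted(results, key=lambda m: losses[m])

# invented pd values, ten cross-validation results per method
pds = {"WC":    [80, 82, 79, 85, 81, 78, 83, 84, 80, 82],
       "CC":    [60, 55, 62, 58, 61, 57, 59, 63, 56, 60],
       "WCkNN": [75, 77, 74, 78, 76, 73, 77, 75, 76, 74]}
print(rank_methods(pds))   # ['WC', 'WCkNN', 'CC']
```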
\subsection{Experiment II}

As stated above, the purpose of this experiment is to determine whether fast heuristic clustering and reduction methods can compete with slower, more thorough and rigorous ones when comparing runtimes, cluster similarities, and cluster purities. The datasets used in this experiment are:
\begin{itemize}
\item{EXPRESS schemas: AP-203, AP-214}
\item{Text mining datasets: BBC, Reuters, The Guardian (multi-view text datasets), 20 Newsgroup subsets [sb-3-2, sb-8-2, ss-3-2, sl-8-2]\footnote{\url{http://mlg.ucd.ie/datasets}}}
\end{itemize}

\subsubsection{Clustering}

The process of clustering data into similar groups can be used in a wide variety of applications, such as:
\begin{itemize}
\item{Marketing: finding groups of customers with similar behaviors given a large database of customer data}
\item{Biology: classification of plants and animals given their features}
\item{WWW: document classification, and clustering weblog data to discover groups of similar access patterns\footnote{\url{http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/}}}
\end{itemize}
Thus the purpose of clustering is to ``determine the intrinsic grouping in a set of unlabeled data''\footnotemark[\value{footnote}].

\subsubsection{Reduction Methods}

Data can be overwhelmingly large, containing a great number of attributes, or dimensions. Dimensionality reduction methods shrink this vast multidimensional space by filtering the attributes and attempting to select the most relevant ones. In ``A Survey of Dimension Reduction Techniques''\footnote{\url{https://e-reports-ext.llnl.gov/pdf/240921.pdf}} by I.K.
Fodor, the author writes:
\begin{quote}
{\em One of the problems with high-dimensional datasets is that, in many cases, not all the measured variables are ``important'' for understanding the underlying phenomena of interest.}
\end{quote}
In other words, to reduce the size (and the inevitable number of computations) of our data, our goal is to extract the most relevant information and throw out the rest.

\subsubsection{The Algorithms}

While there are many clustering algorithms in use today, this experiment focused on three: a naive K-means implementation, GenIc~\cite{genic04}, and clustering using canopies~\cite{canopies00}. K-means, a special case of a class of EM algorithms, works as follows:
\begin{enumerate}
\item{Select initial {\em K} centroids at random;}
\item{Assign each incoming point to its nearest centroid;}
\item{Adjust each cluster's centroid to the mean of the points in that cluster;}
\item{Repeat steps 2 and 3 until the centroids in all clusters stop moving by any noteworthy amount.}
\end{enumerate}
Here we use a naive implementation of K-means requiring {\em K}*{\em N}*{\em I} comparisons, where {\em N} and {\em I} represent the total number of points and the maximum number of iterations, respectively.

GenIc is a single-pass, stochastic clustering algorithm. It begins by initially selecting {\em K} centroids at random from all instances in the data. At the beginning of each generation, the weight of each centroid is set to one. When a new instance arrives, the nearest centroid is nudged toward that instance and its score is increased. In this process, centroids become ``fatter'' and slow down the rate at which they move toward newer examples. When a generation ends, the centroids carrying less than {\em X} percent of the maximum weight are replaced with {\em N} more random centroids. This is repeated for many generations, and the highest scoring centroids are returned.
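For concreteness, the four K-means steps enumerated above can be sketched in a few lines. This is illustrative Python only: OURMINE's actual clusterers sit behind the Java {\em \$Clusterers} command-line API, and points here are one-dimensional for brevity.

```python
# Minimal sketch of the naive K-means loop (1-D points for brevity).
import random

def kmeans(points, k, max_iter=100, seed=1):
    random.seed(seed)
    centroids = random.sample(points, k)          # step 1: random centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):                     # step 4: repeat ...
        clusters = [[] for _ in range(k)]
        for p in points:                          # step 2: nearest centroid
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]   # step 3: recompute means
        if new == centroids:                      # ... until nothing moves
            break
        centroids = new
    return centroids, clusters

centroids, clusters = kmeans([1, 2, 3, 10, 11, 12], k=2)
print(sorted(centroids))   # [2.0, 11.0]
```

Each pass over the data costs {\em K}*{\em N} comparisons, giving the {\em K}*{\em N}*{\em I} bound quoted above.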
Canopy clustering~\cite{canopies00} reduces the need to compare all items in the data with an expensive distance measure by first partitioning the data into overlapping subsets called {\em canopies}. Canopies are built using a cheap, approximate distance measure; then, a more expensive distance measure is used inside each canopy to cluster the data.

PCA, or Principal Components Analysis, is a reduction method that treats every instance in a dataset as a point in N-dimensional space. PCA looks for new dimensions that better fit these points. In more mathematical terms, it maps a set of possibly correlated variables onto a smaller set of uncorrelated variables called principal components. \fig{pcapic} shows an example of how two dimensions can be approximated by a single new dimension, shown by the dashed line.

\begin{figure}[h]
\begin{center}
\includegraphics[width=2.25in]{pca.png}
\end{center}
\caption{A PCA dimension feature.}\label{fig:pcapic}
\end{figure}

TF-IDF, or term frequency times inverse document frequency, reduces the number of terms by describing how important a term is to a document (or collection of documents): a term's importance grows with the number of times the term appears in a document, but is offset by the frequency of the term in the entire corpus. Thus we are concerned only with terms that occur frequently in a small set of documents and very infrequently everywhere else. To calculate the TF-IDF value of each term in a document, we use the following equation:
\begin{equation}
TfIdf(t_i,D_j) = \frac{tf(t_i,D_j)}{|D_j|}\log\left(\frac{|D|}{df(t_i)}\right)
\end{equation}
To reduce all terms (and thus dimensions), we sum the above over all documents:
\begin{equation}
TfIdf_{sum}(t_i) = \sum_{D_j\in D}TfIdf(t_i,D_j)
\end{equation}

\subsubsection{Building the Experiment}

This experiment was conducted entirely with OURMINE, using a collection of BASH scripts as well as custom Java code.
The framework was built as follows:
\begin{enumerate}
\item{A command-line API was developed in Java for parsing, reducing/clustering, and outputting the data. Java was chosen for its speed in executing computationally expensive instructions.}
\item{The data was then iteratively fed to this Java code via shell scripting. This provides many freedoms, such as allowing parameters to be altered as desired, as well as outputting experimental results in any manner seen fit.}
\end{enumerate}

\fig{clusterworker} shows the OURMINE code for clustering data using the K-means algorithm. Shell scripting gives us much leverage in this example. For instance, lines 2-5 show that by passing the function four parameters, we can cluster data in the range from {\em minK} to {\em maxK} on all data in {\em dataDir}. This was a powerful feature in this experiment, because it provides the opportunity to run the clusterer across multiple machines simultaneously. As a small example, suppose we wish to run K-means across three different machines, with a minimum {\em K} of 2 and a maximum {\em K} of 256. Since larger values of {\em K} generally yield longer runtimes, we may wish to distribute the execution as follows:
\begin{verbatim}
Machine 1: clusterKmeansWorker 256 256 0 dataDir
Machine 2: clusterKmeansWorker 64 128 2 dataDir
Machine 3: clusterKmeansWorker 2 32 2 dataDir
\end{verbatim}
Lines 9-13 of \fig{clusterworker} load the data from {\em dataDir} for every {\em k} and format the name of the output file. Lines 15-19 then start the timer, cluster the data, and output statistical information such as {\em k}, the dataset, and the runtime of the clusterer on that dataset. This file is used later in the analysis of the clusters.
\begin{figure}[h]
~\hrule~
{\scriptsize\begin{verbatim}
 1 clusterKmeansWorker(){
 2     local minK=$1
 3     local maxK=$2
 4     local incVal=$3
 5     local dataDir=$4
 6     local stats="clusterer,k,dataset,time(seconds)"
 7     local statsfile=$Save/kmeans_runtimes
 8     echo $stats >> $statsfile
 9     for((k=$minK;k<=$maxK;k*=$incVal)); do
10         for file in $dataDir/*.arff; do
11             filename=`basename $file`
12             filename=${filename%.*}
13             out=kmeans_k="$k"_$filename.arff
14             echo $out
15             start=$(date +%s.%N)
16             $Clusterers -k $k $file $Save/$out
17             end=$(date +%s.%N)
18             time=$(echo "$end - $start" | bc)
19             echo "kmeans,$k,$filename,$time" >> $statsfile
20         done
21     done
22 }
\end{verbatim}}
~\hrule~
\caption{An OURMINE worker function to cluster data using the K-means algorithm. Note that experiments using other clustering methods (such as GenIc and Canopy) can be conducted by calling line 16 above in much the same way, but with different flags to select the clusterer.}\label{fig:clusterworker}
\end{figure}

Similarly, the flags in line 16 can be changed to perform a different action, such as clustering using GenIc or Canopy (by changing {\em -k} to {\em -g} or {\em -c} respectively), or finding cluster similarities and purities (as described below) by using {\em -sim} and {\em -purity} as inputs. Since any number of variables can be set to represent different libraries elsewhere in OURMINE, the variable \verb|$Reducers| is used for the dimensionality reduction of the raw dataset, as seen in \fig{tfidfworker}, whose overall structure is very similar to that of \fig{clusterworker}.
\begin{figure}[h]
~\hrule~
{\scriptsize\begin{verbatim}
 1 reduceWorkerTfidf(){
 2     local datadir=$1
 3     local minN=$2
 4     local maxN=$3
 5     local incVal=$4
 6     local outdir=$5
 7     local runtimes=$outdir/tfidf_runtimes
 8     for((n=$minN;n<=$maxN;n+=$incVal)); do
 9         for file in $datadir/*.arff; do
10             out=`basename $file`
11             out=${out%.*}
12             dataset=$out
13             out=tfidf_n="$n"_$out.arff
14             echo $out
15             start=$(date +%s)
16             $Reducers -tfidf $file $n $outdir/$out
17             end=$(date +%s)
18             time=$((end - start))
19             echo "tfidf,$n,$dataset,$time" >> $runtimes
20         done
21     done
22 }
\end{verbatim}}
~\hrule~
\caption{An OURMINE worker function to reduce the data using TF-IDF.}\label{fig:tfidfworker}
\end{figure}

\subsection{Results}

To determine the overall benefits of each clustering method, this experiment used both cluster similarities and cluster purities.

\subsubsection{Purities}

When we cluster data whose class labels are available, we can determine cluster purity. A cluster is considered relatively ``pure'' if a high percentage of its points share the same class. Thus purity is a measure of the ``{\em quality} of a clustering solution''\footnote{\url{http://www.cse.iitm.ac.in/~cs672/purity.pdf}}. To determine the purity of an individual cluster, the equation
\begin{equation}
purity(C_a)=\frac{1}{|C_a|}\max_b(|C_a|_{cl=b})
\end{equation}
was used, where cluster $C_a$ contains $|C_a|_{cl=b}$ points assigned to class $b$. However, a high purity is easy to obtain with this equation alone: if a cluster contains only one instance from the dataset, its purity is always 1. Therefore, this study uses a weighted sum of these purities:
\begin{equation}
purity=\sum_{a=1}^k\frac{|C_a|}{|D|}purity(C_a)
\end{equation}
where $|D|$ denotes the number of documents in the dataset.
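The two purity equations above can be checked with a small sketch (illustrative Python; here {\em clusters} is simply a list of clusters, each given as the class labels of its members):

```python
# purity(Ca) = (1/|Ca|) * max_b |Ca|_{cl=b};  weighted sum over all clusters.
from collections import Counter

def purity(labels):
    # fraction of the cluster taken up by its most common class
    return Counter(labels).most_common(1)[0][1] / len(labels)

def weighted_purity(clusters):
    n = sum(len(c) for c in clusters)                      # |D|
    return sum(len(c) / n * purity(c) for c in clusters)   # sum |Ca|/|D| * purity

# a singleton cluster is trivially pure ...
print(purity(["politics"]))                               # 1.0
# ... so the weighted sum is used instead:
# (3/4)*(2/3) + (1/4)*1 = 0.75
print(round(weighted_purity([["sport", "sport", "tech"], ["politics"]]), 3))
```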
To determine cluster purities, the following text datasets were used in conjunction with their natural classes:
\begin{itemize}
\item{BBC: 5 natural classes (business, entertainment, politics, sport, tech)}
\item{BBCSPORT: 5 natural classes (athletics, cricket, football, rugby, tennis)\footnote{\url{http://mlg.ucd.ie/datasets/bbc.html}}}
\end{itemize}
The results of the cluster purity experiments are shown in \fig{pur}. Examining these results, we can see that the more complete method (in this case K-means) almost always yields a higher overall weighted purity per cluster. However, since each data set and its class distribution can drastically alter these values, it is sometimes beneficial to use fast heuristic methods, as seen in the results for the BBC data set.
\begin{figure}[h]
\begin{center}
\includegraphics[width=4.50in]{bbc_purities.pdf}
\includegraphics[width=4.50in]{bbcsport_purities.pdf}
\end{center}
\caption{Normalized weighted purity sums of clusters, separated by data set. Note that for the BBCSport data set, K-means' clusters clearly contain a higher weighted purity. For the BBC data, however, GenIc follows K-means closely, while Canopy remains relatively low, but steady, as a result of documents being shared by neighboring canopies.}\label{fig:pur}
\end{figure}

\subsubsection{Similarities}
Cluster similarities tell us how similar the points of a cluster are, either to each other ({\em intra}similarity) or to members of other clusters ({\em inter}similarity). The idea here is simple: gauge how well a clustering algorithm groups similar documents, and how well it separates different documents.
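These two averages can be computed with a short script. The sketch below is illustrative only: the function name {\em meanSims} and its input format, one ``similarity,sameCluster'' pair of documents per line (with {\em sameCluster} equal to 1 when both documents share a cluster and 0 otherwise), are assumptions, not part of OURMINE.

```shell
# Hypothetical helper: mean intra- and inter-cluster similarity from
# "similarity,sameCluster" document pairs on stdin.
meanSims(){
  awk -F, '
    $2 == 1 { intra += $1; ni++ }   # pair drawn from one cluster
    $2 == 0 { inter += $1; ne++ }   # pair spanning two clusters
    END { printf "intra=%.2f inter=%.2f\n", intra/ni, inter/ne }'
}
```

A good clustering drives the first average up and the second down, which is exactly the gap the {\em Gain} column of \fig{sims} summarizes.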
\begin{figure}[h]
\centering \small
\begin{tabular}{| c | c | c | c | c |}
\hline
Reducer and Clusterer & Time & InterSim & IntraSim & Gain \\ \hline
TF-IDF*K-means & 17.52 & -0.085 & 141.73 & 141.82 \\ \hline
TF-IDF*GenIc & 3.75 & -0.14 & 141.22 & 141.36 \\ \hline
PCA*K-means & 100.0 & 0.0 & 100.0 & 100.0 \\ \hline
PCA*Canopy & 117.49 & 0.00 & 99.87 & 99.87 \\ \hline
PCA*GenIc & 11.71 & -0.07 & 99.74 & 99.81 \\ \hline
TF-IDF*Canopy & 6.58 & 5.02 & 93.42 & 88.4 \\ \hline
\end{tabular}
\caption{Similarity values, normalized against the combination of the most rigorous reducer and clusterer. Note that most heuristic methods (such as TF-IDF*GenIc) either yield a larger gain than the thorough methods, or show negligible differences at a fraction of the execution time.}
\label{fig:sims}
\end{figure}

\section{Discussion}

\section{Conclusions}

\bibliographystyle{plain}
\bibliography{refs}

\appendix
\section*{Installing OURMINE}
OURMINE is an open source tool licensed under GPL 3.0. It can be downloaded from \url{http://unbox.org/wisp/trunk/our/INSTALL}. As OURMINE is a command-line tool, its system requirements are minimal. However, a few prerequisites must be met before installing OURMINE.
\bi
\item A Unix-based platform. OURMINE is designed to work in the BASH shell, and will not operate on a Windows system. If no other platform is available, a BASH emulator such as Cygwin will need to be installed before using OURMINE. Users running any Linux distribution, BSD, or Mac OS X can run OURMINE natively.
\item The Java Runtime Environment. Most computers will already have this installed. The ability to run Java programs is required for the WEKA learners.
\item The GAWK Programming Language. Many of the scripts within OURMINE are written in GAWK. Any up-to-date Linux system will already have GAWK installed. Cygwin and Mac OS X users will need to install it themselves.
\ei To install OURMINE, follow these instructions: \bi \item Go to a temporary directory \item wget -q -O INSTALL \newline http://unbox.org/wisp/trunk/our/INSTALL \item bash INSTALL \ei OURMINE is installed to the \$HOME/opt directory by default. To run OURMINE, simply move into that directory and type {\em bash our minerc}. This will launch the OURMINE shell. OURMINE has several demos included to familiarize you with the built-in functionality. These demos may be run by typing {\em demoX} into the command prompt, where {\em X} is a number between 3 and 19. To look at the source code for that demo, type {\em show demoX}. It is highly recommended that new users run each demo and take a close look at its source code. This {\em show} command works for any function in OURMINE, not just the demos. \begin{figure}[h] \small %\begin{center} \begin{tabular}{l@{~}|l@{~}|r@{~}r@{~}@{~}r@{~}|c} %\multicolumn{2}{c}{~}&\multicolumn{5}{c}{quartiles}\\\cline{3-5} %&min& & median & &max\\\cline{3-5} \multicolumn{2}{c}{~}&\multicolumn{3}{c}{pd}&2nd quartile\\ \multicolumn{2}{c}{~}&\multicolumn{3}{c}{percentiles}&median,\\\cline{3-5} Rank&Treatment& 25\%& 50\% & 75\%&3rd quartile\\\hline 1&WC kNN/nb & 66& 73& 80& \boxplot{0.0}{65.8}{72.9}{80.0}{2.8} \\ 2&CC kNN/nb & 57& 71& 83& \boxplot{0.0}{56.7}{70.6}{82.9}{4.8} \\ 2&WC nb & 59& 69& 83& \boxplot{0.0}{59.0}{69.4}{82.5}{6.8} \\ 3&CC nb & 49& 66& 87& \boxplot{0.0}{48.8}{65.8}{86.5}{7.5} \\\hline \multicolumn{5}{c}{~}&~~~~~0~~~~~~~~50~~~~100 \end{tabular} %\end{center} \caption{Probability of Detection (PD) results, sorted by median values. 
}\label{fig:pds} \end{figure} \begin{figure}[h] \small %\begin{center} \begin{tabular}{l@{~}|l@{~}|r@{~}r@{~}@{~}r@{~}|c} %\multicolumn{2}{c}{~}&\multicolumn{5}{c}{quartiles}\\\cline{3-5} %&min& & median & &max\\\cline{3-5} \multicolumn{2}{c}{~}&\multicolumn{3}{c}{pf}&2nd quartile\\ \multicolumn{2}{c}{~}&\multicolumn{3}{c}{percentiles}&median,\\\cline{3-5} Rank&Treatment& 25\%& 50\% & 75\%&3rd quartile\\\hline 1&CC kNN/nb & 17& 29& 43& \boxplot{0.0}{17.1}{29.1}{43.3}{2.8} \\ 1&CC nb & 13& 34& 51& \boxplot{0.0}{13.4}{34.1}{51.1}{4.8} \\ 2&WC nb & 17& 30& 41& \boxplot{0.0}{17.2}{30.4}{40.9}{6.8} \\ 3&WC kNN/nb & 20& 27& 34& \boxplot{0.0}{20.0}{27.0}{34.2}{7.5} \\\hline \multicolumn{5}{c}{~}&~~~~~0~~~~~~~~50~~~~100 \end{tabular} %\end{center} \caption{Probability of False Alarm (PF) results, sorted by median values. }\label{fig:pfs} \end{figure} %\begin{thebibliography}{10} %\end{thebibliography} \end{document}