Setup
-----

- Install weka from http://www.cs.waikato.ac.nz/ml/weka/
- Edit code/filter.pl and set the variable $javaClassPath at the top of the
  file to point to the installation location of weka.jar.

Running
-------

- The system is a set of scripts that are run using a make file. The typical
  sequence of commands is as follows:

    make clean-all
    make filter-data
    make resample-data
    make perform-test
    make class-stats
    make tex-output

- There is also a shell script called runTests.sh which puts the system
  through a number of runs using different thresholds. This is the script I
  used to produce the latest data sets for Tim. On my system this script
  takes about four hours to run.

Overview of the system:
-----------------------

- This system is complex and I will not attempt to describe everything. I
  anticipate that it will take time to understand the details and that you
  will have to ask me questions from time to time.
- "make clean-all"
    - This removes files from all data folders. Only do this if you are
      certain that you no longer need the data.
- "make filter-data"
    - This filters all files from the raw-data folder into the filtered-data
      folder.
    - In this process all low PD classes are removed from the data.
    - The script filter.pl contains the details of how this is done.
- "make resample-data"
    - This resamples the filtered data into the test-data folder.
    - Random sampling with replacement is used to produce 10 sets of test and
      training data for each filtered data file.
    - For each set of files, there is a pair of test and training files per
      class in the filtered data file.
    - The script resample.pl contains the details of how this is done.
- "make perform-test"
    - This runs the modified sawtooth.awk script on each of the resampled
      data pairs.
    - The script sawtooth.awk is for the most part not my code and I have not
      gone out of my way to clean it up. The majority of my work is within
      the "Pass==2" section.
    - The data produced is written to the test-results folder.
- "make class-stats"
    - This combines the test results for each data set together and writes
      the results to the final-stats folder.
    - There is one CSV per data set which summarizes the results of analysis
      for each class of the data set, and then the entire data set.
- "make tex-output"
    - This produces the final LaTeX output, suitable for inclusion in a TeX
      document.
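
Running the whole sequence in one go
------------------------------------

- The typical sequence of make commands listed under "Running" can be wrapped
  in a small shell function that stops at the first step that fails. The
  sketch below is not part of the system and is not runTests.sh. The function
  name run_pipeline and the MAKE override variable are made up for this
  example. It assumes it is run from the folder that contains the make file.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with the system.
# MAKE may be overridden by the caller (assumption of this sketch),
# for example MAKE="make -n" to print the commands without running them.
MAKE="${MAKE:-make}"

run_pipeline() {
    # Targets in the order given in this README.
    # Note: clean-all removes files from all data folders.
    for target in clean-all filter-data resample-data perform-test class-stats tex-output; do
        echo "==> make $target"
        # $MAKE is deliberately unquoted so that "make -n" splits into two words.
        $MAKE "$target" || {
            echo "make $target failed; stopping" >&2
            return 1
        }
    done
}
```

- To use it, source the file and call run_pipeline. Stopping at the first
  failure avoids running later steps on incomplete data from an earlier step.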