Use cases - for interaction with learners
Drew DeHaas asked for an overview of interactions between the WVU data miners and the Grammatech systems.
Oct 13: created by Tim Menzies, WVU
At this stage, the space of possible interactions is still formative. So this will be a living document (with many expected revisions).
We foresee two kinds of users of this tool: sysadmin (S) and business user (B). All the following headings are annotated with who we suspect will want to access these tasks.
The tool's tasks fall into several modes: commissioning, management, and low-level. In summary:
Initially, the system starts in commissioning mode, then tries to build itself up to some competency.
In management mode, the system has reached a useful level of performance, and business users use it to make decisions about their source code.
Meanwhile, under the hood, there is a set of structures and services that support commissioning and management.
The following notes will start with the under-the-hood material, since that defines the background setting for the rest.
Regardless of the data miner used in this work, much of the following will be required.
Prior to importing data, someone specifies a space of independent measures. Each measure must be identified as discrete or numeric and be attached to some feature extractor.
Similarly, users must identify some dependent measure (e.g. number of bugs).
Here, data from Grammatech's parsing tools are joined into one large table with the following properties:
Columns are marked discrete or numeric.
One column is marked as the goal. Currently, this column must be discrete.
Each row contains cells (the actual data) and a sort key. Sort keys are discussed below; see cross-val.
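For concreteness, a minimal sketch of such a table in Lisp might look as follows (the struct and slot names here are illustrative only, not a fixed schema):

  (defstruct col
    name            ; column header
    (kind 'numeric) ; 'numeric or 'discrete
    goalp)          ; t if this is the (discrete) goal column

  (defstruct row
    cells           ; the actual data, one cell per column
    sortkey)        ; used for stratified sampling (see cross-val, below)

  (defstruct table
    cols            ; list of col structs
    rows)           ; list of row structs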
Once the data structure is defined, the data is imported. We foresee at least two kinds of import:
Just a dump of all the data; useful for testing the system.
Importing some reference data relating to systems with known properties; e.g. very-buggy, bug-free, hard-to-integrate, easy-to-integrate, etc.
Importing some local data to be compared to the reference data. After importing test data, users can see where they stand w.r.t. the reference data.
Currently, all numeric dependent variables must be converted to discrete values. There are many ways to do this. Given an array B of break points and an array C of counts in each break, then:
Equal Frequency Discretization means C[i]=C[i+1].
Equal Width Discretization means B[i+1] - B[i] = B[j+1] - B[j].
Log all values (adding a very small number to the zero values), then apply any of the above.
Labeling replaces class values with a few symbols. For example, all rows with bugs=0 might be labeled good and all other rows bad.
As to the issue of how many divisions to make in a column, standard practice is to set the number of bins equal to the square root of the number of rows.
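As a sketch (assuming the column's values arrive as a simple list of numbers; the function names below are made up for illustration), equal frequency and equal width binning might look like:

  ;; Equal Frequency: sort the values, then cut them into bins
  ;; holding (roughly) the same number of items.
  (defun equal-frequency-bins (values nbins)
    (let* ((sorted (sort (copy-list values) #'<))
           (per    (ceiling (length sorted) nbins))
           (bins   '()))
      (loop while sorted
            do (push (subseq sorted 0 (min per (length sorted))) bins)
               (setf sorted (nthcdr per sorted)))
      (nreverse bins)))

  ;; Equal Width: divide the range [min,max] into nbins equal-sized
  ;; intervals and return the break points.
  (defun equal-width-breaks (values nbins)
    (let* ((lo    (reduce #'min values))
           (hi    (reduce #'max values))
           (width (/ (- hi lo) nbins)))
      (loop for i from 1 to nbins collect (+ lo (* i width)))))

  ;; Standard practice for the number of bins:
  ;; (equal-frequency-bins column (isqrt (length column)))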
Currently, it is not clear if the independent measures will require discretization. However, if they do, then all the above discretizers might be required.
Also, when discretizing independent measures, it can be useful to sort all rows on that measure, then explore each measure's value alongside the class symbol for that row. This is called supervised discretization (the above discretizers are unsupervised since they make no reference to the class symbol).
There are many supervised discretizers, the standard one being to find a value in a measure's column that divides the column into N1 rows above and N2 rows below.
If the frequency of the classes a,b,c,.. in the N1 division is F(a),F(b),F(c),...
Then the probability of classes in that division is P(x)=F(x)/N1.
The number of bits required to encode that division is the standard entropy calculation E1 = sum(-1*P(x)*log2(P(x))).
The best division minimizes E1*N1 + E2*N2.
The process is then called recursively on each division.
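A sketch of the core step (finding a single cut; the recursion is omitted), assuming the rows arrive as (value . class) pairs already sorted on value, and using illustrative function names:

  ;; Entropy (in bits) of a list of class symbols.
  (defun entropy (classes)
    (let ((n      (length classes))
          (counts (make-hash-table)))
      (dolist (c classes) (incf (gethash c counts 0)))
      (loop for f being the hash-values of counts
            sum (let ((p (/ f n))) (* -1 p (log p 2))))))

  ;; Find the cut value that minimizes E1*N1 + E2*N2 over the classes
  ;; below and above the cut.
  (defun best-split (pairs) ; pairs = ((value . class) ...) sorted on value
    (let ((n (length pairs)) best best-score)
      (loop for i from 1 below n
            for below = (mapcar #'cdr (subseq pairs 0 i))
            for above = (mapcar #'cdr (subseq pairs i))
            for score = (+ (* i       (entropy below))
                           (* (- n i) (entropy above)))
            when (or (null best-score) (< score best-score))
              do (setf best-score score
                       best       (car (nth i pairs))))
      best))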
The sort key will be used for building stratified test sets. If each class is given a unique integer value, then a row's sort key is that integer value plus a random number in the range zero to 0.5. When sorted on this key, all things of the same class will be sorted together, in some random order. Hence, to build N-way train:test sets:
Sort on the sort key;
For each row, in order, place every N-th row in the test set and all other rows in the training set.
The resulting train/test sets will have class distributions similar to the original data sets.
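A sketch of that procedure (class-of stands for whatever accessor returns the class symbol of a row; all names here are illustrative):

  (defun stratified-split (rows class-of n)
    ;; Returns (values train test), dealing every n-th row to the test set.
    (let ((indexes (make-hash-table)) (next -1) train test)
      ;; give each class a unique integer
      (dolist (row rows)
        (let ((c (funcall class-of row)))
          (unless (gethash c indexes)
            (setf (gethash c indexes) (incf next)))))
      ;; sort key = class integer + a random number in [0, 0.5)
      (let* ((keyed  (mapcar (lambda (row)
                               (cons (+ (gethash (funcall class-of row) indexes)
                                        (random 0.5))
                                     row))
                             rows))
             (sorted (mapcar #'cdr (sort keyed #'< :key #'car))))
        (loop for row in sorted
              for i from 1
              if (zerop (mod i n))
                do (push row test)
              else
                do (push row train)))
      (values train test)))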
Cross-val is a batch testing procedure which is useful for evaluating (say) different magic settings of a learner. However, it does not emulate a user community working through their data as a time series (e.g. reacting as each new example arrives).
To emulate that process, the standard incremental rig is:
Read the time series data in eras of size E. A special case is E=1; i.e. after every new example, the learners reflect over the data.
Test era[i] using the model built from era[0...i-1] and update any performance scores.
Then update the model using the data from era[i].
Note that this test ensures that the performance scores are evaluated on data not used during training.
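A sketch of that rig (train-model, test-model, and update-scores are placeholders for whatever learner and scoring code gets plugged in; they are not existing functions):

  ;; eras is a list of eras, oldest first; each era is a list of examples.
  (defun incremental-rig (eras)
    (let ((model nil) (scores nil))
      (dolist (era eras (values model scores))
        (when model                       ; test on era[i] before training on it
          (setf scores (update-scores scores (test-model model era))))
        (setf model (train-model model era)))))   ; then learn from era[i]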
One interesting issue with incremental validation is how to collect the data for era[i+1]. We return to this point below; see "Selection Strategies".
Classification means taking a row, temporarily forgetting its actual dependent measure, then asking a data miner to predict that measure.
This system maintains separate results statistics for every class in the system.
  (defstruct result
    target                    ; name of class
    (a 0) (b 0) (c 0) (d 0)   ; basic counts
    acc pf prec pd f)         ; derived counts
Every time we have a new (actual, predicted) pair, we must update the performance statistics for each class.
  ;; classResults maps each class name to its result struct (here, an alist)
  (loop for (className . resultsStructForClass) in classResults
        do (with-slots (a b c d) resultsStructForClass
             (if (eql actual className)
                 (if (eql predicted actual)    (incf d) (incf b))    ; hit / miss
                 (if (eql predicted className) (incf c) (incf a))))) ; false alarm / correct rejection
After making predictions for all rows, the statistics for all classes are computed as follows:
  (dolist (result (results-structs-for-all-classes))
    (with-slots (a b c d acc pf prec pd f) result
      (let ((zip (float (expt 10 -16)))) ; stop divide-by-zero errors
        (setf acc  (/ (+ a d) (+ zip a b c d))      ; accuracy
              pf   (/ c (+ zip a c))                ; prob. false alarm
              prec (/ d (+ zip c d))                ; precision
              pd   (/ d (+ zip b d))                ; prob. detection
              f    (/ (* 2 prec pd) (+ zip prec pd)))))) ; f-measure
Recall from the above that the goal of commissioning is to bootstrap a data miner into competency.
In incremental validation, after some work has updated the current model, more examples are gathered for the next era's analysis. The trick in commissioning these systems is to find the least number of most informative examples for the next era that will most improve the system. This task is performed by the selection strategies.
Everything must beat the straw man that selects randomly.
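For example, the straw man might be nothing more than this sketch (pool is the set of unexplored examples, k the number to select; both names are made up):

  ;; Straw man: pick k examples at random from the unexplored pool.
  (defun random-selection (pool k)
    (let ((copy (copy-list pool)) picked)
      (dotimes (i (min k (length copy)) picked)
        (let ((one (nth (random (length copy)) copy)))
          (push one picked)
          (setf copy (remove one copy :count 1))))))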
There are many user heuristics for selecting what might possibly be a problematic example. The simplest of these comes from Gunes Koru (UMBC) who argues for:
In large-scale software systems, smaller modules will be proportionally more defect prone compared to larger ones (RDP; see the March/April 2010 issue of IEEE Software, pp. 81-89).
In large-scale software systems, smaller modules will be proportionally more dependent compared to larger ones (RD; see the March/April 2009 issue of IEEE TSE, then read the paper in the Journal of Empirical Software Engineering (ESE) published in their September/October 2008 issue). Historical note: RD is an explanatory theory for RDP and was uncovered later.
(There is also another ESE paper coming up soon which validates the Theory of RDP for closed-source software products. At this point, it would not be inaccurate to state that there is compelling evidence to support the theories.)
Based on RDP and RD, a simple user-directed strategy would be to sort the examples in order of size, then explore them in the order smallest to largest.
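A sketch of that strategy (size-of stands in for whatever accessor returns a module's size, e.g. lines of code):

  ;; RDP/RD-inspired selection: offer the smallest unexplored modules first.
  (defun smallest-first-selection (pool k size-of)
    (let ((ordered (sort (copy-list pool) #'< :key size-of)))
      (subseq ordered 0 (min k (length ordered)))))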
Informed selection is controlled by the active learners. Details, TBD.
Finally, the system is running and users ask it "What modules should I look at next?"
If the system starts getting it wrong (i.e. the modules predicted to have errors actually do not have errors), then we switch back to commissioning mode.
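One simple way to operationalize that check, reusing the result struct from above (the threshold here is arbitrary and would need tuning):

  ;; If precision over the last era drops below some threshold (i.e. too many
  ;; modules predicted to be buggy turn out to be clean), return to commissioning.
  (defun mode-check (last-era-result &key (threshold 0.5))
    (if (< (result-prec last-era-result) threshold)
        'commissioning
        'management))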