=head1 About this document

Usecases - for interaction with learners.

=head2 Motivation

Drew DeHaas asked for an overview of interactions between the WVU data
miners and the Grammatech systems.

=head2 Version History

=over 4

=item *

Oct 13: created by Tim Menzies, WVU

=back

=head1 Summary

At this stage, the space of possible interactions is still formative,
so this will be a living document (with many expected revisions).

=head2 Users

We foresee two kinds of users of this tool: sysadmin (S) and business
user (B). All the following headings are annotated with who we suspect
will want to access these tasks.

=head2 Tasks

The tool's tasks fall into several modes: I<commissioning>,
I<management>, and I<under-the-hood>. In summary:

=over 4

=item Commissioning mode

Initially, the system starts in I<commissioning> mode, then tries to
build itself up to some competency.

=item Management mode

In I<management> mode, the system has reached a useful level of
performance, and business users use it to make decisions about their
source code.

=item Under-the-hood

Meanwhile, I<under-the-hood> is a set of structures and services that
support commissioning and management.

=back

The following notes start with the I<under-the-hood> material, since
that defines a background setting for the rest.

=head1 Basics

Regardless of the data miner used in this work, much of the following
will be required.

=head2 Basic Data Creation [B,S]

Prior to importing data, someone specifies a space of independent
measures. Each measure must be identified as I<numeric> or I<discrete>
and be attached to some feature extractor. Similarly, users must
identify some dependent measure (e.g. number of bugs).

=head2 Basic Data Import [B,S]

Here, data from Grammatech's parsing tools are joined into one large
table with the following properties:

=over 4

=item *

Columns are marked I<numeric> or I<discrete>.

=item *

One column is marked as the I<class> (the dependent measure).
Currently, this column must be I<discrete>.

=item *

Each row contains I<cells> (the actual data) and a I<sort key>. Sort
keys are discussed below; see L</"Basic Testing [S,B]">.

=back

Once the data structure is defined, the data is imported. We foresee
at least three kinds of import:

=over 4

=item Bulk

Just a drop of all data; useful for testing the system.

=item Reference

Importing some reference data relating to systems with known
properties; e.g. very-buggy, bug-free, hard-to-integrate,
easy-to-integrate, etc.

=item Test

Importing some local data to be compared to the reference data. After
importing test data, users can see where they stand w.r.t. the
reference data.

=back

=head2 Basic Discretization

=head3 Dependent Variable Discretization [S]

Currently, all numeric dependent variables must be converted to
discrete values. There are many ways to do this. Given an array I<B>
of break points and an array I<C> of counts of the rows falling into
each break, then:

=over 4

=item *

I<Equal-width> discretization means all the breaks I<B[i+1] - B[i]>
are the same size.

=item *

I<Equal-frequency> discretization means all the counts I<C[i]> are the
same size.

=item *

Log all values (adding a very small number to the zero values), then
apply any of the above.

=item *

I<Banding> replaces class values with a few symbols. For example, all
rows with I<< bugs > 0 >> might be labeled I<buggy> and all other rows
I<bug-free>.

=back

As to the issue of how many divisions to make in a column, standard
practice is to set the number of bins equal to the square root of the
number of rows.

=head3 Independent Variable Discretization [S]

Currently, it is not clear if the independent measures will require
discretization. However, if they do, then all the above discretizers
might be required. Also, when discretizing independent measures, it
can be useful to sort all rows on that measure, then explore each
measure's value together with the I<class> value for that row. This is
called I<supervised discretization> (and the above discretizers are
I<unsupervised> since they make no reference to the class symbol).
There are many supervised discretizers, the standard one being to find
a value in a measure's column that divides the column into I<N1> rows
above and I<N2> rows below (a sketch appears after this list):

=over 4

=item *

If the frequency of the classes I<c1, c2, ...> in the I<N1> division
is I<f1, f2, ...>,

=item *

then the probability of the classes in that division is
I<p1 = f1/N1, p2 = f2/N1, ...>.

=item *

The number of bits required to encode that division is the standard
entropy calculation I<e1 = -(p1*log2(p1) + p2*log2(p2) + ...)>.

=item *

The best division minimizes the size-weighted sum of entropies
I<N1/(N1+N2)*e1 + N2/(N1+N2)*e2>.

=item *

The process is then called recursively on each division.

=back
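To make the above concrete, here is a minimal Common Lisp sketch of an
equal-frequency binner and the entropy-based split search. The
function names and the I<(value . class)> row representation are
illustrative assumptions, not taken from any existing system.

  (defun efd (column &optional (bins (isqrt (length column))))
    "Equal-frequency discretization: sort COLUMN, then cut it into
  BINS chunks of (roughly) equal size. Returns the break points. By
  default, BINS is the square root of the number of rows."
    (let* ((sorted (sort (copy-list column) #'<))
           (step   (max 1 (floor (length sorted) bins))))
      (loop for i from step below (length sorted) by step
            collect (nth i sorted))))

  (defun class-counts (pairs)
    "Frequencies of the class symbols in a list of (value . class)
  pairs."
    (let ((counts (make-hash-table)))
      (dolist (pair pairs)
        (incf (gethash (cdr pair) counts 0)))
      (loop for f being the hash-values of counts collect f)))

  (defun entropy (counts)
    "Bits needed to encode a division with the given class COUNTS."
    (let ((n (reduce #'+ counts)))
      (- (loop for f in counts
               unless (zerop f)
               sum (let ((p (/ f n))) (* p (log p 2)))))))

  (defun best-split (pairs)
    "Supervised discretization: find the value that divides PAIRS
  (a list of (value . class) conses, sorted here on value) into the
  two divisions minimizing the size-weighted sum of their entropies."
    (let* ((sorted (sort (copy-list pairs) #'< :key #'car))
           (n      (length sorted))
           best best-score)
      (loop for i from 1 below n
            for split = (car (nth i sorted))
            for e     = (+ (* (/ i n)
                              (entropy (class-counts (subseq sorted 0 i))))
                           (* (/ (- n i) n)
                              (entropy (class-counts (subseq sorted i)))))
            when (or (null best-score) (< e best-score))
              do (setf best split
                       best-score e))
      (values best best-score)))

A recursive discretizer would then call C<best-split> again within
each division.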
=head2 Basic Testing [S,B]

=head3 Cross-Validation

The sort key will be used for building stratified test sets. If each
class is given a unique integer value, then a row's sort key is that
integer value plus a random number in the range zero to 0.5. When
sorted on this key, all rows of the same class will be grouped
together, in some random order within each class. Hence, to build
N-way train:test sets:

=over 4

=item *

Sort on the I<sort key>;

=item *

For each row, in order, place every N-th row in the test set and all
other rows in the training set.

=back

The resulting train/test sets will have class distributions similar to
the original data sets.

=head3 Incremental Validation

Cross-validation is a batch testing procedure which is useful for
evaluating (say) different magic settings to a learner. However, it
does not emulate some user community working through their data in
some time series (e.g. when some new value arrives). To emulate that
process, the standard incremental rig is:

=over 4

=item *

Read the time series data in I<eras> of size I<N>. A special case is
I<N=1>; i.e. after every new example, the learners reflect over the
data.

=item *

Test I<era[i]> using the model built from I<eras 1..i-1> and update
any performance scores.

=item *

Then update the I<current> model using data from I<era[i]>.

=back

Note that this rig ensures that the performance scores are evaluated
on data not used during training. One interesting issue with
incremental validation is how to collect the data for I<the next era>.
We return to this point below; see L</"Selection Strategies">. Both
rigs are sketched below.
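Here is a minimal Common Lisp sketch of both rigs, assuming rows are
I<(class . data)> conses and that C<learn>, C<predict> and C<score>
are supplied by the caller; all names are illustrative.

  (defun sort-key (class classes)
    "A row's sort key: the class's unique integer plus a random
  number in [0, 0.5)."
    (+ (position class classes) (random 0.5)))

  (defun stratified-split (rows n)
    "Sort rows on their keys, then send every N-th row to the test
  set and the rest to the training set. Returns (values train test)."
    (let* ((classes   (remove-duplicates (mapcar #'car rows)))
           ;; decorate each row with its key once, so the random part
           ;; stays fixed while sorting
           (decorated (mapcar (lambda (row)
                                (cons (sort-key (car row) classes) row))
                              rows))
           (sorted    (mapcar #'cdr (sort decorated #'< :key #'car)))
           train test)
      (loop for row in sorted
            for i from 1
            if (zerop (mod i n))
              do (push row test)
            else
              do (push row train))
      (values (nreverse train) (nreverse test))))

  (defun incremental-rig (eras learn predict score)
    "Test era[i] with the model built from eras 1..i-1, then update
  the model with era[i]. LEARN, PREDICT and SCORE are caller-supplied
  functions."
    (let (model)
      (dolist (era eras)
        (when model                     ; nothing to test in era 1
          (dolist (row era)
            (funcall score (funcall predict model row) row)))
        (setf model (funcall learn model era)))))

For a full N-way cross-validation, C<stratified-split> would be
repeated N times, rotating which of the N-th rows form the test set.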
=head2 Basic Classification [S,B]

Classification means taking a row, temporarily forgetting its
I<actual> dependent measure, then asking a data miner to I<predict>
the dependent measure.

=head2 Basic Reporting [S,B]

This system maintains separate result statistics for every class in
the system:

  (defstruct result
    target                    ; name of class
    (a 0) (b 0) (c 0) (d 0)   ; basic counts
    acc pf prec pd f)         ; derived scores

Every time we have a new pair of I<(actual, predicted)> values, we
must update the performance statistics for each class:

  ;; classResults maps class names to result structs; ACTUAL and
  ;; PREDICTED are the current pair
  (loop for (className . resultStruct) in classResults
        do (with-slots (a b c d) resultStruct
             (if (eql actual className)
                 (if (eql predicted actual)    (incf d) (incf b))
                 (if (eql predicted className) (incf c) (incf a)))))

After making predictions for all rows, the statistics for all classes
are computed as follows:

  (dolist (result (results-structs-for-all-classes))
    (with-slots (a b c d acc pf prec pd f) result
      (let ((zip (float (expt 10 -16)))) ; stop divide-by-zero errors
        (setf acc  (/ (+ a d) (+ zip a b c d))   ; accuracy
              pf   (/ c (+ zip a c))             ; prob. false alarm
              prec (/ d (+ zip c d))             ; precision
              pd   (/ d (+ zip b d))             ; prob. detection
              f    (/ (* 2 prec pd)              ; f-measure
                      (+ zip prec pd))))))

=head1 Commissioning [S]

Recall from the above that the goal of commissioning is to bootstrap a
data miner into competency. In L</"Incremental Validation">, after
some work has updated the current model, more examples are gathered
for the next era's analysis. The trick in commissioning these systems
is to find the least number of most informative examples for the next
era that will most improve the system. This task is performed by the
I<selection strategies>, discussed next.

=head2 Selection Strategies

=head3 Straw Man (random) Selection

Everything must beat the straw man that selects randomly.

=head3 User-directed Selection

There are many user heuristics for selecting what might possibly be a
problematic example. The simplest of these comes from Gunes Koru
(UMBC) who argues for:

=over 4

=item Theory of Relative Defect Proneness (RDP)

In large-scale software systems, smaller modules will be
proportionally more defect prone compared to larger ones (see the
March/April 2010 issue of IEEE Software, pp. 81-89).

=item Theory of Relative Dependency (RD)

In large-scale software systems, smaller modules will be
proportionally more dependent compared to larger ones (see the
March/April 2009 issue of IEEE TSE, then read the paper in the Journal
of Empirical Software Engineering (ESE) published in their
September/October 2008 issue). Historical note: RD is an explanatory
theory for RDP and was uncovered later.

=back

(There is also another ESE paper coming up soon which validates the
Theory of RDP for closed-source software products. At this point, it
would not be inaccurate to state that there is compelling evidence to
support these theories.)

Based on RDP and RD, a simple user-directed strategy would be to sort
the examples in order of size, then explore them from smallest to
largest (see the sketch at the end of this document).

=head3 Informed Selection

Informed selection is controlled by the active learners. Details, TBD.

=head1 Monitoring [B]

=head2 Prediction

Finally, the system is running and users ask it "what modules should I
look at next?".

=head2 Repair

If the system starts getting it wrong (i.e. the modules predicted to
have errors actually do not have errors), then we switch back to
I<commissioning> mode.
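To close, here is a minimal Common Lisp sketch of the two selection
strategies above: the straw man picks randomly, while the
user-directed strategy follows RDP/RD and visits modules
smallest-first. C<module-loc> and the plist module representation are
assumptions for illustration only.

  (defun module-loc (module)
    "Assumed accessor: a module's size in lines of code. Replace this
  with whatever size measure the importer actually records."
    (getf module :loc))

  (defun straw-man-selection (modules budget)
    "The baseline every other strategy must beat: BUDGET modules
  chosen at random."
    (let ((decorated (mapcar (lambda (m) (cons (random 1.0) m))
                             modules)))
      (subseq (mapcar #'cdr (sort decorated #'< :key #'car))
              0 (min budget (length modules)))))

  (defun user-directed-selection (modules budget)
    "Per RDP/RD, the BUDGET smallest modules: smaller modules are
  proportionally more defect prone, so explore them first."
    (subseq (sort (copy-list modules) #'< :key #'module-loc)
            0 (min budget (length modules))))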