=head1 About this document

Usecases - for interaction with learners.

=head2 Motivation

Drew DeHaas asked for an overview of interactions between the WVU data
miners and the Grammatech systems.

=head2 Version History

=over 4

=item *

Oct 13: created by Tim Menzies, WVU

=back

=head1 Summary

At this stage, the space of possible interactions is still formative,
so this will be a living document (with many expected revisions).

=head2 Users

We foresee two kinds of users of this tool: sysadmin (S) and business
user (B). All the following headings are annotated with who we suspect
will want to access these tasks.

=head2 Tasks

The tool's tasks fall into several modes: I<commissioning>,
I<management>, and I<under-the-hood>. In summary:

=over 4

=item Commissioning mode

Initially, the system starts in I<commissioning> mode, then tries to
build itself up to some competency.

=item Management mode

In I<management> mode, the system has reached a useful level of
performance, and business users use it to make decisions about their
source code.

=item Under-the-hood

Meanwhile, I<under-the-hood> is a set of structures and services that
support commissioning and management.

=back

The following notes start with the I<under-the-hood> material, since
that defines a background setting for the rest.

=head1 Basics

Regardless of the data miner used in this work, much of the following
will be required.

=head2 Basic Data Creation [B,S]

Prior to importing data, someone specifies a space of independent
measures. Each measure must be identified as I<numeric> or I<discrete>
and be attached to some feature extractor. Similarly, users must
identify some dependent measure (e.g. number of bugs).

=head2 Basic Data Import [B,S]

Here, data from Grammatech's parsing tools are joined into one large
table with the following properties:

=over 4

=item *

Columns are marked I<numeric> or I<discrete>.

=item *

One column is marked as the I<class> (the dependent measure).
Currently, this column must be I<discrete>.

=item *

Each row contains I<cells> (the actual data) and a I<sort key>. Sort
keys are discussed below; see L</"Basic Testing [S,B]">.

=back

Once the data structure is defined, the data is imported. We foresee
at least three kinds of import:

=over 4

=item Bulk

Just a drop of all data; useful for testing the system.

=item Reference

Importing some reference data relating to systems with known
properties; e.g. very-buggy, bug-free, hard-to-integrate,
easy-to-integrate, etc.

=item Test

Importing some local data to be compared to the reference data. After
importing test data, users can see where they stand w.r.t. the
reference data.

=back

=head2 Basic Discretization

=head3 Dependent Variable Discretization [S]

Currently, all numeric dependent variables must be converted to
discrete values. There are many ways to do this. Given an array I<B>
of break points and an array I<C> of counts of the rows falling into
each break, then:

=over 4

=item *

I<Equal-width> discretization means all the breaks I<B[i+1] - B[i]>
are the same size.

=item *

I<Equal-frequency> discretization means all the counts I<C[i]> are the
same size.

=item *

Log all values (adding a very small number to the zero values), then
apply any of the above.

=item *

I<Banding> replaces class values with a few symbols. For example, all
rows with I<< bugs > 0 >> might be labeled I<buggy> and all other rows
I<bug-free>.

=back

As to the issue of how many divisions to make in a column, standard
practice is to set the number of bins equal to the square root of the
number of rows.

=head3 Independent Variable Discretization [S]

Currently, it is not clear if the independent measures will require
discretization. However, if they do, then all the above discretizers
might be required. Also, when discretizing independent measures, it
can be useful to sort all rows on that measure, then explore each
measure's value together with the I<class> value for that row. This is
called I<supervised discretization> (and the above discretizers are
I<unsupervised> since they make no reference to the class symbol).
There are many supervised discretizers, the standard one being to find
a value in a measure's column that divides the column into I<N1> rows
above and I<N2> rows below (a sketch appears after this list):

=over 4

=item *

If the frequency of the classes I<c1, c2, ...> in the I<N1> division
is I<f1, f2, ...>,

=item *

then the probability of the classes in that division is
I<p1 = f1/N1, p2 = f2/N1, ...>.

=item *

The number of bits required to encode that division is the standard
entropy calculation I<e1 = -(p1*log2(p1) + p2*log2(p2) + ...)>.

=item *

The best division minimizes the size-weighted sum of entropies
I<N1/(N1+N2)*e1 + N2/(N1+N2)*e2>.

=item *

The process is then called recursively on each division.

=back
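To make the above concrete, here is a minimal Common Lisp sketch of an
equal-frequency binner and the entropy-based split search. The
function names and the I<(value . class)> row representation are
illustrative assumptions, not taken from any existing system.

  (defun efd (column &optional (bins (isqrt (length column))))
    "Equal-frequency discretization: sort COLUMN, then cut it into
  BINS chunks of (roughly) equal size. Returns the break points. By
  default, BINS is the square root of the number of rows."
    (let* ((sorted (sort (copy-list column) #'<))
           (step   (max 1 (floor (length sorted) bins))))
      (loop for i from step below (length sorted) by step
            collect (nth i sorted))))

  (defun class-counts (pairs)
    "Frequencies of the class symbols in a list of (value . class)
  pairs."
    (let ((counts (make-hash-table)))
      (dolist (pair pairs)
        (incf (gethash (cdr pair) counts 0)))
      (loop for f being the hash-values of counts collect f)))

  (defun entropy (counts)
    "Bits needed to encode a division with the given class COUNTS."
    (let ((n (reduce #'+ counts)))
      (- (loop for f in counts
               unless (zerop f)
               sum (let ((p (/ f n))) (* p (log p 2)))))))

  (defun best-split (pairs)
    "Supervised discretization: find the value that divides PAIRS
  (a list of (value . class) conses, sorted here on value) into the
  two divisions minimizing the size-weighted sum of their entropies."
    (let* ((sorted (sort (copy-list pairs) #'< :key #'car))
           (n      (length sorted))
           best best-score)
      (loop for i from 1 below n
            for split = (car (nth i sorted))
            for e     = (+ (* (/ i n)
                              (entropy (class-counts (subseq sorted 0 i))))
                           (* (/ (- n i) n)
                              (entropy (class-counts (subseq sorted i)))))
            when (or (null best-score) (< e best-score))
              do (setf best split
                       best-score e))
      (values best best-score)))

A recursive discretizer would then call C<best-split> again within
each division.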
=head2 Basic Testing [S,B]

=head3 Cross-Validation

The sort key will be used for building stratified test sets. If each
class is given a unique integer value, then a row's sort key is that
integer value plus a random number in the range zero to 0.5. When
sorted on this key, all rows of the same class will be grouped
together, in some random order within each class. Hence, to build
N-way train:test sets:

=over 4

=item *

Sort on the I<sort key>;

=item *

For each row, in order, place every N-th row in the test set and all
other rows in the training set.

=back

The resulting train/test sets will have class distributions similar to
the original data sets.

=head3 Incremental Validation

Cross-validation is a batch testing procedure which is useful for
evaluating (say) different magic settings to a learner. However, it
does not emulate some user community working through their data in
some time series (e.g. when some new value arrives). To emulate that
process, the standard incremental rig is:

=over 4

=item *

Read the time series data in I<eras> of size I<N>. A special case is
I<N=1>; i.e. after every new example, the learners reflect over the
data.

=item *

Test I<era[i]> using the model built from I<eras 1..i-1> and update
any performance scores.

=item *

Then update the I<current> model using data from I<era[i]>.

=back

Note that this rig ensures that the performance scores are evaluated
on data not used during training. One interesting issue with
incremental validation is how to collect the data for I<the next era>.
We return to this point below; see L</"Selection Strategies">. Both
rigs are sketched below.
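Here is a minimal Common Lisp sketch of both rigs, assuming rows are
I<(class . data)> conses and that C<learn>, C<predict> and C<score>
are supplied by the caller; all names are illustrative.

  (defun sort-key (class classes)
    "A row's sort key: the class's unique integer plus a random
  number in [0, 0.5)."
    (+ (position class classes) (random 0.5)))

  (defun stratified-split (rows n)
    "Sort rows on their keys, then send every N-th row to the test
  set and the rest to the training set. Returns (values train test)."
    (let* ((classes   (remove-duplicates (mapcar #'car rows)))
           ;; decorate each row with its key once, so the random part
           ;; stays fixed while sorting
           (decorated (mapcar (lambda (row)
                                (cons (sort-key (car row) classes) row))
                              rows))
           (sorted    (mapcar #'cdr (sort decorated #'< :key #'car)))
           train test)
      (loop for row in sorted
            for i from 1
            if (zerop (mod i n))
              do (push row test)
            else
              do (push row train))
      (values (nreverse train) (nreverse test))))

  (defun incremental-rig (eras learn predict score)
    "Test era[i] with the model built from eras 1..i-1, then update
  the model with era[i]. LEARN, PREDICT and SCORE are caller-supplied
  functions."
    (let (model)
      (dolist (era eras)
        (when model                     ; nothing to test in era 1
          (dolist (row era)
            (funcall score (funcall predict model row) row)))
        (setf model (funcall learn model era)))))

For a full N-way cross-validation, C<stratified-split> would be
repeated N times, rotating which of the N-th rows form the test set.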
=head2 Basic Classification [S,B]

Classification means taking a row, temporarily forgetting its
I<actual> dependent measure, then asking a data miner to I<predict>
the dependent measure.

=head2 Basic Reporting [S,B]

This system maintains separate result statistics for every class in
the system:

  (defstruct result
    target                    ; name of class
    (a 0) (b 0) (c 0) (d 0)   ; basic counts
    acc pf prec pd f)         ; derived scores

Every time we have a new pair of I<(actual, predicted)> values, we
must update the performance statistics for each class:

  ;; classResults maps class names to result structs; ACTUAL and
  ;; PREDICTED are the current pair
  (loop for (className . resultStruct) in classResults
        do (with-slots (a b c d) resultStruct
             (if (eql actual className)
                 (if (eql predicted actual)    (incf d) (incf b))
                 (if (eql predicted className) (incf c) (incf a)))))

After making predictions for all rows, the statistics for all classes
are computed as follows:

  (dolist (result (results-structs-for-all-classes))
    (with-slots (a b c d acc pf prec pd f) result
      (let ((zip (float (expt 10 -16)))) ; stop divide-by-zero errors
        (setf acc  (/ (+ a d) (+ zip a b c d))   ; accuracy
              pf   (/ c (+ zip a c))             ; prob. false alarm
              prec (/ d (+ zip c d))             ; precision
              pd   (/ d (+ zip b d))             ; prob. detection
              f    (/ (* 2 prec pd)              ; f-measure
                      (+ zip prec pd))))))

=head1 Commissioning [S]

Recall from the above that the goal of commissioning is to bootstrap a
data miner into competency. In L</"Incremental Validation">, after
some work has updated the current model, more examples are gathered
for the next era's analysis. The trick in commissioning these systems
is to find the least number of most informative examples for the next
era that will most improve the system. This task is performed by the
I<selection strategies>, discussed next.

=head2 Selection Strategies

=head3 Straw Man (random) Selection

Everything must beat the straw man that selects randomly.

=head3 User-directed Selection

There are many user heuristics for selecting what might possibly be a
problematic example. The simplest of these comes from Gunes Koru
(UMBC) who argues for:

=over 4

=item Theory of Relative Defect Proneness (RDP)

In large-scale software systems, smaller modules will be
proportionally more defect prone compared to larger ones (see the
March/April 2010 issue of IEEE Software, pp. 81-89).

=item Theory of Relative Dependency (RD)

In large-scale software systems, smaller modules will be
proportionally more dependent compared to larger ones (see the
March/April 2009 issue of IEEE TSE, then read the paper in the Journal
of Empirical Software Engineering (ESE) published in their
September/October 2008 issue). Historical note: RD is an explanatory
theory for RDP and was uncovered later.

=back

(There is also another ESE paper coming up soon which validates the
Theory of RDP for closed-source software products. At this point, it
would not be inaccurate to state that there is compelling evidence to
support these theories.)

Based on RDP and RD, a simple user-directed strategy would be to sort
the examples in order of size, then explore them from smallest to
largest (see the sketch at the end of this document).

=head3 Informed Selection

Informed selection is controlled by the active learners. Details, TBD.

=head1 Monitoring [B]

=head2 Prediction

Finally, the system is running and users ask it "what modules should I
look at next?".

=head2 Repair

If the system starts getting it wrong (i.e. the modules predicted to
have errors actually do not have errors), then we switch back to
I<commissioning> mode.
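To close, here is a minimal Common Lisp sketch of the two selection
strategies above: the straw man picks randomly, while the
user-directed strategy follows RDP/RD and visits modules
smallest-first. C<module-loc> and the plist module representation are
assumptions for illustration only.

  (defun module-loc (module)
    "Assumed accessor: a module's size in lines of code. Replace this
  with whatever size measure the importer actually records."
    (getf module :loc))

  (defun straw-man-selection (modules budget)
    "The baseline every other strategy must beat: BUDGET modules
  chosen at random."
    (let ((decorated (mapcar (lambda (m) (cons (random 1.0) m))
                             modules)))
      (subseq (mapcar #'cdr (sort decorated #'< :key #'car))
              0 (min budget (length modules)))))

  (defun user-directed-selection (modules budget)
    "Per RDP/RD, the BUDGET smallest modules: smaller modules are
  proportionally more defect prone, so explore them first."
    (subseq (sort (copy-list modules) #'< :key #'module-loc)
            0 (min budget (length modules))))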