Beyond Mining Repositories
(Note: as with all things, simpler methods often rival more complex ones: Holte's 1R vs C4.5, tf-idf vs PCA, GenIc vs k-means, TAR3 vs standard optimizers.)
Goals

Q1: So, now you can mine software repositories. What else can you do?
- A1: everything.
- Data mining is about discovery.
- Rational beings use discovery every day, in many ways.
- Study data mining to understand rationality (both its applications and its mistakes).
Q2: Is too much data mining a bad thing (torturing the data)?
- A2: Yes indeed, mistakes can be made
- But learning is hard: all learning agents (including humans) make mistakes.
- What we need are:
- Continuous verification methods for our learners.
- A community of learners, shedding insights on each other.
Q3: Is data mining a complex task?
- A3: Yup. Lots of ways to do it.
- A4: Nope. There are shortcuts.
In this talk, we expand on A1, A2, A3, A4.
TOE = A Theory of Everything
Let T, G, A be a theory, some goals, and some assumptions.
Let there exist some predicate ⊥ that can report inconsistent assumptions; e.g.:
- some domain-specific combinations of assumptions are declared "nogood" (they imply ⊥);
- or, perhaps more generally, x ∧ ¬x → ⊥.
Find the assumptions A that satisfy rules one and two:
- Rule one: T ∪ A ⊢ G (do something)
- Rule two: T ∪ A ⊬ ⊥ (don't screw up)
What can we do with the above? Well....
Induction
If "Deduction
Ignore rule2;
when the assumptions
are known inputs; then and the goals
are the reachable outputs condonned by the theory
.
Abduction
Find all solutions to the above two equations and sort them into worlds. Each world is a subset of the assumptions A such that:
- Each world W_i is consistent w.r.t. rule two;
- Each world W_i is maximal; i.e. it is not contained in any larger world satisfying rules one and two.
(BTW, [Poole94] calls these worlds "scenarios" while [DeKleer86] called them "envisionments".)
The "best" worlds (that we will use, somehow) are selected by some best operator;
- So W_best = best(W_1, ..., W_n).
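To make rules one and two concrete, here is a minimal sketch (the theory clauses, nogoods, and names below are invented for illustration, not taken from any cited system): a brute-force enumerator that, for a tiny propositional theory, returns the maximal assumption sets that reach the goals without triggering an inconsistency.

```python
from itertools import combinations

# A tiny propositional "theory": each clause maps a set of antecedents to a consequent.
# All names here are invented for illustration.
THEORY = [({"rain"}, "wet"), ({"sprinkler"}, "wet"), ({"wet"}, "slippery")]
GOALS = {"slippery"}
ASSUMABLES = ["rain", "sprinkler", "drought"]
NOGOODS = [{"rain", "drought"}]   # rule two: these combinations are inconsistent

def closure(facts):
    """Forward-chain the theory over a set of facts until fixpoint (deduction)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for pre, post in THEORY:
            if pre <= facts and post not in facts:
                facts.add(post)
                changed = True
    return facts

def consistent(facts):
    """Rule two: no nogood set may be fully contained in the derived facts."""
    return not any(bad <= facts for bad in NOGOODS)

def worlds():
    """Rules one and two: maximal assumption sets that reach the goals consistently."""
    found = []
    for n in range(len(ASSUMABLES), 0, -1):          # largest subsets first
        for subset in combinations(ASSUMABLES, n):
            derived = closure(subset)
            if GOALS <= derived and consistent(derived):
                # keep only maximal worlds: skip subsets of worlds already found
                if not any(set(subset) <= w for w in found):
                    found.append(set(subset))
    return found

print(worlds())   # e.g. [{'rain', 'sprinkler'}, {'sprinkler', 'drought'}]
```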
Prediction
Prediction is deduction in each world.
Verification
Also, if the goals G record previously observed behavior, then verification is checking how much of G can be reproduced in the generated worlds.
Classification
Classification is just prediction where G contains reserved atoms tagged as classifications; i.e. just deduction in each world (runs fast, see above).
Diagnosis
Diagnosis is selecting the world(s) that includes:
- all the observed faults (set-covering diagnosis)
- or the worlds consistent with the what ever contradicts the observed faults (consistency-based diagnosis) [Console91].
Probing
A related task to diagnosis is probing; i.e. the selective exploration of assumptions in order to rule out competing diagnoses. To implement this:
- Count how many times an assumption appears in the various worlds;
- Query the assumptions that appear in the fewest worlds.
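Continuing the sketch above (worlds represented as plain Python sets; all names still illustrative), probing can be implemented by counting how often each assumption appears across the worlds and querying the rarest first:

```python
from collections import Counter

def probe_order(worlds):
    """Rank assumptions for probing: those appearing in the fewest worlds come first."""
    counts = Counter(a for w in worlds for a in w)
    return sorted(counts, key=lambda a: counts[a])

# e.g. probe_order([{"rain", "sprinkler"}, {"sprinkler", "drought"}])
# -> ['rain', 'drought', 'sprinkler']
```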
Planning
If we have a cost predicate, then planning is favoring the worlds with the least cost.
Explanation
If we have a model listing concepts that have been seen before, then the best explanation is the one that uses the most familiar concepts; i.e. the one that maximizes the overlap between a world and that model.
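A small hedged sketch of those last two ideas, with per-assumption costs and a "seen before" set that are purely illustrative: planning prefers the cheapest world, explanation prefers the most familiar one.

```python
# Hypothetical per-assumption costs and a hypothetical "seen before" model.
COSTS = {"rain": 0, "sprinkler": 5, "drought": 1}
SEEN_BEFORE = {"rain", "wet"}

def plan(worlds):
    """Planning: favor the world with the least total cost."""
    return min(worlds, key=lambda w: sum(COSTS.get(a, 0) for a in w))

def explain(worlds):
    """Explanation: favor the world that uses the most familiar concepts."""
    return max(worlds, key=lambda w: len(w & SEEN_BEFORE))

ws = [{"rain", "sprinkler"}, {"sprinkler", "drought"}]
print(plan(ws), explain(ws))   # both pick {'rain', 'sprinkler'} in this toy example
```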
More
For more on applications in this framework, see [Menzies96].
Warning: TOE is Slooooooow
We can make some comments about the complexity of the above:
- If we ignore rule two (i.e. what can go wrong), then we can generate conclusions very fast. For example, deduction (that ignores rule two) runs in linear time [Dowling84].
- In general abduction is NP-hard [Bylander91].
- Certain restrictive specializations run much faster [Zanuttini03, Menzies02].
- However exponential runtimes curse many implementations; e.g. [Menzies99].
TOE and Data Mining
The above is silent on how we find solutions to rules one and two. There are many ways to do this:
- Backward chaining (as done in THEORIST);
- Forward chaining (as done in the ATMS);
- A constraint solver [Kakas01];
- etc
What does the above look like in a data mining framework?
- Take a theory T;
- Exercise the theory to generate data;
- We will assume that some subset of the data has been passed through some scoring function (and if that function returns discrete values, we will call that "score" a "classification");
- Apply a data miner to learn the assumptions A.
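One hedged way to picture that loop (the toy theory, scoring function, and choice of learner below are illustrative assumptions, not prescriptions): exercise a tiny model under random inputs, score each run, then hand the (inputs, score) table to an off-the-shelf learner such as scikit-learn's DecisionTreeClassifier.

```python
import random
from sklearn.tree import DecisionTreeClassifier, export_text

random.seed(1)

def theory(rain, sprinkler):
    """A toy 'theory' to exercise: the grass gets wet if it rains or the sprinkler runs."""
    return rain or sprinkler

def score(wet):
    """A scoring function; its output is discrete, so we treat it as a classification."""
    return "slippery" if wet else "ok"

# Exercise the theory to generate data.
rows, labels = [], []
for _ in range(100):
    rain, sprinkler = random.random() < 0.3, random.random() < 0.5
    rows.append([int(rain), int(sprinkler)])
    labels.append(score(theory(rain, sprinkler)))

# Apply a data miner to learn the assumptions behind each classification.
tree = DecisionTreeClassifier().fit(rows, labels)
print(export_text(tree, feature_names=["rain", "sprinkler"]))
```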
Data Mining: A Bad Idea?
Data Mining
What is data mining? According to [Fayyad96], it is a pipeline running from data selection, through preprocessing and transformation, to data mining proper and then interpretation/evaluation of the results.
In practice:
- Most of "data mining" is actually "data preprocessing" (hours to days of scripting- learn Bash and Awk/Perl/Python);
- The proces is cyclic. Results, version1, leads to "that's funny" followed by version2, followed by...
It is important to review, not blindly accept, the learned theories:
- Toss a coin: with probability (1/2)^N, the result of one toss is repeated in the next N tosses. So if you toss a coin 100 times, we would expect at least one run of six identical results.
- [Courtney85] looked at randomly generated numbers in the ranges that would be expected for Halstead's software science measures.
- The average maximum linear correlation between such randomly generated numbers is 0.70 (or higher if the sample size is low compared to the number of variables).
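This shotgun-correlation effect is easy to reproduce. A minimal sketch (the sample size and variable count below are arbitrary choices, not those of [Courtney85]): generate purely random variables and report the largest pairwise correlation found.

```python
import numpy as np

rng = np.random.default_rng(0)
samples, variables = 30, 20                 # small sample, many variables
data = rng.random((samples, variables))

# Largest absolute correlation among all distinct pairs of random variables.
corr = np.corrcoef(data, rowvar=False)
best = max(abs(corr[i, j]) for i in range(variables) for j in range(i + 1, variables))
print(f"max |r| among random variables: {best:.2f}")   # often surprisingly high
```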
Discretization
Group together numbers in a continuous range into a smaller number of discrete bins [41]. In other words, discretization reduces the number of distinct values seen for each term.
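For example, a minimal equal-width discretizer (the bin count is an arbitrary choice):

```python
def discretize(values, bins=4):
    """Equal-width discretization: map each number to one of `bins` integer labels."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1          # avoid division by zero on constant columns
    return [min(int((v - lo) / width), bins - 1) for v in values]

print(discretize([1.0, 2.5, 3.7, 9.9, 10.0]))   # [0, 0, 1, 3, 3]
```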
Rotation
A matrix T*D has axes along T and D. New axes can be inferred using a variety of techniques including Principal Component Analysis (PCA) [86], Latent Dirichlet Allocation (LDA) [14], and FASTMAP [37]. Geometrically, these new axes fall at some rotation to the original axes. For example, PCA is a vector space transform often used to reduce a multidimensional data set to lower dimensions for analysis. PCA involves the calculation of the eigenvalue decomposition of a data covariance matrix or a singular value decomposition of, say, a T*D matrix.
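A minimal sketch of such a rotation, computing PCA via the SVD of a centered matrix (the matrix shape here is illustrative):

```python
import numpy as np

def pca(X, k=2):
    """Rotate data onto its top-k principal components via SVD of the centered matrix."""
    Xc = X - X.mean(axis=0)                      # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # project rows onto the first k new axes

X = np.random.default_rng(0).random((10, 5))    # e.g. 10 documents x 5 terms
print(pca(X, k=2).shape)                         # (10, 2)
```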
- (a) A two-dimensional training set with positive examples as black circles and negative examples as white circles. The decision boundary is also shown.
- (b) The same data after mapping into a three-dimensional input space (x1², x2², √2·x1·x2). Note that the circular decision boundary in (a) becomes a linear decision boundary in the three dimensions of (b). From [Russell10].
Projection
In SQL, a project query returns a subset of the columns. In text/data mining, projection operators delete the least informative terms; i.e., projection reduces a T1*D matrix to a T2*D matrix where T2 ⊂ T1. A common technique relies on the Singular Value Decomposition (SVD) [101] of the original matrix. Projection is often a batch process, applied once before any queries are processed [44, 51]. For text mining applications, projection is an essential pre-processing step that precedes all the other processing since, without projection, we must reason over an intractable number of terms. (Hard: kernel methods; easier: tf-idf.)
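For instance, a hedged sketch of a tf-idf style projection: score every term, keep only the top-scoring terms, and drop the rest (the toy corpus and cut-off below are invented):

```python
import math
from collections import Counter

docs = [["bug", "fix", "null"], ["bug", "crash"], ["docs", "typo", "fix"]]

def tfidf_scores(docs):
    """Score each term by its summed tf-idf across the corpus."""
    df = Counter(t for d in docs for t in set(d))        # document frequency per term
    scores = Counter()
    for d in docs:
        tf = Counter(d)
        for t, n in tf.items():
            scores[t] += (n / len(d)) * math.log(len(docs) / df[t])
    return scores

# Projection: keep only the 3 most informative terms (T2 < T1), discard the rest.
keep = {t for t, _ in tfidf_scores(docs).most_common(3)}
projected = [[t for t in d if t in keep] for d in docs]
print(keep, projected)
```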
Selection
In SQL, a select query returns a subset of the rows. In text/data mining, select operators select the subset of the data relevant to some current query. That is, selection reduces a T*D1 matrix to a T*D2 matrix where D2 ⊂ D1. Selection is very useful for removing outliers and anomalous examples from a reasoning process. Unlike projection, the selection operator may have to be repeated for each new query and so, usually, it can significantly slow down query processing. However, selection can also be applied as a batch task; for example, discarding excessive examples of some over-abundant dependent term. This is a useful pre-processor since, as Drummond and Holte report [31], it allows us to learn predictors for rare dependent terms which, otherwise, would be lost in a crowd of other dependent terms. (Hard: k-means; easier: GenIc, discretization.)
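A minimal sketch of one such batch selection step: randomly discard rows of over-abundant classes so that rare classes are not drowned out (the class column layout and the cap are invented for illustration):

```python
import random
from collections import defaultdict

random.seed(0)

def undersample(rows, label_index=-1, cap=100):
    """Keep at most `cap` rows per class, discarding excess examples of abundant classes."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_index]].append(row)
    kept = []
    for label, members in by_class.items():
        random.shuffle(members)
        kept.extend(members[:cap])
    return kept

rows = [[i, "common"] for i in range(500)] + [[i, "rare"] for i in range(7)]
print(len(undersample(rows, cap=100)))   # 107: 100 "common" rows plus all 7 "rare" rows
```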