Data Mining Tut: MSR June'10


Tim Menzies
http://menzies.us

Beyond Mining Repositories

XXXX as with all things: Holte and Rams.; 1R vs C4.5; GAs vs TEAC; which vs other learners and the koru heuristics; tf-idf vs PCA; genic vs k-means; tar3 vs standard optimizers

Goals

Q1: So, now you can mine software repositories. What else can you do?

Q2: Is too much data mining a bad thing (torturing the data)?

Q3: Is data mining a complex task?

In this talk, we expand on A1, A2, A3, A4.

TOE = A Theory of Everything

Let T, G, and A be a theory, goals, and some assumptions.

Let there exist some predicate that can report inconsistent assumptions e.g.:

Find the assumptions that satisfy rules one and two:

  1. (do something)
  2. (don't screw up)
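
A minimal Python sketch of the search implied by the two rules, assuming rule one means "the theory plus the chosen assumptions reach the goals" and rule two means "the result passes the inconsistency predicate". The Horn-rule encoding and the names `RULES`, `NOGOODS`, `closure`, and `abduce` are hypothetical, chosen only for illustration:

```python
from itertools import combinations

# Hypothetical toy theory: each rule is (set of premises, conclusion).
RULES = [
    ({"a"}, "x"),
    ({"b"}, "x"),
    ({"x", "c"}, "goal"),
]
# Hypothetical inconsistency predicate: atoms that must not hold together.
NOGOODS = [{"a", "c"}]

def closure(facts, rules):
    """Forward-chain until no rule adds a new atom."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

def inconsistent(atoms, nogoods):
    """Report whether any forbidden combination is fully believed."""
    return any(nogood <= atoms for nogood in nogoods)

def abduce(rules, nogoods, goals, assumables):
    """Return every assumption set that satisfies rules one and two."""
    solutions = []
    for size in range(len(assumables) + 1):
        for picked in combinations(sorted(assumables), size):
            reached = closure(picked, rules)
            if goals <= reached and not inconsistent(reached, nogoods):
                solutions.append(set(picked))
    return solutions

solutions = abduce(RULES, NOGOODS, {"goal"}, {"a", "b", "c"})
```

This brute-force version tries every subset of the assumables, so its cost doubles with each extra assumption.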

What can we do with the above? Well....

Induction

If "" assert and retract parts of the theory [Hirata95, Flach97].

Deduction

Ignore rule two: when the assumptions are the known inputs, the goals are the reachable outputs condoned by the theory.
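
As a sketch, deduction can be read as forward chaining from the known inputs. The rule encoding and the names `THEORY` and `deduce` are assumptions made for this example, not part of the original framework:

```python
# Hypothetical toy theory: each rule is (set of premises, conclusion).
THEORY = [
    ({"rain"}, "wet_grass"),
    ({"sprinkler"}, "wet_grass"),
    ({"wet_grass"}, "slippery"),
]

def deduce(inputs, theory):
    """Return every atom reachable from the known inputs via the theory."""
    reached = set(inputs)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in theory:
            if premises <= reached and conclusion not in reached:
                reached.add(conclusion)
                changed = True
    return reached

outputs = deduce({"rain"}, THEORY)
```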

Abduction

Find all solutions to the above two equations and sort them into worlds of belief.

(BTW, [Poole94] calls these worlds "scenarios" while [DeKleer86] called them "envisonments" ).

The "best" worlds (the ones that we will use, somehow) are selected by some operator.

By selecting different operators, we can implement prediction, verification, classification, diagnosis, probing, planning, explanation, ...
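
One way to picture a pluggable selection operator, as a hedged sketch: worlds are plain sets of atoms and the operator is any scoring function. The names `best` and `WORLDS`, and the "prefer the smallest world" scorer, are hypothetical:

```python
def best(worlds, score):
    """Keep the world(s) with the highest score; ties are all returned."""
    if not worlds:
        return []
    top = max(score(w) for w in worlds)
    return [w for w in worlds if score(w) == top]

# Hypothetical worlds: each is a frozenset of believed atoms.
WORLDS = [
    frozenset({"a", "x", "goal"}),
    frozenset({"b", "c", "x", "goal"}),
    frozenset({"d", "goal"}),
]

# One hypothetical operator: prefer the worlds that believe the least.
smallest = best(WORLDS, score=lambda w: -len(w))
```

Swapping in a different `score` function changes the task without changing the search.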

Prediction

Prediction is deduction in each world.

Verification

Also, if the goals are known observations, then verification returns the world containing the greatest number of those observations [Menzies96].
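
A small sketch of verification under the assumption that a world is a set of atoms and "covering" an observation means containing it. The names `verify`, `WORLDS`, and `OBSERVED` are illustrative only:

```python
def verify(worlds, observations):
    """Return the world covering the most observations, plus what it covers."""
    chosen = max(worlds, key=lambda w: len(w & observations))
    return chosen, chosen & observations

# Hypothetical worlds and observations.
WORLDS = [
    {"rain", "wet_grass", "clouds"},
    {"sprinkler", "wet_grass"},
]
OBSERVED = {"wet_grass", "clouds", "rainbow"}

world, covered = verify(WORLDS, OBSERVED)
unexplained = OBSERVED - covered  # observations no chosen belief accounts for
```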

Classification

Classification is just prediction where some reserved atoms are tagged as classifications; i.e. it is just deduction in each world (runs fast, see above).

Diagnosis

Diagnosis is selecting the world(s) that include:

Probing

A related task to diagnosis is probing; i.e. the selective exploration of assumptions in order to rule out competing diagnoses. To implement this:
  1. Count how many times an assumption appears in various worlds;
  2. Query the assumptions that appear in the fewest worlds.
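
The two steps above can be sketched with a counter. The world contents and the names `probe_order` and `ASSUMPTIONS` are hypothetical; ties are broken alphabetically only to make the order repeatable:

```python
from collections import Counter

def probe_order(worlds, assumptions):
    """Rank assumptions so those appearing in the fewest worlds come first."""
    # Step 1: count how many worlds each assumption appears in.
    counts = Counter(a for w in worlds for a in w if a in assumptions)
    # Step 2: rarest first. Assumptions found in no world are omitted.
    return sorted(counts, key=lambda a: (counts[a], a))

# Hypothetical worlds and assumptions.
WORLDS = [{"a", "b"}, {"a", "c"}, {"a", "b", "d"}]
ASSUMPTIONS = {"a", "b", "c", "d"}

queries = probe_order(WORLDS, ASSUMPTIONS)
```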

Planning

If we have a cost predicate, then planning is favoring the worlds with the least cost.
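
A sketch of least-cost selection, assuming (this is not stated in the slides) that a world's cost is the sum of the costs of its atoms and that unlisted atoms cost nothing. `COST`, `world_cost`, and `plan` are hypothetical names:

```python
# Hypothetical per-atom costs; atoms not listed are treated as free.
COST = {"refactor": 5, "add_tests": 2, "rewrite": 20}

def world_cost(world, cost):
    """Additive cost of everything believed or done in this world."""
    return sum(cost.get(atom, 0) for atom in world)

def plan(worlds, cost):
    """Favor the world with the least total cost."""
    return min(worlds, key=lambda w: world_cost(w, cost))

WORLDS = [
    {"rewrite", "bug_fixed"},
    {"refactor", "add_tests", "bug_fixed"},
]
cheapest = plan(WORLDS, COST)
```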

Explanation

If we have a model listing concepts that have been seen before, then the best explanation is the one that uses the most familiar concepts.
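
As a rough sketch, assuming "most familiar" means the largest count of previously seen concepts in a world (other measures, such as a ratio, are equally plausible). The names `explain`, `FAMILIAR`, and `WORLDS` are illustrative:

```python
def explain(worlds, familiar):
    """Pick the world that overlaps most with the familiar concepts."""
    return max(worlds, key=lambda w: len(w & familiar))

# Hypothetical model of concepts the audience has seen before.
FAMILIAR = {"memory_leak", "gc_pause", "slow"}

WORLDS = [
    {"memory_leak", "gc_pause", "slow"},
    {"cosmic_ray", "slow"},
]
explanation = explain(WORLDS, FAMILIAR)
```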

More

For more on applications in this framework, see [Menzies96].

Warning: TOE is Slooooooow

We can make some comments about the complexity of the above:

TOE and Data Mining

The above is silent on how we find solutions to rules one and two. There are many ways to do this:

What does the above look like in a data mining framework?

Data Mining: A Bad Idea?

Data Mining

What is data mining? According to [Fayyad96], it looks like this:

In practice:

Important to review, not blindly accept, the learned theories: