A Bayes classifier is a simple, statistics-based learning scheme.
Advantages: it is fast to train and to classify (a single pass over the data collects all the frequency counts it needs), and in practice it often performs surprisingly well.
Assumptions: all attributes are equally important, and the attributes are statistically independent given the class value. The independence assumption is almost never strictly true, but the scheme still works well.
It has some drawbacks: it can offer conclusions, but it is poor at explaining how those conclusions were reached. That is something we'll get back to below.
weather.symbolic.arff
outlook  temperature humidity windy play
-------- ----------- -------- ----- ----
rainy    cool        normal   TRUE  no
rainy    mild        high     TRUE  no
sunny    hot         high     FALSE no
sunny    hot         high     TRUE  no
sunny    mild        high     FALSE no
overcast cool        normal   TRUE  yes
overcast hot         high     FALSE yes
overcast hot         normal   FALSE yes
overcast mild        high     TRUE  yes
rainy    cool        normal   FALSE yes
rainy    mild        high     FALSE yes
rainy    mild        normal   FALSE yes
sunny    cool        normal   FALSE yes
sunny    mild        normal   TRUE  yes
This data can be summarized as follows:
Outlook              Temperature        Humidity
==================   ================   =================
         Yes   No            Yes  No            Yes   No
Sunny     2     3    Hot      2    2    High      3     4
Overcast  4     0    Mild     4    2    Normal    6     1
Rainy     3     2    Cool     3    1
------------------   ----------------   -----------------
Sunny    2/9   3/5   Hot     2/9  2/5   High     3/9   4/5
Overcast 4/9   0/5   Mild    4/9  2/5   Normal   6/9   1/5
Rainy    3/9   2/5   Cool    3/9  1/5
Windy                Play
=================    ==========
        Yes   No     Yes    No
False    6     2      9      5
True     3     3
-----------------    ----------
False   6/9   2/5    9/14   5/14
True    3/9   3/5
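Those counts are easy to compute mechanically. Here is a short Python sketch (the names data, counts, and class_counts are mine, not part of the notes) that tallies how often each attribute value co-occurs with each class:

    from collections import defaultdict

    # The 14 (outlook, temperature, humidity, windy, play) rows from weather.symbolic.arff
    data = [
        ("rainy","cool","normal","TRUE","no"),     ("rainy","mild","high","TRUE","no"),
        ("sunny","hot","high","FALSE","no"),       ("sunny","hot","high","TRUE","no"),
        ("sunny","mild","high","FALSE","no"),      ("overcast","cool","normal","TRUE","yes"),
        ("overcast","hot","high","FALSE","yes"),   ("overcast","hot","normal","FALSE","yes"),
        ("overcast","mild","high","TRUE","yes"),   ("rainy","cool","normal","FALSE","yes"),
        ("rainy","mild","high","FALSE","yes"),     ("rainy","mild","normal","FALSE","yes"),
        ("sunny","cool","normal","FALSE","yes"),   ("sunny","mild","normal","TRUE","yes"),
    ]
    attributes = ["outlook", "temperature", "humidity", "windy"]

    class_counts = defaultdict(int)           # how many "yes" days, how many "no" days
    counts = defaultdict(int)                 # counts[(attribute, value, class)]
    for *values, play in data:
        class_counts[play] += 1
        for attr, value in zip(attributes, values):
            counts[(attr, value, play)] += 1

    print(dict(class_counts))                       # {'no': 5, 'yes': 9}
    print(counts[("outlook", "sunny", "yes")],      # 2 ...
          counts[("outlook", "sunny", "no")])       # ... and 3, matching the table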
So, what happens on a new day:
Outlook Temp. Humidity Windy Play
Sunny   Cool  High     True  ?
First, find the likelihood of the two classes:

    For "yes" = 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
    For "no"  = 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206

Converting these into probabilities by normalization:

    Pr["yes"] = 0.0053 / (0.0053 + 0.0206) = 0.205
    Pr["no"]  = 0.0206 / (0.0053 + 0.0206) = 0.795
So, we aren't playing golf today.
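A quick way to check that arithmetic is to run it (a minimal sketch; the fractions are read straight off the summary tables above):

    # Likelihoods of the two classes for the new day (Sunny, Cool, High, True)
    like_yes = 2/9 * 3/9 * 3/9 * 3/9 * 9/14        # ~0.0053
    like_no  = 3/5 * 1/5 * 4/5 * 3/5 * 5/14        # ~0.0206

    # Normalize so the two probabilities sum to 1
    total = like_yes + like_no
    print(like_yes / total, like_no / total)       # ~0.205 vs ~0.795: "no" wins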
More generally, the above is just an application of Bayes' Theorem.
                Pr[E | H] * Pr[H]
    Pr[H | E] = -----------------
                      Pr[E]
With the naive assumption that the pieces of evidence E1..En are independent, given H, the likelihood of the combined evidence is just a product:

                Pr[E1 | H] * Pr[E2 | H] * ... * Pr[En | H] * Pr[H]
    Pr[H | E] = --------------------------------------------------
                                     Pr[E]
We used this above. Here's our evidence:
Outlook Temp. Humidity Windy Play
Sunny   Cool  High     True  ?
Here's the probability for "yes":
    Pr[yes | E] = Pr[Outlook = Sunny | yes] *
                  Pr[Temperature = Cool | yes] *
                  Pr[Humidity = High | yes] *
                  Pr[Windy = True | yes] *
                  Pr[yes] / Pr[E]

                = (2/9 * 3/9 * 3/9 * 3/9 * 9/14) / Pr[E]
Return the classification with the highest probability. (Pr[E] is the same for every class, so it can be ignored when comparing them, or removed by normalizing so the class probabilities sum to one, as we did above.)
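Pulling the pieces together, here is a minimal Naive Bayes sketch in Python (the names train, classify, and the variables are mine; it reuses the same data list as the counting sketch above and ignores practical details such as Laplace smoothing for zero counts):

    from collections import defaultdict

    # The 14 (outlook, temperature, humidity, windy, play) rows from weather.symbolic.arff
    data = [
        ("rainy","cool","normal","TRUE","no"),     ("rainy","mild","high","TRUE","no"),
        ("sunny","hot","high","FALSE","no"),       ("sunny","hot","high","TRUE","no"),
        ("sunny","mild","high","FALSE","no"),      ("overcast","cool","normal","TRUE","yes"),
        ("overcast","hot","high","FALSE","yes"),   ("overcast","hot","normal","FALSE","yes"),
        ("overcast","mild","high","TRUE","yes"),   ("rainy","cool","normal","FALSE","yes"),
        ("rainy","mild","high","FALSE","yes"),     ("rainy","mild","normal","FALSE","yes"),
        ("sunny","cool","normal","FALSE","yes"),   ("sunny","mild","normal","TRUE","yes"),
    ]

    def train(rows):
        """Collect the counts behind Pr[H] and Pr[Ei|H]."""
        class_counts, value_counts = defaultdict(int), defaultdict(int)
        for *values, klass in rows:
            class_counts[klass] += 1
            for col, value in enumerate(values):
                value_counts[(col, value, klass)] += 1
        return class_counts, value_counts

    def classify(instance, class_counts, value_counts):
        """Return Pr[H | E] for each class H, normalized so the probabilities sum to 1."""
        n = sum(class_counts.values())
        likelihoods = {}
        for klass, count in class_counts.items():
            like = count / n                                          # Pr[H]
            for col, value in enumerate(instance):
                like *= value_counts[(col, value, klass)] / count     # Pr[Ei | H]
            likelihoods[klass] = like
        total = sum(likelihoods.values())                             # stands in for Pr[E]
        return {klass: like / total for klass, like in likelihoods.items()}

    class_counts, value_counts = train(data)
    print(classify(("sunny", "cool", "high", "TRUE"), class_counts, value_counts))
    # -> roughly {'no': 0.795, 'yes': 0.205}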
Bayes classifiers perform well, but they do not explain their performance. If you ask "what is going on? how does it make its decisions?", there is no answer except to browse the complicated frequency-count tables.
Q: So, how do we learn a decision tree whose leaves are classifications and whose internal nodes are tests on attributes? For example, here is a tree learned from data about car prices:
curb-weight <= 2660 :
|   curb-weight <= 2290 :
|   |   curb-weight <= 2090 :
|   |   |   length <= 161 : price=6220
|   |   |   length > 161 : price=7150
|   |   curb-weight > 2090 : price=8010
|   curb-weight > 2290 :
|   |   length <= 176 : price=9680
|   |   length > 176 :
|   |   |   normalized-losses <= 157 : price=10200
|   |   |   normalized-losses > 157 : price=15800
curb-weight > 2660 :
|   width <= 68.9 : price=16100
|   width > 68.9 : price=25500
entropy(p1,p2,...) = Σ -p_i * log2(p_i)     ;; log2 = log base 2
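As a sanity check, here is a tiny Python version of that formula (a sketch; p = 0 terms are skipped, since 0 * log2(0) is taken to be 0):

    from math import log2

    def entropy(*probs):
        """entropy(p1, p2, ...) = sum of -p * log2(p)."""
        return -sum(p * log2(p) for p in probs if p > 0)

    print(entropy(0.5, 0.5))      # maximally mixed up: 1.0 bit
    print(entropy(9/14, 5/14))    # the un-split play column, info([9,5]): ~0.940 bits
    print(entropy(2/5, 3/5))      # e.g. the "sunny" branch, info([2,3]): ~0.971 bits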
Back to tree learning...
Which feature's splits generate class symbols that are less mixed up?
Which is the best attribute to split on?
Compare the number of bits required to encode the splits with the number of bits required to encode the un-split data.
e.g. Outlook = sunny:    info([2,3]) = entropy(2/5, 3/5) = 0.971 bits
     Outlook = overcast: info([4,0]) = entropy(4/4, 0/4) = 0.0   bits
     Outlook = rainy:    info([3,2]) = entropy(3/5, 2/5) = 0.971 bits

Expected info for Outlook = weighted sum of the above
                          = 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971
                          = 0.693 bits
Computing the information gain:

    gain(Outlook) = info([9,5]) - expected info for Outlook
                  = 0.940 - 0.693
                  = 0.247 bits

Doing the same for the other attributes gives gain(Temperature) = 0.029, gain(Humidity) = 0.152, and gain(Windy) = 0.048 bits, so Outlook is the best attribute to split on first.
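The same numbers fall out of a few lines of Python (a sketch; info and gain are my names, and the per-branch [yes, no] counts are read off the summary tables above):

    from math import log2

    def info(counts):
        """Entropy, in bits, of a list of class counts, e.g. info([2, 3])."""
        n = sum(counts)
        return -sum(c / n * log2(c / n) for c in counts if c > 0)

    def gain(branches, unsplit):
        """Information gain = info(un-split data) - weighted info of the split branches."""
        n = sum(sum(branch) for branch in branches)
        expected = sum(sum(branch) / n * info(branch) for branch in branches)
        return info(unsplit) - expected

    print(gain([[2, 3], [4, 0], [3, 2]], [9, 5]))   # Outlook:     ~0.247 bits
    print(gain([[2, 2], [4, 2], [3, 1]], [9, 5]))   # Temperature: ~0.029 bits
    print(gain([[3, 4], [6, 1]],         [9, 5]))   # Humidity:    ~0.152 bits
    print(gain([[6, 2], [3, 3]],         [9, 5]))   # Windy:       ~0.048 bits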
Repeatedly split recursively: at each node, choose the attribute with the highest information gain, split the data on its values, and recurse on each subset until the instances at a node all have the same class (or there are no attributes left to test), as sketched below.
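Here is one way that recursion might look in Python (an ID3-style sketch under the simplifying assumptions of this example: all names are mine, attributes are discrete, and there is no pruning):

    from collections import Counter, defaultdict
    from math import log2

    def info(rows):
        """Entropy of the class column (the last field of each row)."""
        counts = Counter(row[-1] for row in rows)
        return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

    def best_attribute(rows, attributes):
        """Pick the attribute whose split leaves the classes least mixed up."""
        def expected_info(attr):
            branches = defaultdict(list)
            for row in rows:
                branches[row[attr]].append(row)
            return sum(len(b) / len(rows) * info(b) for b in branches.values())
        return min(attributes, key=expected_info)    # least expected info = most gain

    def grow(rows, attributes):
        """Return a leaf (a class label) or an (attribute, {value: subtree}) node."""
        if len({row[-1] for row in rows}) == 1 or not attributes:
            return Counter(row[-1] for row in rows).most_common(1)[0][0]
        attr = best_attribute(rows, attributes)
        branches = defaultdict(list)
        for row in rows:
            branches[row[attr]].append(row)
        rest = [a for a in attributes if a != attr]
        return attr, {value: grow(subset, rest) for value, subset in branches.items()}

    # grow(data, [0, 1, 2, 3]) on the 14 weather rows picks Outlook (column 0) as the root.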