A Bayes classifier is a simple, statistics-based learning scheme.
Advantages: it is fast to train and to classify (a single pass over the data collects all the frequency counts it needs), and in practice it often performs surprisingly well.
Assumptions: all attributes are equally important, and the attributes are statistically independent given the class value. The independence assumption is almost never strictly true, but the scheme still works well.
It has some drawbacks: it can offer conclusions, but it is poor at explaining how those conclusions were reached. That is something we'll get back to below.
weather.symbolic.arff
outlook  temperature humidity windy play
-------- ----------- -------- ----- ----
rainy    cool        normal   TRUE  no
rainy    mild        high     TRUE  no
sunny    hot         high     FALSE no
sunny    hot         high     TRUE  no
sunny    mild        high     FALSE no
overcast cool        normal   TRUE  yes
overcast hot         high     FALSE yes
overcast hot         normal   FALSE yes
overcast mild        high     TRUE  yes
rainy    cool        normal   FALSE yes
rainy    mild        high     FALSE yes
rainy    mild        normal   FALSE yes
sunny    cool        normal   FALSE yes
sunny    mild        normal   TRUE  yes
This data can be summarized as follows:
Outlook              Temperature        Humidity
==================   ================   =================
         Yes   No            Yes  No            Yes   No
Sunny     2     3    Hot      2    2    High      3     4
Overcast  4     0    Mild     4    2    Normal    6     1
Rainy     3     2    Cool     3    1
------------------   ----------------   -----------------
Sunny    2/9   3/5   Hot     2/9  2/5   High     3/9   4/5
Overcast 4/9   0/5   Mild    4/9  2/5   Normal   6/9   1/5
Rainy    3/9   2/5   Cool    3/9  1/5
Windy                Play
=================    ==========
        Yes   No     Yes    No
False    6     2      9      5
True     3     3
-----------------    ----------
False   6/9   2/5    9/14   5/14
True    3/9   3/5
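Those counts are easy to compute mechanically. Here is a short Python sketch (the names data, counts, and class_counts are mine, not part of the notes) that tallies how often each attribute value co-occurs with each class:

    from collections import defaultdict

    # The 14 (outlook, temperature, humidity, windy, play) rows from weather.symbolic.arff
    data = [
        ("rainy","cool","normal","TRUE","no"),     ("rainy","mild","high","TRUE","no"),
        ("sunny","hot","high","FALSE","no"),       ("sunny","hot","high","TRUE","no"),
        ("sunny","mild","high","FALSE","no"),      ("overcast","cool","normal","TRUE","yes"),
        ("overcast","hot","high","FALSE","yes"),   ("overcast","hot","normal","FALSE","yes"),
        ("overcast","mild","high","TRUE","yes"),   ("rainy","cool","normal","FALSE","yes"),
        ("rainy","mild","high","FALSE","yes"),     ("rainy","mild","normal","FALSE","yes"),
        ("sunny","cool","normal","FALSE","yes"),   ("sunny","mild","normal","TRUE","yes"),
    ]
    attributes = ["outlook", "temperature", "humidity", "windy"]

    class_counts = defaultdict(int)           # how many "yes" days, how many "no" days
    counts = defaultdict(int)                 # counts[(attribute, value, class)]
    for *values, play in data:
        class_counts[play] += 1
        for attr, value in zip(attributes, values):
            counts[(attr, value, play)] += 1

    print(dict(class_counts))                       # {'no': 5, 'yes': 9}
    print(counts[("outlook", "sunny", "yes")],      # 2 ...
          counts[("outlook", "sunny", "no")])       # ... and 3, matching the table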
So, what happens on a new day:
Outlook Temp. Humidity Windy Play
Sunny   Cool  High     True  ?
First, find the likelihood of the two classes:

    For "yes" = 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
    For "no"  = 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206

Converting these into probabilities by normalization:

    Pr["yes"] = 0.0053 / (0.0053 + 0.0206) = 0.205
    Pr["no"]  = 0.0206 / (0.0053 + 0.0206) = 0.795
So, we aren't playing golf today.
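A quick way to check that arithmetic is to run it (a minimal sketch; the fractions are read straight off the summary tables above):

    # Likelihoods of the two classes for the new day (Sunny, Cool, High, True)
    like_yes = 2/9 * 3/9 * 3/9 * 3/9 * 9/14        # ~0.0053
    like_no  = 3/5 * 1/5 * 4/5 * 3/5 * 5/14        # ~0.0206

    # Normalize so the two probabilities sum to 1
    total = like_yes + like_no
    print(like_yes / total, like_no / total)       # ~0.205 vs ~0.795: "no" wins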
More generally, the above is just an application of Bayes' Theorem.
                Pr[E | H] * Pr[H]
    Pr[H | E] = -----------------
                      Pr[E]
With the naive assumption that the pieces of evidence E1..En are independent, given H, the likelihood of the combined evidence is just a product:

                Pr[E1 | H] * Pr[E2 | H] * ... * Pr[En | H] * Pr[H]
    Pr[H | E] = --------------------------------------------------
                                     Pr[E]
We used this above. Here's our evidence:
Outlook Temp. Humidity Windy Play
Sunny   Cool  High     True  ?
Here's the probability for "yes":
    Pr[yes | E] = Pr[Outlook = Sunny | yes] *
                  Pr[Temperature = Cool | yes] *
                  Pr[Humidity = High | yes] *
                  Pr[Windy = True | yes] *
                  Pr[yes] / Pr[E]

                = (2/9 * 3/9 * 3/9 * 3/9 * 9/14) / Pr[E]
Return the classification with the highest probability. (Pr[E] is the same for every class, so it can be ignored when comparing them, or removed by normalizing so the class probabilities sum to one, as we did above.)
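Pulling the pieces together, here is a minimal Naive Bayes sketch in Python (the names train, classify, and the variables are mine; it reuses the same data list as the counting sketch above and ignores practical details such as Laplace smoothing for zero counts):

    from collections import defaultdict

    # The 14 (outlook, temperature, humidity, windy, play) rows from weather.symbolic.arff
    data = [
        ("rainy","cool","normal","TRUE","no"),     ("rainy","mild","high","TRUE","no"),
        ("sunny","hot","high","FALSE","no"),       ("sunny","hot","high","TRUE","no"),
        ("sunny","mild","high","FALSE","no"),      ("overcast","cool","normal","TRUE","yes"),
        ("overcast","hot","high","FALSE","yes"),   ("overcast","hot","normal","FALSE","yes"),
        ("overcast","mild","high","TRUE","yes"),   ("rainy","cool","normal","FALSE","yes"),
        ("rainy","mild","high","FALSE","yes"),     ("rainy","mild","normal","FALSE","yes"),
        ("sunny","cool","normal","FALSE","yes"),   ("sunny","mild","normal","TRUE","yes"),
    ]

    def train(rows):
        """Collect the counts behind Pr[H] and Pr[Ei|H]."""
        class_counts, value_counts = defaultdict(int), defaultdict(int)
        for *values, klass in rows:
            class_counts[klass] += 1
            for col, value in enumerate(values):
                value_counts[(col, value, klass)] += 1
        return class_counts, value_counts

    def classify(instance, class_counts, value_counts):
        """Return Pr[H | E] for each class H, normalized so the probabilities sum to 1."""
        n = sum(class_counts.values())
        likelihoods = {}
        for klass, count in class_counts.items():
            like = count / n                                          # Pr[H]
            for col, value in enumerate(instance):
                like *= value_counts[(col, value, klass)] / count     # Pr[Ei | H]
            likelihoods[klass] = like
        total = sum(likelihoods.values())                             # stands in for Pr[E]
        return {klass: like / total for klass, like in likelihoods.items()}

    class_counts, value_counts = train(data)
    print(classify(("sunny", "cool", "high", "TRUE"), class_counts, value_counts))
    # -> roughly {'no': 0.795, 'yes': 0.205}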
Bayes classifiers perform well, but they do not explain their performance. If you ask "what is going on? how does it make its decisions?", there is no answer except to browse the complicated frequency-count tables.
Q: So, how do we learn a decision tree whose leaves are classifications and whose internal nodes are tests on attributes? For example, here is a tree learned from data about car prices:
curb-weight <= 2660 :
|   curb-weight <= 2290 :
|   |   curb-weight <= 2090 :
|   |   |   length <= 161 : price=6220
|   |   |   length > 161 : price=7150
|   |   curb-weight > 2090 : price=8010
|   curb-weight > 2290 :
|   |   length <= 176 : price=9680
|   |   length > 176 :
|   |   |   normalized-losses <= 157 : price=10200
|   |   |   normalized-losses > 157 : price=15800
curb-weight > 2660 :
|   width <= 68.9 : price=16100
|   width > 68.9 : price=25500
entropy(p1,p2,...) = Σ -p_i * log2(p_i)     ;; log2 = log base 2
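As a sanity check, here is a tiny Python version of that formula (a sketch; p = 0 terms are skipped, since 0 * log2(0) is taken to be 0):

    from math import log2

    def entropy(*probs):
        """entropy(p1, p2, ...) = sum of -p * log2(p)."""
        return -sum(p * log2(p) for p in probs if p > 0)

    print(entropy(0.5, 0.5))      # maximally mixed up: 1.0 bit
    print(entropy(9/14, 5/14))    # the un-split play column, info([9,5]): ~0.940 bits
    print(entropy(2/5, 3/5))      # e.g. the "sunny" branch, info([2,3]): ~0.971 bits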
Back to tree learning...
Which feature's splits generate class symbols that are less mixed up?
Which is the best attribute to split on?
Compare the number of bits required to encode the splits with the number of bits required to encode the un-split data.
e.g. Outlook = sunny:    info([2,3]) = entropy(2/5, 3/5) = 0.971 bits
     Outlook = overcast: info([4,0]) = entropy(4/4, 0/4) = 0.0   bits
     Outlook = rainy:    info([3,2]) = entropy(3/5, 2/5) = 0.971 bits

Expected info for Outlook = weighted sum of the above
                          = 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971
                          = 0.693 bits
Computing the information gain:

    gain(Outlook) = info([9,5]) - expected info for Outlook
                  = 0.940 - 0.693
                  = 0.247 bits

Doing the same for the other attributes gives gain(Temperature) = 0.029, gain(Humidity) = 0.152, and gain(Windy) = 0.048 bits, so Outlook is the best attribute to split on first.
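The same numbers fall out of a few lines of Python (a sketch; info and gain are my names, and the per-branch [yes, no] counts are read off the summary tables above):

    from math import log2

    def info(counts):
        """Entropy, in bits, of a list of class counts, e.g. info([2, 3])."""
        n = sum(counts)
        return -sum(c / n * log2(c / n) for c in counts if c > 0)

    def gain(branches, unsplit):
        """Information gain = info(un-split data) - weighted info of the split branches."""
        n = sum(sum(branch) for branch in branches)
        expected = sum(sum(branch) / n * info(branch) for branch in branches)
        return info(unsplit) - expected

    print(gain([[2, 3], [4, 0], [3, 2]], [9, 5]))   # Outlook:     ~0.247 bits
    print(gain([[2, 2], [4, 2], [3, 1]], [9, 5]))   # Temperature: ~0.029 bits
    print(gain([[3, 4], [6, 1]],         [9, 5]))   # Humidity:    ~0.152 bits
    print(gain([[6, 2], [3, 3]],         [9, 5]))   # Windy:       ~0.048 bits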
Repeatedly split recursively: at each node, choose the attribute with the highest information gain, split the data on its values, and recurse on each subset until the instances at a node all have the same class (or there are no attributes left to test), as sketched below.
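Here is one way that recursion might look in Python (an ID3-style sketch under the simplifying assumptions of this example: all names are mine, attributes are discrete, and there is no pruning):

    from collections import Counter, defaultdict
    from math import log2

    def info(rows):
        """Entropy of the class column (the last field of each row)."""
        counts = Counter(row[-1] for row in rows)
        return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

    def best_attribute(rows, attributes):
        """Pick the attribute whose split leaves the classes least mixed up."""
        def expected_info(attr):
            branches = defaultdict(list)
            for row in rows:
                branches[row[attr]].append(row)
            return sum(len(b) / len(rows) * info(b) for b in branches.values())
        return min(attributes, key=expected_info)    # least expected info = most gain

    def grow(rows, attributes):
        """Return a leaf (a class label) or an (attribute, {value: subtree}) node."""
        if len({row[-1] for row in rows}) == 1 or not attributes:
            return Counter(row[-1] for row in rows).most_common(1)[0][0]
        attr = best_attribute(rows, attributes)
        branches = defaultdict(list)
        for row in rows:
            branches[row[attr]].append(row)
        rest = [a for a in attributes if a != attr]
        return attr, {value: grow(subset, rest) for value, subset in branches.items()}

    # grow(data, [0, 1, 2, 3]) on the 14 weather rows picks Outlook (column 0) as the root.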