A Bayes classifier is a simple statistics-based learning scheme.
Advantages: it trains quickly (one pass over the data to collect frequency counts), it uses evidence from all the attributes, and it is often surprisingly competitive with more sophisticated learners.
Assumptions: the attributes are equally important and statistically independent of one another, given the class. The independence assumption is almost never true in practice, yet the scheme still works remarkably well.
It also has some drawbacks: it can offer conclusions but it is poor at explaining how those conclusions were reached. That is something we'll get back to below.
weather.symbolic.arff
outlook temperature humidity windy play
------- ----------- -------- ----- ----
rainy cool normal TRUE no
rainy mild high TRUE no
sunny hot high FALSE no
sunny hot high TRUE no
sunny mild high FALSE no
overcast cool normal TRUE yes
overcast hot high FALSE yes
overcast hot normal FALSE yes
overcast mild high TRUE yes
rainy cool normal FALSE yes
rainy mild high FALSE yes
rainy mild normal FALSE yes
sunny cool normal FALSE yes
sunny mild normal TRUE yes
This data can be summarized as follows:
             Outlook               Temperature              Humidity
          =============          ==============          ==============
            Yes    No              Yes    No               Yes    No
Sunny        2      3    Hot        2      2    High        3      4
Overcast     4      0    Mild       4      2    Normal      6      1
Rainy        3      2    Cool       3      1
          -------------          --------------          --------------
Sunny       2/9    3/5   Hot       2/9    2/5   High       3/9    4/5
Overcast    4/9    0/5   Mild      4/9    2/5   Normal     6/9    1/5
Rainy       3/9    2/5   Cool      3/9    1/5

             Windy                  Play
          =============          ==========
            Yes    No              Yes    No
False        6      2               9      5
True         3      3
          -------------          ----------
False       6/9    2/5            9/14   5/14
True        3/9    3/5
So, what happens on a new day:
Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?
First, find the likelihood of each of the two classes:

Likelihood of "yes" = 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
Likelihood of "no"  = 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206

Converting these into probabilities by normalization:

Pr["yes"] = 0.0053 / (0.0053 + 0.0206) = 0.205
Pr["no"]  = 0.0206 / (0.0053 + 0.0206) = 0.795
So, we aren't playing golf today.
More generally, the above is just an application of Bayes' Theorem.
Pr[H | E] = ( Pr[E | H] * Pr[H] ) / Pr[E]

If the evidence E splits into pieces E1, E2, ..., En that are conditionally independent given H, this becomes:

Pr[H | E] = ( Pr[E1 | H] * Pr[E2 | H] * ... * Pr[En | H] * Pr[H] ) / Pr[E]
We used this above. Here's our evidence:
Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?
Here's the probability for "yes":
Pr[yes | E] = Pr[Outlook = Sunny | yes] *
              Pr[Temperature = Cool | yes] *
              Pr[Humidity = High | yes] *
              Pr[Windy = True | yes] *
              Pr[yes] / Pr[E]
            = (2/9 * 3/9 * 3/9 * 3/9 * 9/14) / Pr[E]
Return the classification with the highest probability.
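To make the arithmetic concrete, here is a minimal gawk sketch (an illustration only; the file name bayes-demo.awk and everything in it are made up here, and it is not part of the WHICH code below) that hard-codes the conditional probabilities read off the frequency tables above and normalizes them:

# bayes-demo.awk -- sketch: classify the new day <sunny,cool,high,TRUE>
# run with: gawk -f bayes-demo.awk
BEGIN {
  like["yes"] = 2/9 * 3/9 * 3/9 * 3/9 * 9/14   # outlook * temp * humidity * windy * prior
  like["no"]  = 3/5 * 1/5 * 4/5 * 3/5 * 5/14
  total = like["yes"] + like["no"]              # Pr[E], by normalization
  for(class in like)
    printf "Pr[%s | E] = %.3f\n", class, like[class]/total
}

Running it prints roughly 0.205 for "yes" and 0.795 for "no", matching the hand calculation above.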
Bayes classifiers perform well, but they do not explain their performance. If you ask "what is going on? how does it make its decisions?", there is no answer except to browse the complicated frequency count tables.
Q: So how do we learn a decision tree whose leaves are classifications and whose internal nodes are tests on attribute values?
curb-weight <= 2660 :
|   curb-weight <= 2290 :
|   |   curb-weight <= 2090 :
|   |   |   length <= 161 : price=6220
|   |   |   length > 161 : price=7150
|   |   curb-weight > 2090 : price=8010
|   curb-weight > 2290 :
|   |   length <= 176 : price=9680
|   |   length > 176 :
|   |   |   normalized-losses <= 157 : price=10200
|   |   |   normalized-losses > 157 : price=15800
curb-weight > 2660 :
|   width <= 68.9 : price=16100
|   width > 68.9 : price=25500
entropy(p1,p2,...) = Σ_i  -p_i * log2(p_i)     ;; log2 = log base 2
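For example, here is a tiny gawk helper (a standalone sketch, not part of the learners below; the name entropy.awk is made up) that computes the entropy of a list of class counts:

# entropy.awk -- sketch: entropy of the class counts given on the command line
# usage: gawk -f entropy.awk 9 5      (prints about 0.94 bits)
BEGIN {
  for(i=1; i<ARGC; i++) sum += ARGV[i]
  for(i=1; i<ARGC; i++) { p = ARGV[i]/sum; if (p > 0) e -= p * log(p)/log(2) }
  print e " bits"
}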
Back to tree learning...
Which feature split generates symbols that are less mixed up?
Which is the best attribute to split on?
Compare the number of bits required to encode the splits with the number of bits required to encode the un-split data.
e.g. Outlook = sunny covers 2 "yes" and 3 "no" examples:
     info([2,3]) = entropy(2/5, 3/5) = 0.971 bits
Outlook = overcast covers 4 "yes" and 0 "no" examples:
     info([4,0]) = entropy(4/4, 0/4) = 0 bits
Outlook = rainy covers 3 "yes" and 2 "no" examples:
     info([3,2]) = entropy(3/5, 2/5) = 0.971 bits
Expected info for Outlook = weighted sum of the above
     = 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 = 0.693 bits
Computing the information gain = (info before the split) - (info after the split):
     gain(Outlook) = info([9,5]) - 0.693 = 0.940 - 0.693 = 0.247 bits
Repeating this for the other attributes gives gain(Humidity) = 0.152 bits, gain(Windy) = 0.048 bits, and gain(Temperature) = 0.029 bits, so Outlook is the best attribute to split on.
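The same calculation as a short gawk sketch (again just an illustration; gain.awk is not part of the course code):

# gain.awk -- sketch: information gain of splitting the weather data on "outlook"
function e(p)         { return (p > 0) ? -p * log(p)/log(2) : 0 }
function info(yes,no) { return e(yes/(yes+no)) + e(no/(yes+no)) }
BEGIN {
  before = info(9,5)                                          # 0.940 bits
  after  = 5/14*info(2,3) + 4/14*info(4,0) + 5/14*info(3,2)   # 0.693 bits
  printf "gain(Outlook) = %.3f bits\n", before - after        # 0.247 bits
}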
Then repeat, splitting recursively down each branch.
Problem with tree learning: BIG bushy trees. Any way to learn simpler models?
The KEYS heuristic: a few variables set the rest. That is, in any model or data set, we only need to find the small number of things that most affect the output.
Example: in the following, each feature range is scored by how much it contributes to the goal of maximizing defects. We will write a learner that focuses on just the top-scoring ranges. WHICH and WHICH2 go to that space, prune anything that scores less than 10% of the max, and then just try mixing up the remaining, small number of most important ideas.
Now we'll implement a stochastic variant of beam search. The code is in Awk (an interpreted language with "C"-like syntax), so it won't be as fast as LISP, but it does illustrate many of the issues of real-world learning.
WHICH is a stochastic rule learner first implemented by Zach Milton in his WVU master's thesis. This program builds rules by ranking ideas, then repeatedly building new ideas by picking and combining two old ideas (favoring those with higher ranks). New ideas generated in this way are ranked and thrown back into the same pot as the old ideas so, if they are any good, they might be picked and extended in subsequent rounds. Alternatively, if the new idea stinks, it gets buried by the better ideas and is ignored.
One important aspect of the following is that the scoring routines for ideas are completely separate from the rest of the code (see the "score1" function). Hence, it is a simple matter to try out different search biases.
function score(i,rows,data,   cols,row,a,b,c,d,triggered,pd,pf,prec,acc,support,fits) {
  a=b=c=d=Pinch                        # stop divide-by-zero errors
  cols=Name[0]                         # the class is the last column
  for(row=1;row<=rows;row++) {
    triggered = matched(row,cols,data,i)
    if (data[row,cols]) {              # this row belongs to the goal class
      if (triggered) {d++} else {b++}
    } else {                           # this row does not
      if (triggered) {c++} else {a++}
    }
  }
  fits    = c + d
  pd      = d/(b+d)
  pf      = a/(a+c)
  prec    = d/(c+d)
  acc     = (a+d)/(a+b+c+d)
  support = (c+d)/(a+b+c+d)
  return score1(pd,pf,prec,acc,support,fits)
}
function score1(pd,pf,prec,acc,support,fits) {
  if (fits <= OverFitted) return 0     # rule matches too few rows: prune it
  if (Eval==1) return acc
  if (Eval==2) return 2 * pd * prec/(pd+prec)
  if (Eval==3) return 2 * pd * pf/(pd+pf)
  if (Eval==4) return support * 2 * pd * pf/(pd+pf)
  return support * 2 * pd * prec/(pd+prec)
}
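Reading the code: "d" counts rows where the rule fires and the class is the goal class, "b" rows where the class is the goal but the rule stays silent, "c" rows where the rule fires on a non-goal row, and "a" rows where it correctly stays silent. So "pd" is recall, "prec" is precision, and the default Eval setting returns their harmonic mean weighted by support.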
Various students (Mr. Milton and students in this class last year) implemented WHICH and found it surprisingly slow. For example, in the above, note that scoring one rule means running it over the entire data set.
WHICH2 is a variant of WHICH that avoids certain runtime traps of WHICH.
gawk -f which2.awk weather.nominal.arff
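(Note: the script relies on gawk extensions such as asort() and IGNORECASE, so it needs GNU awk rather than a generic awk.)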
The input format is "arff", which looks like this:
@relation weather.nominal
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
The output looks like what we see below. "Candidates" are ideas that look promising and "score" ranks the candidates. If the max score does not improve from the last round, then "lives" decreases.
Each round tries random combinations of the stuff from prior rounds (favoring those things with higher scores). Hence, at round 1, all the candidates are singletons. But later on (see the later rounds below) the candidates can grow to combinations of things.
Each round prunes the candidates so that only the better candidates survive to round+1.
To read the following, note that a candidate like "1,overcast" means "feature 1 (outlook) = overcast", and a candidate that repeats a feature, like "2,cool,2,mild", means "feature 2 (temperature) = cool or mild".
---| round 1 |-----------------------------------
max : 0
lives : 5
candidate: 1,overcast
candidate: 1,overcast,2,cool
candidate: 1,overcast,3,normal
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,3,normal
candidate: 2,cool,4,FALSE
candidate: 2,mild
candidate: 2,mild,3,normal
candidate: 2,mild,4,FALSE
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE
score[ 1,overcast,3,normal ] = 0.052266444
score[ 2,mild,3,normal ] = 0.052630836
score[ 2,mild,4,FALSE ] = 0.071545436
score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,mild ] = 0.22873855
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298

---| round 2 |-----------------------------------
max : 0.4038298
lives : 5
candidate: 1,overcast
candidate: 1,overcast,2,mild
candidate: 1,overcast,3,normal
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,2,mild
candidate: 2,cool,3,normal
candidate: 2,cool,3,normal,4,FALSE
candidate: 2,mild
candidate: 2,mild,3,normal
candidate: 2,mild,3,normal,4,FALSE
candidate: 2,mild,4,FALSE
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE
score[ 2,cool,3,normal,4,FALSE ] = 0.052911406
score[ 2,mild,4,FALSE ] = 0.071545436
score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,mild ] = 0.22873855
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298
score[ 2,cool,2,mild ] = 0.52651565

---| round 3 |-----------------------------------
max : 0.52651565
lives : 5
candidate: 1,overcast
candidate: 1,overcast,2,cool,2,mild
candidate: 1,overcast,2,mild,4,FALSE
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,2,mild
candidate: 2,cool,2,mild,3,normal
candidate: 2,cool,2,mild,3,normal,4,FALSE
candidate: 2,cool,2,mild,4,FALSE
candidate: 2,cool,3,normal
candidate: 2,cool,3,normal,4,FALSE
candidate: 2,mild
candidate: 2,mild,3,normal,4,FALSE
candidate: 2,mild,4,FALSE
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE
score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,cool,2,mild,4,FALSE ] = 0.20504833
score[ 2,mild ] = 0.22873855
score[ 2,cool,2,mild,3,normal ] = 0.28613204
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298
score[ 2,cool,2,mild ] = 0.52651565

---| round 4 |-----------------------------------
max : 0.52651565
lives : 4
candidate: 1,overcast
candidate: 1,overcast,2,cool,2,mild
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,2,mild
candidate: 2,cool,2,mild,3,normal
candidate: 2,cool,2,mild,3,normal,4,FALSE
candidate: 2,cool,2,mild,4,FALSE
candidate: 2,cool,3,normal
candidate: 2,cool,4,FALSE
candidate: 2,mild
candidate: 2,mild,3,normal
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE
score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,cool,2,mild,4,FALSE ] = 0.20504833
score[ 2,mild ] = 0.22873855
score[ 2,cool,2,mild,3,normal ] = 0.28613204
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298
score[ 2,cool,2,mild ] = 0.52651565

---| round 5 |-----------------------------------
max : 0.52651565
lives : 3
candidate: 1,overcast
candidate: 1,overcast,2,cool,3,normal
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,2,mild
candidate: 2,cool,2,mild,3,normal
candidate: 2,cool,2,mild,3,normal,4,FALSE
candidate: 2,cool,2,mild,4,FALSE
candidate: 2,cool,3,normal
candidate: 2,cool,3,normal,4,FALSE
candidate: 2,cool,4,FALSE
candidate: 2,mild
candidate: 2,mild,4,FALSE
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE
score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,cool,2,mild,4,FALSE ] = 0.20504833
score[ 2,mild ] = 0.22873855
score[ 2,cool,2,mild,3,normal ] = 0.28613204
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298
score[ 2,cool,2,mild ] = 0.52651565

---| round 6 |-----------------------------------
max : 0.52651565
lives : 2
candidate: 1,overcast
candidate: 1,overcast,2,cool,2,mild
candidate: 1,overcast,3,normal,4,FALSE
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,2,mild
candidate: 2,cool,2,mild,3,normal
candidate: 2,cool,2,mild,3,normal,4,FALSE
candidate: 2,cool,2,mild,4,FALSE
candidate: 2,cool,3,normal
candidate: 2,cool,3,normal,4,FALSE
candidate: 2,mild
candidate: 2,mild,4,FALSE
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE
score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,cool,2,mild,4,FALSE ] = 0.20504833
score[ 2,mild ] = 0.22873855
score[ 2,cool,2,mild,3,normal ] = 0.28613204
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298
score[ 2,cool,2,mild ] = 0.52651565

---| round 7 |-----------------------------------
max : 0.52651565
lives : 1
candidate: 1,overcast
candidate: 1,overcast,3,normal,4,FALSE
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,2,mild
candidate: 2,cool,2,mild,3,normal
candidate: 2,cool,2,mild,3,normal,4,FALSE
candidate: 2,cool,2,mild,4,FALSE
candidate: 2,cool,3,normal
candidate: 2,cool,4,FALSE
candidate: 2,mild
candidate: 2,mild,3,normal
candidate: 2,mild,3,normal,4,FALSE
candidate: 2,mild,4,FALSE
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE
score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,cool,2,mild,4,FALSE ] = 0.20504833
score[ 2,mild ] = 0.22873855
score[ 2,cool,2,mild,3,normal ] = 0.28613204
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298
score[ 2,cool,2,mild ] = 0.52651565

---| round 8 |-----------------------------------
max : 0.52651565
lives : 0
candidate: 1,overcast
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,2,mild
candidate: 2,cool,2,mild,3,normal
candidate: 2,cool,2,mild,3,normal,4,FALSE
candidate: 2,cool,2,mild,4,FALSE
candidate: 2,cool,3,normal
candidate: 2,cool,3,normal,4,FALSE
candidate: 2,cool,4,FALSE
candidate: 2,mild
candidate: 2,mild,3,normal
candidate: 2,mild,4,FALSE
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE
score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,cool,2,mild,4,FALSE ] = 0.20504833
score[ 2,mild ] = 0.22873855
score[ 2,cool,2,mild,3,normal ] = 0.28613204
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298
score[ 2,cool,2,mild ] = 0.52651565
Some constants:
BEGIN { Goal="yes";      # Best = the Goal class. Rest = anything else
        Seed=1;          # Random number seed.
        More=1.02;       # Improvement means at least a 2% growth
        Lives=5;         # If no improvement after five rounds, give up
        Dull=0.1;        # Ignore candidates with score < Dull*MaxScore
        Beam=10;         # Only the top (say) 10 candidates survive to the next round
        Samples=20;      # Pick this number of pairs of candidates from the last round
        Pinch=1/1000;    # Add a random number of up to "Pinch" to each score
        OverFitted=3;    # When do we prune a rule that matches on too few instances?
        CONVFMT="%.8g";  # Increase the string size for array contents so we can see the Pinch
        IGNORECASE=1;
        _ = SUBSEP;
        C=","
}
Data entry. Pretty routine stuff.
/@attribute/ { Name[++Name[0]] = $2 }
             { gsub(/[ \t]*/,"") }       # no blanks
             { gsub(/#.*/,"") }          # no comments
/^$/         { next }                    # no blank lines
/@data/      { In=1; FS=","; srand(Seed) }
/@/          { next }
In           { Rows++; train(All,Best,H,Rows,Data,$NF==Goal) }

function train(all,best,h,rows,d,class,   i) {
  h[class]++
  for(i=1;i<=NF;i++) {
    if ($i == "?") continue
    if (i == NF) {
      d[rows,i] = class
    } else {
      d[rows,i] = $i
      all[i,$i]++
      if (class) best[i,$i]++
    }
  }
}
Now we can begin. In round0, offer a rough ranking of the ranges. In subsequent rounds, randomly select and combine ranges from prior rounds.
END { round0(All,Best,H,Hot0)             # make some initial guesses
      rounds(1,0,Lives,Hot0,Rows,Data,Out)
}
In round0, score by a small variant of Bayes. Given two classes (best=b and rest=r), we could use b/(b+r), but this can be confused by a low-frequency problem (when "b" is, say, twice as common as "r" but "b" and "r" occur very rarely in the data).
So we'll add a support term by squaring "b" (if "b" is large, then we have seen "b" more often); that is, we evaluate b^2/(b+r).
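For example, humidity=normal appears in 7 of the 14 rows above, 6 times with play=yes and once with play=no, so b = 6/9 ≈ 0.67 and r = 1/5 = 0.2, and that range scores b^2/(b+r) ≈ 0.51 (plus a pinch of random noise).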
function round0(all,best,h,hot,   i,b,r,s,memo,scores) {
  for(i in best) {
    r = (all[i] - best[i])/h[0]
    b = best[i]/h[1]
    if (b > r) {
      s = b^2/(b+r) + rand()*Pinch
      memo[s]   = i
      scores[i] = s
    }
  }
  chop(scores,memo,hot)     # prune the dull candidates
}
Given some score[key]=number and memo[number]=key, sort the scores and return the top Beam number of keys, pruning all keys less than Dull times the max score.
function chop(score0,memo,out,   score,n,i) {
  n = asort(score0,score)
  for(i=n;i>=1;i--) {
    if (score[i] <= score[n]*Dull) break
    if (i <= n - Beam) break
    out[memo[score[i]]] = score[i]
  }
}
In subsequent rounds, score all the candidates by running each combination over the data (see the "score" function). Note the "scores" parameter that caches prior calculations of the score; this speeds up the code by a factor of four (for large data sets).
function rounds(round,max0,lives,hot0,rows,data,out,scores,  \
                max,i,sample,hot1,s,memo,hot2) {
  if (round == 1)
    max = 0
  else {                              # terminate if we have stopped improving
    max   = most(hot0)
    lives = (max > (max0*More)) ? Lives : lives - 1
    if (lives < 0) {                  # on termination, copy the input to "out"
      for(i in hot0) out[i] = hot0[i]
      return max
    }
  }
  print "\n---| round: " round " seed: " Seed " |-----------------------------------"
  print "max : " max "\nlives : " lives
  normalize(hot0)                     # make all the counts n = 1..100
  explode(hot0,sample)                # copy each item n times
  twos(sample,Samples,hot1)           # pick pairs at random from that sample
  for(i in hot0)                      # add in the last round's ideas
    hot1[i] = i
  values(hot1,"candidate")
  for(i in hot1) {                    # score the new picks and the last round's ideas
    s = (i in scores) ? scores[i] : score(i,rows,data) + rand()*Pinch
    memo[s]   = i
    scores[i] = s
  }
  chop(scores,memo,hot2)              # prune the dull candidates
  o(hot2,"score","-n -k 5")
  return rounds(round+1,max,lives,hot2,rows,data,out,scores)
}
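Note that rounds() is tail recursive: each call either stops (when lives drops below zero, after copying the surviving candidates to "out") or calls itself on the pruned candidate set. The "scores" array is threaded through the recursion so that a rule scored in an earlier round is never re-run over the data.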
Randomly pick pairs and combine them. Note that, in the following code, the picks come from "sample" where an item may be repeated many times (so things that are often repeated are more likely to be picked).
"n" times, pick two things from "sample" and store them in "sampled"
function twos(sample,n,sampled,   pair) {
  while(n--) {
    pair = two(sample)
    sampled[pair] = pair
  }
}
Pick two things at random. Try not to pick the same thing twice. Return the combination of the two things you pick.
function two(sample,tries,   this,that) {
  this = one(sample)
  if (tries == 9)            # nine lives
    return this
  that = one(sample)
  if (this == that)
    return two(sample,tries + 1)
  else
    return combine(this,that)
}
Combine two rules. Don't repeat any ranges. Sort them so that all the ranges of the same feature fall together.
function combine(this,that,   used,tmp,out,n,i) {
  split(this "," that,tmp,",")
  n   = asort(tmp)
  out = tmp[1]
  used[tmp[1]] = 1
  for(i=2;i<=n;i++)
    if (!used[tmp[i]]) {
      out = out "," tmp[i]
      used[tmp[i]] = 1
    }
  return out
}
Score a rule by finding its matching rows in data. Given "this" of the form "f1_v1,f2_v2,....", see if "row" matches "this". Assumes that disjunctions are modeled as contiguous values from the same feature (this is guaranteed by "combine"). Hence, whenever we move to a new feature, we need to check that at least one of the values mentioned with the old feature was found.
function matched(row,cols,data,this,   col,n,goals,pair,f0,f,status) {
  n = split(this,goals,",")
  for(col=1;col<=n;col++) {
    split(goals[col],pair,_)
    f = pair[1]
    status[f] += data[row,f] == pair[2]
    if (f0 && (f != f0) && !status[f0])
      return 0
    f0 = f
  }
  return status[f]
}
Here are some interesting utils:
Given an array a[i]=n, fill up "out" with "n" number of "i". This creates a "sample" of things, from which we can pick randomly biased by the relative frequencies of "n". The total size of the sample is stored in "sample[0]"
function explode(a,out,   i,j) {
  for(i in a)
    for(j=1;j<=a[i];j++)
      out[++out[0]] = i
}
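For example, if a["x"]=2 and a["y"]=3 then explode() fills out[1..5] with two copies of "x" and three copies of "y", and leaves the sample size in out[0], so a random index in 1..out[0] picks "y" three times out of five.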
Pick any item at random from "sample". Assumes that the sample size is in array element "sample[0]".
function one(sample,   any) {
  any = int(rand() * sample[0]) + 1
  return sample[any]
}
Boring utils
Given an array a[i]=num, normalize all the numbers to integers 0..100
function normalize(a,   i,sum) {
  for(i in a) sum  += a[i]
  for(i in a) a[i]  = int(100*a[i]/sum)
}
Combine a feature and a range into a single key.
function fv(f,v) { return f _ v }
Find the max item in an array
function most(a,   i,max) {
  max = -1000000000
  for(i in a)
    if (a[i] > max) max = a[i]
  return max
}
print array values
function values(a,s,what,   i,j,com) {
  print ""
  what = what ? what : ""
  com  = "sort " what
  for(i in a) {
    j = a[i]
    gsub(_,",",j)
    print s ": " j | com
  }
  close(com)
}
Print an array, sorted by "what"
function o(a,s,what,   i,j,com) {
  print ""
  what = what ? what : ""
  com  = "sort " what
  for(i in a) {
    j = i
    gsub(_,",",j)
    print s "[ " j " ] = " a[i] | com
  }
  close(com)
}