Learning

Bayes Classifiers

A Bayes classifier is a simple statistics-based learning scheme.

Advantages:

Assumptions

It has some drawbacks: it can offer conclusions, but it is poor at explaining how those conclusions were reached. We will get back to that below.

Example

weather.symbolic.arff

outlook  temperature  humidity   windy   play
-------  -----------  --------   -----   ----
rainy    cool        normal    TRUE    no
rainy    mild        high      TRUE    no
sunny    hot         high      FALSE   no
sunny    hot         high      TRUE    no
sunny    mild        high      FALSE   no
overcast cool        normal    TRUE    yes
overcast hot         high      FALSE   yes
overcast hot         normal    FALSE   yes
overcast mild        high      TRUE    yes
rainy    cool        normal    FALSE   yes
rainy    mild        high      FALSE   yes
rainy    mild        normal    FALSE   yes
sunny    cool        normal    FALSE   yes
sunny    mild        normal    TRUE    yes

This data can be summarized as follows:


           Outlook            Temperature           Humidity
====================   =================   =================
          Yes    No            Yes   No            Yes    No
Sunny       2     3     Hot     2     2    High      3     4
Overcast    4     0     Mild    4     2    Normal    6     1
Rainy       3     2     Cool    3     1
          -----------         ---------            ----------
Sunny     2/9   3/5     Hot   2/9   2/5    High    3/9   4/5
Overcast  4/9   0/5     Mild  4/9   2/5    Normal  6/9   1/5
Rainy     3/9   2/5     Cool  3/9   1/5

            Windy        Play
=================    ========
      Yes     No     Yes   No
False 6      2       9     5
True  3      3
      ----------   ----------
False  6/9    2/5   9/14  5/14
True   3/9    3/5

So, what happens on a new day:

Outlook       Temp.         Humidity    Windy         Play
Sunny         Cool          High        True          ?

First find the likelihood of the two classes by multiplying the relevant fractions from the tables above, then the class priors:

Likelihood of yes = 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
Likelihood of no  = 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206

Converting these to probabilities by normalizing:

P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(no)  = 0.0206 / (0.0053 + 0.0206) = 0.795

So, we aren't playing golf today.

Bayes' rule

More generally, the above is just an application of Bayes' Theorem:

P(H | E) = P(E | H) * P(H) / P(E)

where H is a hypothesis (here, a class such as play=yes) and E is the evidence (the attribute values of the new day). The "naive" part is the assumption that the attributes are conditionally independent given the class, so P(E | H) is just the product of the individual P(Ei | H) terms. Since P(E) is the same for every class, it can be ignored when comparing classes.

Return the classification with the highest probability.
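To make that concrete, here is a minimal gawk sketch of a naive Bayes classifier for the symbolic "arff" format used throughout these notes. It is only a sketch: the function name "classify", the hard-coded test instance, and the tiny 1/1000 term used to dodge zero counts are choices made for this illustration, not part of the WHICH/WHICH2 code below.

/@attribute/ { Name[++Names]=$2 }       # remember attribute names; Names counts them
/@data/      { In=1; FS=","; next }     # from here on, fields are comma-separated
/@|^$/       { next }                   # skip other header lines and blanks
In           { Rows++
               class = $NF              # the class is the last column
               H[class]++
               for(i=1; i < NF; i++)
                   F[i, $i, class]++ }  # count (feature, value, class) triples
END          { print classify("sunny,cool,high,TRUE") }

function classify(test,    cell,class,i,like,best,bestClass) {
    split(test, cell, ",")
    best = -1
    for(class in H) {
        like = H[class] / Rows                                 # class prior
        for(i=1; i < Names; i++)                               # one conditional per feature
            like *= (F[i, cell[i], class] + 1/1000) / H[class]
        if (like > best) { best = like; bestClass = class }
    }
    return bestClass
}

Run it over the weather data (e.g. gawk -f naive.awk weather.nominal.arff, where naive.awk is a file name chosen here) and it should print "no", matching the hand calculation above.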

Decision Trees

Bayes classifiers perform well, but they do not explain their performance. If you ask "what is going on? how does it make its decisions?", there is no answer except to browse the complicated frequency count tables.

Q: So, how to learn a decision tree whose leaves are classifications and whose internal nodes are tests on attributes?

curb-weight <= 2660 :
|   curb-weight <= 2290 :
|   |   curb-weight <= 2090 :
|   |   |   length <= 161 : price=6220
|   |   |   length >  161 : price=7150
|   |   curb-weight >  2090 : price=8010
|   curb-weight >  2290 :
|   |   length <= 176 : price=9680
|   |   length >  176 :
|   |   |   normalized-losses <= 157 : price=10200
|   |   |   normalized-losses >  157 : price=15800
curb-weight >  2660 :
|   width <= 68.9 : price=16100
|   width >  68.9 : price=25500

Preliminaries: Entropy
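Entropy measures how mixed up a set of class symbols is. Given class proportions p1, p2, ..., pn, the entropy is -p1*log2(p1) - p2*log2(p2) - ... - pn*log2(pn), measured in bits: zero when all the symbols are the same, maximal when the classes are evenly mixed. Here is a small gawk sketch (the helper name "entropy" is mine, not part of the WHICH code below):

# Entropy, in bits, of an array of class counts.
# E.g. the weather data has 9 yes's and 5 no's: entropy([9,5]) is about 0.940.
function entropy(counts,    i,n,p,e) {
    for(i in counts) n += counts[i]
    for(i in counts) {
        p = counts[i] / n
        if (p > 0)
            e -= p * log(p)/log(2)   # log2 via change of base
    }
    return e
}
BEGIN { Counts[1]=9; Counts[2]=5; print entropy(Counts) }   # prints ~0.940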

Iterative Dichotomization

Back to tree learning...

Example

Which feature split generates symbols that are less mixed up?

Which is the best attribute to split on?

Compare the number of bits required to encode the splits with the number of bits required to encode the un-split data.

e.g. Outlook = sunny: 2 yes and 3 no, so info([2,3]) = entropy(2/5, 3/5) = 0.971 bits

Outlook = overcast: 4 yes and 0 no, so info([4,0]) = entropy(1, 0) = 0 bits

Outlook = rainy: 3 yes and 2 no, so info([3,2]) = entropy(3/5, 2/5) = 0.971 bits

Expected info for Outlook = weighted sum of the above = (5/14)*0.971 + (4/14)*0 + (5/14)*0.971 = 0.693 bits

Computing the information gain: gain(Outlook) = info([9,5]) - 0.693 = 0.940 - 0.693 = 0.247 bits. Doing the same for the other attributes gives gain(Temperature) = 0.029, gain(Humidity) = 0.152, and gain(Windy) = 0.048, so Outlook is the best attribute to split on first.
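As a quick sanity check of that arithmetic, here is a tiny gawk sketch that reuses the hypothetical entropy() helper sketched in the Preliminaries section (both would need to be loaded into the same program):

# Information gain of splitting the 14-row weather data on Outlook.
BEGIN {
    Whole[1]=9; Whole[2]=5                 # all rows: 9 yes, 5 no
    Sunny[1]=2; Sunny[2]=3                 # outlook=sunny:    2 yes, 3 no
    Over[1]=4;  Over[2]=0                  # outlook=overcast: 4 yes, 0 no
    Rainy[1]=3; Rainy[2]=2                 # outlook=rainy:    3 yes, 2 no
    expected = 5/14*entropy(Sunny) + 4/14*entropy(Over) + 5/14*entropy(Rainy)
    print "gain(Outlook) = " (entropy(Whole) - expected)   # prints ~0.247
}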

Repeatedly split recursively:
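For the weather data, splitting first on Outlook and then repeating the gain calculation inside each branch yields a tree like the following (drawn here by hand, in the same style as the price tree above); the overcast branch is already pure, so it needs no further tests:

outlook = sunny
|   humidity = high : no
|   humidity = normal : yes
outlook = overcast : yes
outlook = rainy
|   windy = TRUE : no
|   windy = FALSE : yes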

Problem with tree learning: BIG bushy trees. Any way to learn simpler models?

Learning Simple Rules

The KEYS heuristic: a few variables set the rest. That is, in any model or data set, we only need to find the small number of things that most affect the output.

Example. In the following, each feature range is scored by how much it contributes to the goal of maximizing defects. We will write a learner that focuses on just the top-scoring things. WHICH and WHICH2 go to that space, prune anything scoring less than 10% of the max, then just try mixing up the remaining small number of most important ideas.

Now we'll implement a stochastic variant of beam search. The code is in Awk (an interpreted version of "C") so it won't be as fast as LISP. But it does illustrate many of the issues of real world learning.

WHICH is a stochastic rule learner first implemented by Zach Milton in his WVU master's thesis. This program builds rules by ranking ideas, then repeatedly building new ideas by picking and combining two old ideas (favoring those with higher ranks). New ideas generated in this way are ranked and thrown back into the same pot as the old ideas so, if they are any good, they might be picked and extended in subsequent rounds. Alternatively, if the new idea stinks, it gets buried by the better ideas and is ignored.

One important aspect of the following is that the scoring routines for ideas are completely separate from the rest of the code (see the "score1" function). Hence, it is a simple matter to try out different search biases.

function score(i,rows,data,   \
               cols,row,a,b,c,d,triggered,pd,pf,prec,acc,support,fits) {
    a=b=c=d=Pinch # stop divide-by-zero errors
    cols=Name[0]  # the class is in the last column
    for(row=1;row<=rows;row++) {
        triggered = matched(row,cols,data,i)
        if (data[row,cols]) {               # a goal-class row
            if (triggered) {d++} else {b++}
        } else {                            # any other row
            if (triggered) {c++} else {a++}
        }
    } 
    fits    = c + d                         # how many rows the rule fired on
    pd      = d/(b+d)
    pf      = a/(a+c)
    prec    = d/(c+d)
    acc     = (a+d)/(a+b+c+d)
    support = (c+d)/(a+b+c+d)
    return score1(pd,pf,prec,acc,support,fits)
}
function score1(pd,pf,prec,acc,support,fits) {
    if (fits <= OverFitted)                 # fires on too few rows: reject
        return 0
    if (Eval==1) return acc
    if (Eval==2) return 2 * pd * prec/(pd+prec)
    if (Eval==3) return 2 * pd * pf/(pd+pf)
    if (Eval==4) return support * 2 * pd * pf/(pd+pf)
    return support * 2 * pd * prec/(pd+prec)
}
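For example (a sketch of my own, not something in the shipped which2.awk), a "precise and frequent" bias could be added to score1 with one extra line, using an otherwise-unused Eval value such as 6:

    if (Eval==6) return support * prec   # hypothetical new bias: precise rules that fire often

It would then be selected by setting Eval=6 (e.g. alongside the other constants in the BEGIN block below).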

Various students (Mr. Milton, and students in this class last year) implemented WHICH and found it surprisingly slow. For example, in the above, note that scoring one rule means running it over the entire data set.

WHICH2 is a variant of WHICH that avoids certain runtime traps of WHICH.

  1. Rather than sorting new ideas straight away into the old ones, it generates (say) 20 new ones at each round before sorting them back into the old.
  2. WHICH2 keeps a cache of how ideas scored before. If we re-score an old idea, we just return the cached score (no need to recompute it). The result is pretty zippy: under a second for the 2000+ records of titanic.arff. Not bad for an interpreted system.
To call the system, get the source code and call it as follows:
gawk -f which2.awk weather.nominal.arff 

The input format is "arff", which looks like this:

@relation weather.nominal

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

The output looks like what we see below. "Candidates" are ideas that look promising and "score" ranks the candidates. If the max score does not improve from the last round, then "lives" decreases.

Each round tries random combinations of the stuff from prior rounds (favoring those things with higher scores). Hence, at round 1 the candidates are the single ranges found by the initial ranking plus pairwise combinations of them; later on (see the later rounds below) the candidates can grow into longer combinations.

Each round prunes the candidates so that only the better candidates survive to round+1.

To read the following, note that a candidate like "1,overcast" means "feature 1 = overcast" (that is, outlook = overcast); longer candidates list several such feature/value pairs.

---| round 1 |-----------------------------------
max   : 0
lives : 5

candidate: 1,overcast
candidate: 1,overcast,2,cool
candidate: 1,overcast,3,normal
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,3,normal
candidate: 2,cool,4,FALSE
candidate: 2,mild
candidate: 2,mild,3,normal
candidate: 2,mild,4,FALSE
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE

score[ 1,overcast,3,normal ] = 0.052266444
score[ 2,mild,3,normal ] = 0.052630836
score[ 2,mild,4,FALSE ] = 0.071545436
score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,mild ] = 0.22873855
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298

---| round 2 |-----------------------------------
max   : 0.4038298
lives : 5

candidate: 1,overcast
candidate: 1,overcast,2,mild
candidate: 1,overcast,3,normal
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,2,mild
candidate: 2,cool,3,normal
candidate: 2,cool,3,normal,4,FALSE
candidate: 2,mild
candidate: 2,mild,3,normal
candidate: 2,mild,3,normal,4,FALSE
candidate: 2,mild,4,FALSE
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE

score[ 2,cool,3,normal,4,FALSE ] = 0.052911406
score[ 2,mild,4,FALSE ] = 0.071545436
score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,mild ] = 0.22873855
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298
score[ 2,cool,2,mild ] = 0.52651565

---| round 3 |-----------------------------------
max   : 0.52651565
lives : 5

candidate: 1,overcast
candidate: 1,overcast,2,cool,2,mild
candidate: 1,overcast,2,mild,4,FALSE
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,2,mild
candidate: 2,cool,2,mild,3,normal
candidate: 2,cool,2,mild,3,normal,4,FALSE
candidate: 2,cool,2,mild,4,FALSE
candidate: 2,cool,3,normal
candidate: 2,cool,3,normal,4,FALSE
candidate: 2,mild
candidate: 2,mild,3,normal,4,FALSE
candidate: 2,mild,4,FALSE
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE

score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,cool,2,mild,4,FALSE ] = 0.20504833
score[ 2,mild ] = 0.22873855
score[ 2,cool,2,mild,3,normal ] = 0.28613204
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298
score[ 2,cool,2,mild ] = 0.52651565

---| round 4 |-----------------------------------
max   : 0.52651565
lives : 4

candidate: 1,overcast
candidate: 1,overcast,2,cool,2,mild
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,2,mild
candidate: 2,cool,2,mild,3,normal
candidate: 2,cool,2,mild,3,normal,4,FALSE
candidate: 2,cool,2,mild,4,FALSE
candidate: 2,cool,3,normal
candidate: 2,cool,4,FALSE
candidate: 2,mild
candidate: 2,mild,3,normal
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE

score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,cool,2,mild,4,FALSE ] = 0.20504833
score[ 2,mild ] = 0.22873855
score[ 2,cool,2,mild,3,normal ] = 0.28613204
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298
score[ 2,cool,2,mild ] = 0.52651565

---| round 5 |-----------------------------------
max   : 0.52651565
lives : 3

candidate: 1,overcast
candidate: 1,overcast,2,cool,3,normal
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,2,mild
candidate: 2,cool,2,mild,3,normal
candidate: 2,cool,2,mild,3,normal,4,FALSE
candidate: 2,cool,2,mild,4,FALSE
candidate: 2,cool,3,normal
candidate: 2,cool,3,normal,4,FALSE
candidate: 2,cool,4,FALSE
candidate: 2,mild
candidate: 2,mild,4,FALSE
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE

score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,cool,2,mild,4,FALSE ] = 0.20504833
score[ 2,mild ] = 0.22873855
score[ 2,cool,2,mild,3,normal ] = 0.28613204
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298
score[ 2,cool,2,mild ] = 0.52651565

---| round 6 |-----------------------------------
max   : 0.52651565
lives : 2

candidate: 1,overcast
candidate: 1,overcast,2,cool,2,mild
candidate: 1,overcast,3,normal,4,FALSE
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,2,mild
candidate: 2,cool,2,mild,3,normal
candidate: 2,cool,2,mild,3,normal,4,FALSE
candidate: 2,cool,2,mild,4,FALSE
candidate: 2,cool,3,normal
candidate: 2,cool,3,normal,4,FALSE
candidate: 2,mild
candidate: 2,mild,4,FALSE
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE

score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,cool,2,mild,4,FALSE ] = 0.20504833
score[ 2,mild ] = 0.22873855
score[ 2,cool,2,mild,3,normal ] = 0.28613204
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298
score[ 2,cool,2,mild ] = 0.52651565

---| round 7 |-----------------------------------
max   : 0.52651565
lives : 1

candidate: 1,overcast
candidate: 1,overcast,3,normal,4,FALSE
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,2,mild
candidate: 2,cool,2,mild,3,normal
candidate: 2,cool,2,mild,3,normal,4,FALSE
candidate: 2,cool,2,mild,4,FALSE
candidate: 2,cool,3,normal
candidate: 2,cool,4,FALSE
candidate: 2,mild
candidate: 2,mild,3,normal
candidate: 2,mild,3,normal,4,FALSE
candidate: 2,mild,4,FALSE
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE

score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,cool,2,mild,4,FALSE ] = 0.20504833
score[ 2,mild ] = 0.22873855
score[ 2,cool,2,mild,3,normal ] = 0.28613204
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298
score[ 2,cool,2,mild ] = 0.52651565

---| round 8 |-----------------------------------
max   : 0.52651565
lives : 0

candidate: 1,overcast
candidate: 1,overcast,4,FALSE
candidate: 2,cool
candidate: 2,cool,2,mild
candidate: 2,cool,2,mild,3,normal
candidate: 2,cool,2,mild,3,normal,4,FALSE
candidate: 2,cool,2,mild,4,FALSE
candidate: 2,cool,3,normal
candidate: 2,cool,3,normal,4,FALSE
candidate: 2,cool,4,FALSE
candidate: 2,mild
candidate: 2,mild,3,normal
candidate: 2,mild,4,FALSE
candidate: 3,normal
candidate: 3,normal,4,FALSE
candidate: 4,FALSE

score[ 2,cool ] = 0.13229759
score[ 2,cool,3,normal ] = 0.13260519
score[ 1,overcast ] = 0.17629991
score[ 3,normal,4,FALSE ] = 0.17631895
score[ 2,cool,2,mild,4,FALSE ] = 0.20504833
score[ 2,mild ] = 0.22873855
score[ 2,cool,2,mild,3,normal ] = 0.28613204
score[ 3,normal ] = 0.37500686
score[ 4,FALSE ] = 0.4038298
score[ 2,cool,2,mild ] = 0.52651565

Some constants:

BEGIN  {
    Goal="yes";     # Best = the Goal class. Rest = anything else
    Seed=1          # Random number seed.
    More = 1.02;    # Improvement means at least a 2% growth
    Lives=5;        # If no improvement after five rounds, give up
    Dull=0.1;       # Ignore candidates with score < Dull*MaxScore
    Beam=10;        # Only the top (say) 10 candidates survive to the next round
    Samples=20;     # Pick this number of pairs of candidates from the last round
    Pinch = 1/1000; # Add a random number of up to "Pinch" to each score
    OverFitted=3;   # When do we prune a rule that matches on too few instances?
    CONVFMT="%.8g"; # Increase the string size for array contents so we can see the Pinch
    IGNORECASE=1; 
    _ = SUBSEP;
    C=","
}

Data entry. Pretty routine stuff.

/@attribute/ {Name[++Name[0]]=$2}
             {gsub(/[ \t]*/,"")} # no blanks 
             {gsub(/#.*/,"")}    # no comments
/^$/         {next}              # no blank lines
/@data/      {In=1;FS=","; srand(Seed)}
/@/          {next}
In           {Rows++; 
              train(All,Best,H,Rows,Data,$NF==Goal)}

function train(all,best,h,rows,d,class,   i) {
    h[class]++
    for(i=1;i<=NF;i++)  {
        if ($i == "?") 
            continue;
        if (i == NF) {
            d[rows,i]=class
        } else {
            d[rows,i]=$i
            all[i,$i]++
            if (class)                
                best[i,$i]++ }}
}
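To make the bookkeeping concrete: after training on weather.nominal.arff, H[1]=9 and H[0]=5 (the counts of goal and non-goal rows), All[1,"overcast"]=4 (overcast appears in four rows) and Best[1,"overcast"]=4 (all four of those rows are play=yes), and Data[row,col] holds the raw table with the class column recoded to 1 or 0.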

Now we can begin. In round0, offer a rough ranking of the ranges. In subsequent rounds, randomly select and combine ranges from prior rounds.

END   {round0(All,Best,H,Hot0); # make some initial guess
       rounds(1,0,Lives,Hot0,Rows,Data,Out)
      }

In round0, score by a small variant of Bayes. Given two classes (best=b and rest=r), we could use b/(b+r), but this can be confused by a low-frequency problem (when "b" is, say, twice as common as "r" but "b" and "r" occur very rarely in the data).

So we'll add a support term by squaring "b" (the larger "b" is, the more often we have actually seen it), and evaluate by b^2/(b+r).

function round0(all,best,h,hot,   i,b,r,s,memo,scores) {
    for(i in best) {
        r = (all[i] - best[i])/h[0]   # frequency of this range in the "rest" rows
        b = best[i]/h[1]              # frequency of this range in the "best" rows
        if (b > r) {
            s = b^2/(b+r) + rand()*Pinch
            memo[s] = i
            scores[i]= s 
        }}
   chop(scores,memo,hot) # prune the dull candidates
}
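For example, on the weather data, outlook=overcast appears in 4 of the 9 "best" (play=yes) rows and 0 of the 5 "rest" rows, so b = 4/9, r = 0/5, and round0 scores it at b^2/(b+r) = 4/9, about 0.44 (plus up to a Pinch of random noise).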

Given some score[key]=number and memo[number]=key, sort the scores and return the top Beam number of keys, pruning all keys less than Dull times the max score.

function chop(score0,memo,out,   sorted,n,i) {
    n=asort(score0,sorted)                # sort the scores into ascending order
    for(i=n;i>=1;i--) {
        if (sorted[i] <= sorted[n]*Dull)  # too dull (and so is everything below it)
            break;
        if (i <= n - Beam)                # keep at most Beam candidates
            break
        out[memo[sorted[i]]] = sorted[i]
    }
}
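With the defaults above (Beam=10, Dull=0.1), at most ten candidates survive a round and nothing survives whose score is at or below a tenth of the round's best score. In round 1 of the weather run shown earlier, for instance, the best score is about 0.40, so any candidate scoring 0.04 or less would be chopped.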

In subsequent rounds, score all the candidates by running each combination over the data (see the "score" function). Note the "scores" parameter that caches prior calculations of the score. This speeds up the code by a factor of four (for large data sets).

function rounds(round,max0,lives,hot0,rows,data,out,scores,   \
                max,i,sample,hot1,s,memo,hot2) {
    if (round == 1)
        max=0
    else { # terminate if we have stopped improving
        max = most(hot0)    
        lives =  (max > (max0*More)) ? Lives : lives - 1
        if(lives < 0) { # on termination, copy input to out
            for(i in hot0)
                out[i] = hot0[i]
            return max }
    }
    print "\n---| round: " round " seed: " Seed " |-----------------------------------"
    print "max   : " max "\nlives : " lives 
    normalize(hot0)           # make all the counts n = 1..100
    explode(hot0,sample)      # copy items n times
    twos(sample,Samples,hot1) # pick pairs at random from that sample
    for(i in hot0)            # add in the last round's ideas
        hot1[i] = i;
    values(hot1,"candidate")
    for(i in hot1) {          # score the new picks and the last round's ideas
        s = (i in scores) ? scores[i] : score(i,rows,data) + rand()*Pinch
        memo[s] = i
        scores[i] = s
        }
    chop(scores,memo,hot2)    # prune the dull candidates
    o(hot2,"score","-n -k 5")
    return rounds(round+1,max,lives,hot2,rows,data,out,scores)
}

Randomly pick pairs and combine them. Note that, in the following code, the picks come from "sample" where an item may be repeated many times (so things that are often repeated are more likely to be picked).

"n" times, pick two things from "sample" and store them in "sampled"

function twos(sample,n,sampled,  pair) {
    while(n--) {
        pair= two(sample)
        sampled[pair]=pair
        }
}

Pick two things at random. Try not to pick the same thing twice. Return the combination of the two things you pick.

function two(sample, tries, this, that) {
    this = one(sample)
    if(tries == 9) # nine lives
        return this
    that = one(sample)
    if (this == that)
        return two(sample,tries + 1)
     else
        return combine(this,that)
}

Combine two rules. Don't repeat any ranges. Sort them so that all the ranges of the same feature fall together.

function combine(this,that,   used,tmp,out,n,i) {
    split(this "," that,tmp,",")
    n=asort(tmp)
    out = tmp[1]
    used[tmp[1]]=1
    for(i=2;i<=n;i++)
        if (!used[tmp[i]]) {
            out = out "," tmp[i]
            used[tmp[i]] = 1
            }
    return out
}
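For example (writing the internal separator as "," for readability, as the print routines below do), combining "2,cool,4,FALSE" with "2,mild" gives "2,cool,2,mild,4,FALSE": duplicate ranges appear only once, and the two temperature ranges end up adjacent.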

Score a rule by finding its matching rows in data. Given "this" of the form "f1_v1,f2_v2,....", see if "row" matches "this". Assumes that disjunctions are modeled as contiguous values from the same feature (this is guaranteed by "combine"). Hence, whenever we move to a new feature, we need to check that at least one of the values mentioned with the old feature was found.

function matched(row,cols,data,this,    col,n,goals,pair,f0,f,status) {
    n=split(this,goals,",")
    for(col=1;col<=n;col++) {
        split(goals[col],pair,_)
        f = pair[1]
        status[f] += data[row,f] == pair[2]
        if (f0 && (f != f0) && !status[f0]) 
            return 0
        f0 = f
    }
    return status[f]
}
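For example, the rule displayed as "2,cool,2,mild,4,FALSE" reads as (temperature = cool or mild) and (windy = FALSE). It matches the row "rainy,mild,high,FALSE,yes" (mild satisfies the temperature disjunction and windy is FALSE) but not "sunny,hot,high,FALSE,no", since neither cool nor mild holds there.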

Here are some interesting utils:

Given an array a[i]=n, fill up "out" with "n" copies of "i". This creates a "sample" of things from which we can pick randomly, biased by the relative frequencies "n". The total size of the sample is stored in "sample[0]".

function explode(a, out,i,j) {
    for(i in a)
        for(j=1;j<=a[i];j++)
            out[++out[0]] = i
}
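For example, if a[x]=40 and a[y]=60 then "out" ends up with 100 entries (40 copies of x, 60 of y) and out[0]=100, so picking a random index in 1..out[0] returns y about 60% of the time.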

Pick any item at random from "sample". Assumes that the sample size is in array element "sample[0]".

function one(sample, any) {
    any = int(rand()* sample[0])+1
    return sample[any]
}

Boring utils

Given an array a[i]=num, normalize all the numbers to integers 0..100

function normalize(a, i,sum) {
    for(i in a) sum += a[i]
    for(i in a) a[i] = int(100*a[i]/sum)
}

Combine a feature and a range into a single key.

function fv(f,v) { return f _ v }

Find the max item in an array

function most(a,   i,max) {
    max = -1000000000
    for(i in a)
        if (a[i] > max)
            max = a[i];
    return max
}

Print array values.

function values(a,s,what,  i,j,com) {
	print ""
    what = what ? what : ""
    com = "sort " what
    for(i in a) {
        j=a[i];
        gsub(_,",",j);
        print s": " j | com;
    }
    close(com)
 }	

Print an array, sorted by "what"

function o(a,s,what,   i,j,com) {
   print ""
    what = what ? what : ""
    com = "sort " what
    for(i in a) {
        j=i;
        gsub(_,",",j);
        print s"[ "j" ] = " a[i] | com;
    }
    close(com)
 }