(These are some quick notes. For more details, see OURMINE.)
A Bayes classifier is a simple, statistics-based learning scheme.
Advantages: it trains and classifies quickly, handles missing values gracefully, and (as discussed at the end of these notes) performs surprisingly well in practice.

Assumptions: all attributes are equally important, and attributes are statistically independent given the class. (Independence is rarely true in practice, but the scheme works well anyway; see below.)
It has some drawbacks: it can offer conclusions, but it is poor at explaining how those conclusions were reached. That is something we'll get back to below.
weather.symbolic.arff

  outlook    temperature  humidity  windy  play
  -------    -----------  --------  -----  ----
  rainy      cool         normal    TRUE   no
  rainy      mild         high      TRUE   no
  sunny      hot          high      FALSE  no
  sunny      hot          high      TRUE   no
  sunny      mild         high      FALSE  no
  overcast   cool         normal    TRUE   yes
  overcast   hot          high      FALSE  yes
  overcast   hot          normal    FALSE  yes
  overcast   mild         high      TRUE   yes
  rainy      cool         normal    FALSE  yes
  rainy      mild         high      FALSE  yes
  rainy      mild         normal    FALSE  yes
  sunny      cool         normal    FALSE  yes
  sunny      mild         normal    TRUE   yes
This data can be summarized as follows:
         Outlook                Temperature            Humidity
  =====================      =================      =================
             Yes   No                Yes   No                 Yes   No
  Sunny       2     3         Hot     2     2         High     3     4
  Overcast    4     0         Mild    4     2         Normal   6     1
  Rainy       3     2         Cool    3     1
  ---------------------      -----------------      -----------------
  Sunny      2/9   3/5        Hot    2/9   2/5        High    3/9   4/5
  Overcast   4/9   0/5        Mild   4/9   2/5        Normal  6/9   1/5
  Rainy      3/9   2/5        Cool   3/9   1/5

         Windy                     Play
  =================           ===========
             Yes   No           Yes   No
  False       6     2            9     5
  True        3     3
  -----------------           -----------
  False      6/9   2/5         9/14  5/14
  True       3/9   3/5
So, what happens on a new day:
  Outlook   Temp.   Humidity   Windy   Play
  Sunny     Cool    High       True    ?
First, find the likelihood of each of the two classes.
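Using the fractions from the summary table above:

  likelihood of yes = 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
  likelihood of no  = 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206

Converting these into probabilities by normalization:

  Pr(yes) = 0.0053 / (0.0053 + 0.0206) = 20.5%
  Pr(no)  = 0.0206 / (0.0053 + 0.0206) = 79.5%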
So, we aren't playing golf today.
More generally, the above is just an application of Bayes' Theorem.
              Pr(E | H) * Pr(H)
  Pr(H | E) = -----------------
                    Pr(E)
Naive Bayes is "naive" because it assumes the pieces of evidence E1, E2, ..., En are conditionally independent given H, so Pr(E | H) factors into a product:

              Pr(E1 | H) * Pr(E2 | H) * ... * Pr(En | H) * Pr(H)
  Pr(H | E) = ---------------------------------------------------
                                   Pr(E)
  Outlook   Temp.   Humidity   Windy   Play
  Sunny     Cool    High       True    ?
  Pr(yes | E) = Pr(Outlook     = Sunny | yes) *
                Pr(Temperature = Cool  | yes) *
                Pr(Humidity    = High  | yes) *
                Pr(Windy       = True  | yes) *
                Pr(yes) / Pr(E)

              = (2/9 * 3/9 * 3/9 * 3/9 * 9/14) / Pr(E)
Return the classification with the highest probability. Note that Pr(E) is the same for every class, so it can be ignored when comparing them.
One practical issue: multiplying lots of small numbers quickly underflows to zero, so implementations usually add the logarithms of the probabilities instead (as in the code below).
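As a tiny illustration (the probability 0.01 and the repeat count 500 are made-up numbers): the raw product underflows to zero, while the sum of logs stays usable.

  BEGIN {
    p = 1; lp = 0
    for(i = 1; i <= 500; i++) { p *= 0.01; lp += log(0.01) }
    print p, lp              # prints: 0 -2302.59  (the raw product underflowed)
  }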
Missing values are a problem for any learner. Naive Bayes' treatment of missing values is particularly elegant.
Example:

  Outlook   Temp.   Humidity   Windy   Play
  ?         Cool    High       True    ?
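The missing attribute is simply left out of the product. Using the same summary table as before:

  likelihood of yes = 3/9 * 3/9 * 3/9 * 9/14 = 0.0238
  likelihood of no  = 1/5 * 4/5 * 3/5 * 5/14 = 0.0343

  Pr(yes) = 0.0238 / (0.0238 + 0.0343) = 41%
  Pr(no)  = 0.0343 / (0.0238 + 0.0343) = 59%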
What if an attribute value never occurs with some class value (e.g. "Outlook = overcast" never appears with class "no")? Then that conditional probability is estimated as zero, and a single zero wipes out the whole product.
So use an estimator for low-frequency attribute values
And use an estimator for low-frequency classes
Magic numbers: m=2, k=1
And we'll return to the low frequency problem, below.
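Written out, the estimators used in the likelihood() function below are (N[class] is the number of training instances of that class; Instances, Klasses, K and M are as in the code):

  Pr(class)         = (N[class] + K) / (Instances + K * Klasses)

                      Count[class, attribute, value] + M * Pr(class)
  Pr(value | class) = ----------------------------------------------
                                     N[class] + M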
Here's the pseudocode of the Naive Bayes classifier preferred by Yang03 (p4).
  function train(    i) {              # "i" is a local variable
    Instances++
    if (++N[$Klass] == 1) Klasses++    # count the rows in each class
    for(i=1; i<=Attr; i++)
      if (i != Klass)
        if ($i !~ /\?/)                # skip missing values
          symbol(i, $i, $Klass)
  }
  function symbol(col, value, klass) {
    Count[klass, col, value]++         # count attribute values, per class
  }
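The functions above expect a few globals. As a minimal sketch (the comma-separated format and class-in-last-column layout are assumptions, not part of the original), training could be driven like this:

  BEGIN    { FS = "," }                # assumed: comma-separated data, no header
  FNR == 1 { Attr = NF; Klass = NF }   # assumed: the class is the last field
           { train() }                 # every record updates the counts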
When testing, find the likelihood of each hypothetical class and return the one that is most likely.
The (K,M) variables handle low frequency cases.
  function likelihood(l,    klass,i,temp,prior,what,like) {
    like = -10000000000                # smaller than any log-likelihood
    for(klass in N) {
      prior = (N[klass] + K) / (Instances + K*Klasses)
      temp  = log(prior)               # work with logs to avoid underflow
      for(i=1; i<=Attr; i++) {
        if (i != Klass)
          if ($i !~ /\?/)              # skip missing values
            temp += log((Count[klass,i,$i] + M*prior) / (N[klass] + M))
      }
      l[klass] = temp
      if (temp >= like) { like = temp; what = klass }
    }
    return what                        # the most likely class
  }
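Continuing the same sketch, testing is one more rule appended after training (again assuming the field layout from the training driver above):

  # assumed: a second pass over the test records, after training is done
  { print "predicted class:", likelihood(l) }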
The above code assumes that the attributes are discrete. If you have numeric attributes then either discretize the values (e.g. sort them, then group them into bins of, say, 30 values each), or use a Gaussian approximation (usually, discretization beats Gaussians).
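A minimal sketch of that discretization step, assuming the numeric column has already been extracted and sorted (e.g. with sort -n), one value per line; the bin size of 30 follows the note above:

  BEGIN { BinSize = 30 }                          # group sorted values into bins of 30
        { print $1, "bin" int((NR - 1)/BinSize) }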
The probability density function for the normal (Gaussian) distribution is defined by the mean and standardDev (standard deviation).

Given: the mean and standardDev of a numeric attribute's values within one class,

Then: the density at a value x is

  f(x) = 1 / (standardDev * sqrt(2*pi)) * e^( -(x - mean)^2 / (2 * standardDev^2) )
  function mean(sum,n) {
    return sum/n
  }
  function standardDeviation(sumSq,sum,n) {
    return sqrt((sumSq - ((sum*sum)/n)) / (n-1))
  }
  function gaussianPdf(mean,standardDev,x) {
    pi = 1068966896 / 340262731        #: good to 17 decimal places
    return 1/(standardDev*sqrt(2*pi)) * exp(-1*(x-mean)^2/(2*standardDev*standardDev))
  }
For example:
  outlook    temperature  humidity  windy  play
  -------    -----------  --------  -----  ----
  sunny           85         85     FALSE  no
  sunny           80         90     TRUE   no
  overcast        83         86     FALSE  yes
  rainy           70         96     FALSE  yes
  rainy           68         80     FALSE  yes
  rainy           65         70     TRUE   no
  overcast        64         65     TRUE   yes
  sunny           72         95     FALSE  no
  sunny           69         70     FALSE  yes
  rainy           75         80     FALSE  yes
  sunny           75         70     TRUE   yes
  overcast        72         90     TRUE   yes
  overcast        81         75     FALSE  yes
  rainy           71         91     TRUE   no
This generates the following statistics:
         Outlook                Temperature              Humidity
  =====================      =================        =================
             Yes   No              Yes    No               Yes    No
  Sunny       2     3               83    85                86    85
  Overcast    4     0               70    80                96    90
  Rainy       3     2               68    65                80    70
                                    64    72                65    95
                                    69    71                70    91
                                    75                      80
                                    75                      70
                                    72                      90
                                    81                      75
  ---------------------      -----------------        -----------------
  Sunny      2/9   3/5       mean     73  74.6         mean     79.1  86.2
  Overcast   4/9   0/5       std dev 6.2   7.9         std dev  10.2   9.7
  Rainy      3/9   2/5

         Windy                     Play
  =================           ===========
             Yes   No           Yes   No
  False       6     2            9     5
  True        3     3
  -----------------           -----------
  False      6/9   2/5         9/14  5/14
  True       3/9   3/5
Example density value:
  Outlook   Temp.   Humidity   Windy   Play
  Sunny      66        90      true    ?
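For instance, plugging the new day's numbers into gaussianPdf() with the class "yes" statistics from the table above:

  f(temperature = 66 | yes) = 1/(6.2 * sqrt(2*pi)) * e^(-(66 - 73)^2 / (2 * 6.2^2))
                            = 0.0340

  f(humidity = 90 | yes)    = 1/(10.2 * sqrt(2*pi)) * e^(-(90 - 79.1)^2 / (2 * 10.2^2))
                            = 0.0221

These densities then take the place of the discrete fractions in the product.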
Note: missing values during training are simply not included in the calculation of the mean and standard deviation.
BTW, an alternative to the above is to apply some discretization policy to the data; e.g. Yang03. Such discretization is good practice since it can dramatically improve the performance of a Naive Bayes classifier (see Dougherty95).
Why does Naive Bayes work so well? Domingos97 offer one analysis:
Their three attribute example is given below. For the generalized case, see Domingos97.
Consider a Boolean concept, described by three attributes A, B and C.
Assume that the two classes, denoted by + and -, are equiprobable:
P(+) = P(-) = 1/2
Let A and C be independent, and let A = B (i.e., A and B are completely dependent). Therefore B should be ignored, and the optimal classification procedure for a test instance is to assign it to (i) class + if
P(A|+) * P(C|+) - P(A|-) * P(C|-) > 0,
and (ii) to class - if the inequality has the opposite sign, and (iii) to an arbitrary class if the two sides are equal.
Note that the Bayesian classifier will take B into account as if it were independent of A, and this will be equivalent to counting A twice. Thus, the Bayesian classifier will assign the instance to class + if
P(A|+)^2 * P(C|+) - P(A|-)^2 * P(C|-) > 0,
and to - otherwise.
Applying Bayes' theorem, P(A|+) can be re-expressed as
P(A) * P(+|A)/P(+)
and similarly for the other probabilities.
Since P(+) = P(-), after canceling like terms this leads to the equivalent expressions
P(+|A) * P(+|C ) - P(-|A) * P(-|C ) > 0
for the optimal decision, and
P(+|A)^2 * P(+|C ) - P(-|A)^2 * P(-|C) > 0
for the Bayesian classifier. Let
  P(+|A) = p
  P(+|C) = q
Then class + should be selected when
pq - (1 - p)*(1 - q) > 0
which is equivalent to
q > 1 - p [Optimal Bayes]
With the Bayesian classifier, it will be selected when
p^2 * q - (1 - p)^2 * (1 - q) > 0
which, collecting the q terms, is equivalent to
q > (1 - p)^2 / (p^2 + (1 - p)^2) [Simple Bayes]
The two curves are shown in the following figure. The remarkable fact is that, even though the independence assumption is decisively violated because B = A, the Bayesian classifier disagrees with the optimal procedure only in the two narrow regions that are above one of the curves and below the other; everywhere else it performs the correct classification.
Thus, for all problems where (p, q) does not fall in those two small regions, the Bayesian classifier is effectively optimal.
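As a quick numeric check of that claim (not part of the original notes; the grid resolution is an arbitrary choice), the two decision rules above can be compared over the unit square:

  # estimate the fraction of (p,q) space where the simple-Bayes decision
  # differs from the optimal decision
  BEGIN {
    steps = 1000
    for(i = 1; i <= steps; i++)
      for(j = 1; j <= steps; j++) {
        p = (i - 0.5)/steps
        q = (j - 0.5)/steps
        optimal = (q > 1 - p)
        simple  = (q > (1 - p)^2 / (p^2 + (1 - p)^2))
        if (optimal != simple) disagree++
      }
    print "disagreement fraction:", disagree/(steps*steps)
  }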