LectureNaiveBayes

NaiveBayes Classifiers 101

Introduction

(This are some quick notes. For more details, see OURMINE.

A Bayes classifier is a simple statistical-based learning scheme.

Advantages:

Tiny memory footprint
Fast training, fast learning
Simplicity
Often works surprisingly well

Assumptions

Learning is done best via statistical modeling
Attributes are
- equally important
- statistically independent (given the class value)
- This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute (if the class is known)
Although based on assumptions that are almost never correct, this scheme works well in practice Domingos97

It has some drawbacks: it can offer conclusions put it is poor at explaining how those conclusions were reached. But that is something we'll get back to below.

Example

weather.symbolic.arff

outlook  temperature  humidity   windy   play
-------  -----------  --------   -----   ----
rainy    cool         normal     TRUE    no
rainy    mild         high       TRUE    no
sunny    hot          high       FALSE   no
sunny    hot          high       TRUE    no
sunny    mild         high       FALSE   no
overcast cool         normal     TRUE    yes
overcast hot          high       FALSE   yes
overcast hot          normal     FALSE   yes
overcast mild         high       TRUE    yes
rainy    cool         normal     FALSE   yes
rainy    mild         high       FALSE   yes
rainy    mild         normal     FALSE   yes
sunny    cool         normal     FALSE   yes
sunny    mild         normal     TRUE    yes

This data can be summarized as follows:

           Outlook            Temperature           Humidity
====================   =================   =================
          Yes    No            Yes   No            Yes    No
Sunny       2     3     Hot     2     2    High      3     4
Overcast    4     0     Mild    4     2    Normal    6     1
Rainy       3     2     Cool    3     1
          -----------         ---------            ----------
Sunny     2/9   3/5     Hot   2/9   2/5    High    3/9   4/5
Overcast  4/9   0/5     Mild  4/9   2/5    Normal  6/9   1/5
Rainy     3/9   2/5     Cool  3/9   1/5

            Windy        Play
=================    ========
      Yes     No     Yes   No
False 6      2       9     5
True  3      3
      ----------   ----------
False  6/9    2/5   9/14  5/14
True   3/9    3/5

So, what happens on a new day:

Outlook       Temp.         Humidity    Windy         Play
Sunny         Cool          High        True          ?%%

First find the likelihood of the two classes

For "yes" = 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
For "no" = 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206
Conversion into a probability by normalization:
- P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
- P("no") = 0.0206 / (0.0053 + 0.0206) = 0.795

So, we aren't playing golf today.

Bayes' rule

More generally, the above is just an application of Bayes' Theorem.

Probability of event H given evidence E:

                   Pr(E | H ) * Pr(H)
      Pr(H | E) =  -------------------
                        Pr(E)

A priori probability of H= Pr(H)
- Probability of event before evidence has been seen
A posteriori probability of H= Pr[H|E]
- Probability of event after evidence has been seen
Classification learning: what's the probability of the class given an instance?
- Evidence E = instance
- Event H = class value for instance
Naive Bayes assumption: evidence can be split into independent parts (i.e. attributes of instance!

                  Pr(E1 | H )* Pr(E2 | H ) * ....  *Pr(En | H ) * Pr(H )
      Pr(H | E) = ---------------------------------------------------
                                     Pr(E)

We used this above. Here's our evidence:

      Outlook       Temp.         Humidity    Windy         Play
      Sunny         Cool          High        True          ?

Here's the probability for "yes":

      Pr( yes | E) = Pr(Outlook     = Sunny | yes) *
                     Pr(Temperature = Cool  | yes) *
                     Pr(Humidity     = High  | yes) * Pr( yes)
                     Pr(Windy       = True  | yes) * Pr(yes) / Pr(E)
                   = (2/9 * 3/9 * 3/9 * 3/9)       * 9/14)   / Pr(E)

Return the classification with highest probability

Probability of the evidence Pr(E)
- Constant across all possible classifications;
- So, when comparing N classifications, it cancels out

Numerical errors

From multiplication of lots of small numbers

Use the standard fix: don't multiply the numbers, add the logs

Missing values

Missing values are a problem for any learner. Naive Bayes' treatment of missing values is particularly elegant.

During training: instance is not included in frequency count for attribute value-class combination
During classification: attribute will be omitted from calculation

Example: Outlook    Temp.    Humidity    Windy    Play
         ?          Cool     High        True     ?%%

Likelihood of "yes" = 3/9 * 3/9 * 3/9 * 9/14 = 0.0238
Likelihood of "no" = 1/5 * 4/5 * 3/5 * 5/14 = 0.0343
P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
P("no") = 0.0343 / (0.0238 + 0.0343) = 59%

The "low-frequencies problem"

What if an attribute value doesn't occur with every class value (e.g. "Humidity = high" for class "yes")?

Probability will be zero!
Pr(Humidity = High | yes) = 0
A posteriori probability will also be zero! Pr( yes | E) = 0 (No matter how likely the other values are!)

So use an estimators for low frequency attribute ranges

Add a little "m" to the count for every attribute value-class combination
- The Laplace estimator
- Result: probabilities will never be zero!

And use an estimator for low frequency classes

Add a little "k" to class counts
- The M-estimate

Magic numbers: m=2, k=1

And we'll return to the low frequency problem, below.

Pseudo-code

Here's the pseudo code of the the Naive Bayes classifier preferred by Yang03 (p4).

function train(   i) {
   Instances++
   if (++N[$Klass]==1) Klasses++
   for(i=1;i<=Attr;i++)
     if (i != Klass)
      if ($i !~ /\?/)
         symbol(i,$i,$Klass)
}
function symbol(col,value,klass) {
   Count[klass,col,value]++;
}

When testing, find the likelihood of each hypothetical class and return the one that is most likely.

The (K,M) variables handle low frequency cases.

function likelihood(l,         klass,i,inc,temp,prior,what,like) {
   like = -10000000000;    # smaller than any log
   for(klass in N) {
      prior=(N[klass]+K)/(Instances + (K*Klasses));
      temp= log(prior)
      for(i=1;i<=Attr;i++) {
         if (i != Klass)
            if ( $i !~ /\?/ )
                temp += log((Count[klass,i,$i]+M*prior)/(N[klass]+M))
      }
      l[klass]= temp
      if ( temp >= like ) {like = temp; what=klass}
   }
   return what
}

Handling Numerics

The above code assumes that the attributes are discrete. If you have numeric attributes then either discretize the values (sort, group into sets of size 30), or use a Gaussian approximation (usually, discretization beats Gaussians).

The probability density function for the normal (Gaussian) distribution is defined by the mean and standardDev (standard deviation)

Given:

n: the number of values;
sum: the sum of the values; i.e. sum = sum + value;
sumSq: the sum of the square of the values; i.e. sumSq = sumSq + value*value

Then:

    function mean(sum,n)  {
        return sum/n
    }
    function standardDeviation(sumSq,sum,n)  {
        return sqrt((sumSq-((sum*sum)/n))/(n-1))
    }
    function gaussianPdf(mean,standardDev,x) {
       pi= 1068966896 / 340262731; #: good to 17 decimal places
       return 1/(standardDev*sqrt(2*pi)) ^
                    (-1*(x-mean)^2/(2*standardDev*standardDev))
    }

For example:

outlook  temperature humidity windy play
-------  ----------- -------- ----- ---
sunny    85          85        FALSE no
sunny    80          90        TRUE  no
overcast 83          86        FALSE yes
rainy    70          96        FALSE yes
rainy    68          80        FALSE yes
rainy    65          70        TRUE  no
overcast 64          65        TRUE  yes
sunny    72          95        FALSE no
sunny    69          70        FALSE yes
rainy    75          80        FALSE yes
sunny    75          70        TRUE  yes
overcast 72          90        TRUE  yes
overcast 81          75        FALSE yes
rainy    71          91        TRUE  no

This generates the following statistics:

             Outlook           Temperature               Humidity
=====================    =================      =================
           Yes    No             Yes    No            Yes      No
Sunny       2      3             83     85             86      85
Overcast    4      0             70     80             96      90
Rainy       3      2             68     65             80      70
          -----------            ----------            ----------
Sunny     2/9    3/5    mean     73     74.6  mean     79.1   86.2
Overcast  4/9    0/5    std dev   6.2    7.9  std dev  10.2    9.7
Rainy     3/9    2/5

              Windy            Play
===================     ===========
           Yes   No     Yes     No
False       6     2      9       5
True        3     3
            -------     ----------
False     6/9   2/5     9/14  5/14
True      3/9   3/5

Example density value:

f(temperature=66|yes)= gaussianPdf(73,6.2,66) =0.0340
Classifying a new day:

Outlook    Temp.    Humidity    Windy    Play
Sunny      66       90          true     ?%%

Likelihood of "yes" = 2/9 * 0.0340 * 0.0221 * 3/9 * 9/14 = 0.000036
Likelihood of "no" = 3/5 * 0.0291 * 0.0380 * 3/5 * 5/14 = 0.000136
- P("yes") = 0.000036 / (0.000036 + 0. 000136) = 20.9%
- P("no") = 0. 000136 / (0.000036 + 0. 000136) = 79.1%

Note: missing values during training: not included in calculation of mean and standard deviation

BTW, an alternative to the above is apply some discretization policy to the data; e.g. Yang03. Such discretization is good practice since it can dramatically improve the performance of a Naive Bayes classifier (see Dougherty95.

Not so "Naive" Bayes

Why does Naive Bayes work so well? Domingos97 offer one analysis:

They offer one example with three attributes where the performance where a "Naive" and a "optimal" Bayes perform nearly the same.
They generalized that to conclude that "Naive" Bayes is only really Naive in a vanishingly small number of cases.

Their three attribute example is given below. For the generalized case, see Domingos97.

Consider a Boolean concept, described by three attributes A, B and C .

Assume that the two classes, denoted by + and - are equiprobable

 P(+) = P(-) = 1/2

Let A and C be independent, and let A = B (i.e., A and B are completely dependent). Therefore B should be ignored, and the optimal classification procedure for a test instance is to assign it to (i) class + if

 P(A|+) * P(C|+) -  P(A|-) * P(C|-) > 0,

and (ii) to class (if the inequality has the opposite sign), and (iii) to an arbitrary class if the two sides are equal.

Note that the Bayesian classifier will take B into account as if it was independent from A, and this will be equivalent to counting A twice. Thus, the Bayesian classifier will assign the instance to class + if

 P(A|+)^2 *  P(C|+) -  P(A|-)^2  * P(C|-) > 0,

and to - otherwise.

Applying Bayes' theorem, P(A|+) can be re-expressed as

 P(A) * P(+|A)/P(+)

and similarly for the other probabilities.

Since P(+) = P(-), after canceling like terms this leads to the equivalent expressions

 P(+|A) * P(+|C ) - P(-|A) * P(-|C ) > 0

for the optimal decision, and

 P(+|A)^2 *  P(+|C ) - P(-|A)^2 * P(-|C) > 0

for the Bayesian classifier. Let

P(+|A) = p
P(+|C) = q.

Then class + should be selected when

 pq - (1 - p)*(1 - q) > 0

which is equivalent to

 q > 1 - p   [Optimal Bayes]

With the Bayesian classifier, it will be selected when

 p^2 * q  - (1 - p)^2 *  (1 - q) > 0

which is equivalent to

q >  (1 - p)^2 *  p^2 +(1 - p)^2    [Simple  Bayes]

The two curves are shown in following figure. The remarkable fact is that, even though the independence assumption is decisively violated because B = A, the Bayesian classifier disagrees with the optimal procedure only in the two narrow regions that are above one of the curves and below the other; everywhere else it performs the correct classification.

Thus, for all problems where (p, q) does not fall in those two small regions, the Bayesian classifier is effectively optimal.