~ Semantic Search Art ~

C# implementation of Naive Bayes classifier test

Naive Bayes

Naive Bayes is probably the simplest way of categorizing documents. It requires a training set: documents for which the categories have already been determined by an expert. There is no shortage of articles explaining the theoretical part, so I reduce the explanation to a simple example. Let us define our training set as the two documents shown in the table below, where each cell holds the number of occurrences of a word in a document. Presume that each of them belongs to a different category (cat1 and cat2).

doc/word    w1    w2    w3    w4    w5
doc1         8     2     0     0     1
doc2         1     0     0     7     7


We can see that the word w1 occurs in both categories, but we believe that the probability for this word to belong to category 1 is 8/9 and to category 2 is 1/9. For a reason that will become clear later, the probability of occurrence of any word should not be equal to 0, so we must add 1 to every cell where the word count is zero. However, rather than changing only the zero cells, which would distort the relative frequencies unevenly, we add 1 to every cell in the table.

doc/word    w1    w2    w3    w4    w5
doc1         9     3     1     1     2
doc2         2     1     1     8     8
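
In code this step is trivial. A minimal C# sketch of it, where the names counts and smoothed are mine and not taken from the attached project:

    int[][] counts =
    {
        new[] { 8, 2, 0, 0, 1 },   // doc1 (cat1)
        new[] { 1, 0, 0, 7, 7 }    // doc2 (cat2)
    };

    // Add-one (Laplace) smoothing: no zero counts remain.
    int[][] smoothed = new int[counts.Length][];
    for (int c = 0; c < counts.Length; c++)
    {
        smoothed[c] = new int[counts[c].Length];
        for (int w = 0; w < counts[c].Length; w++)
            smoothed[c][w] = counts[c][w] + 1;
    }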


In the next step we convert all frequencies into probabilities within the training set: each column is normalized so that the values for the two categories sum to 1 (for w1, 9/11 = 0.82 and 2/11 = 0.18). We also add another two rows of experimental data, for which we need to determine the category.

doc/word     w1      w2      w3      w4      w5
cat1        0.82    0.75    0.50    0.11    0.20
cat2        0.18    0.25    0.50    0.89    0.80
doc3        2       0       1       0       0
doc4        0       0       1       4       1
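
The conversion is a per-word normalization of the smoothed counts. A short C# sketch (again, the names are only illustrative):

    using System;

    int[][] smoothed =
    {
        new[] { 9, 3, 1, 1, 2 },   // doc1 (cat1) after smoothing
        new[] { 2, 1, 1, 8, 8 }    // doc2 (cat2) after smoothing
    };

    int words = smoothed[0].Length;
    double[][] p = { new double[words], new double[words] };

    // Normalize each column so that cat1 + cat2 = 1 for every word.
    for (int w = 0; w < words; w++)
    {
        double total = smoothed[0][w] + smoothed[1][w];
        p[0][w] = smoothed[0][w] / total;   // w1: 9 / 11 ≈ 0.82
        p[1][w] = smoothed[1][w] / total;   // w1: 2 / 11 ≈ 0.18
    }

    Console.WriteLine(string.Join(" ", p[0]));   // ≈ 0.82 0.75 0.50 0.11 0.20
    Console.WriteLine(string.Join(" ", p[1]));   // ≈ 0.18 0.25 0.50 0.89 0.80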


Now we can calculate the probability for document 3 to be in category 1.

P{doc3 = cat1} = (0.82^2 * 0.5) / const = 0.3362

The constant in the denominator can be ignored, since it is the same for both categories and does not affect the comparison. The probability for document 3 to be in category 2 is, accordingly,

P{doc3 = cat2} = (0.18^2 * 0.5) / const = 0.0162

The obvious conclusion is that document 3 most likely belongs to category 1. The same simple computations for document 4 yield

P{doc4 = cat1} = (0.50 * 0.11^4 * 0.20) / const = 0.000014641

P{doc4 = cat2} = (0.50 * 0.89^4 * 0.80) / const = 0.2510
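
In code, the score of a document against a category is the product of the per-word probabilities raised to the word counts, with the constant denominator dropped. A sketch under the same assumptions as above:

    using System;

    double[][] p =
    {
        new[] { 0.82, 0.75, 0.50, 0.11, 0.20 },   // cat1
        new[] { 0.18, 0.25, 0.50, 0.89, 0.80 }    // cat2
    };

    // Product of p^count over all words; p^0 = 1, so absent words drop out.
    double Score(int cat, int[] doc)
    {
        double score = 1.0;
        for (int w = 0; w < doc.Length; w++)
            score *= Math.Pow(p[cat][w], doc[w]);
        return score;   // the constant denominator is ignored
    }

    Console.WriteLine(Score(0, new[] { 2, 0, 1, 0, 0 }));   // doc3, cat1: 0.3362
    Console.WriteLine(Score(1, new[] { 2, 0, 1, 0, 0 }));   // doc3, cat2: 0.0162
    Console.WriteLine(Score(0, new[] { 0, 0, 1, 4, 1 }));   // doc4, cat1: 0.000014641
    Console.WriteLine(Score(1, new[] { 0, 0, 1, 4, 1 }));   // doc4, cat2: 0.2510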

Since a real-case scenario presumes thousands of words, computing everything as products of probabilities is not numerically convenient: a product of thousands of small factors quickly underflows. For that reason people usually use a logarithmic scale:

log(P{doc4 = cat1}) = log(0.50) + 4 * log(0.11) + log(0.20) = -4.83

log(P{doc4 = cat2}) = log(0.50) + 4 * log(0.89) + log(0.80) = -0.60
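
The logarithmic version replaces the product with a sum. A sketch using base-10 logarithms, which reproduces the numbers above:

    using System;

    double[][] p =
    {
        new[] { 0.82, 0.75, 0.50, 0.11, 0.20 },   // cat1
        new[] { 0.18, 0.25, 0.50, 0.89, 0.80 }    // cat2
    };

    // Sum of count * log10(p) over the words present in the document.
    double LogScore(int cat, int[] doc)
    {
        double score = 0.0;
        for (int w = 0; w < doc.Length; w++)
            if (doc[w] > 0)
                score += doc[w] * Math.Log10(p[cat][w]);
        return score;
    }

    Console.WriteLine(LogScore(0, new[] { 0, 0, 1, 4, 1 }));   // doc4, cat1: -4.83
    Console.WriteLine(LogScore(1, new[] { 0, 0, 1, 4, 1 }));   // doc4, cat2: -0.60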

I added one more correction, for the size of the file: when a file is large it contains a significantly larger number of words, and the probabilities become skewed.
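
One plausible form of such a correction (an assumption on my part; the project may do it differently) is to average the log-score over the total number of word occurrences in the file:

    // Hypothetical length correction (an assumption, not necessarily the
    // project's formula): average the log-score over the word occurrences,
    // so long files are not penalized merely for having more words.
    double NormalizedLogScore(double logScore, int[] doc)
    {
        int length = 0;
        foreach (int count in doc)
            length += count;
        return length > 0 ? logScore / length : logScore;
    }

    System.Console.WriteLine(NormalizedLogScore(-4.83, new[] { 0, 0, 1, 4, 1 }));   // -4.83 / 6 ≈ -0.81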

The experimental programming part was not very challenging. As expected, the result depended on the training set and varied over a very wide range, from 30 percent of correct recognition to almost 90 percent. Other methods were not that sensitive to the choice of training set. NB is a sort of lottery: it can be better than anything else if you are lucky. For that reason it is better to use it as a supplementary improvement tool, for example after HAC, using already categorized files.

In my experiments with PATENTCORPUS128 I found a training sample of 12 files out of 128 that provided the highest accuracy, near 93%. Unfortunately, taking a random training sample of the same size yields accuracy between 60% and 80%.