Sunday, August 23, 2009

Naive Bayes for text classification

Another article on Naive Bayes classification, aka NBC? Why does the web need another one? Yes, there are many. Just use your favorite browser and search on "naive bayes" and you'll get hundreds of URLs. However, once you start digging into them, you will find two annoying trends: they simply reprint the Naive Bayes definition, and they give no specifics on how that definition is actually used to classify a particular type of data.



Originally, I learned about Naive Bayes in my AI class, which was taught by Dr. Christopher Brooks at the University of San Francisco back in 2005. Since then, the computer that held my information died and I've been unable to retrieve my NBC program. After a long time, I finally found good information on Naive Bayes and how to use it for classifying text. First, I'll give the basic definitions of Bayes and Naive Bayes classification, which I'm simply re-summarizing from the textbook "Artificial Intelligence" by Russell and Norvig. Then, I'll talk specifically about applying Naive Bayes to text classification; this information I found in slides for a course called Comp221, written by the course's TA, Zhang Kai.

Naive Bayes

Like the regular Bayes rule, Naive Bayes simplifies the calculation of a conditional probability. It goes one step further by assuming that the effects are independent of one another given the class. Even though this assumption is rarely true in practice, it has been found to yield acceptable behavior.


P(Class|Effect1, ..., Effectn) = alpha * P(Class) * P(Effect1|Class) * ... * P(Effectn|Class), where alpha is a normalizing constant (it can be ignored when you only want the most likely class)
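As a quick illustration, here is that product computed in Python with made-up probabilities for two classes and two effects; none of these numbers come from real data, they are just there to show the arithmetic.

p_class = {"spam": 0.5, "ham": 0.5}         # P(Class)
p_effect1 = {"spam": 0.20, "ham": 0.01}     # P(Effect1|Class)
p_effect2 = {"spam": 0.05, "ham": 0.10}     # P(Effect2|Class)

for c in p_class:
    score = p_class[c] * p_effect1[c] * p_effect2[c]
    print(c, score)   # spam 0.005, ham 0.0005 -> spam is the more likely class

Note that the normalizing constant is dropped here; since it is the same for every class, it does not change which class wins.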


Supervised Naive Bayes for Text Classification

The definition of Naive Bayes is easy to understand, but it lacks the details one needs to build a real application. I will fill in those details here (thanks Dr. Brooks and Zhang Kai!).

(1) Start with a corpus and calculate P(Ci)
A corpus is a collection of data that will be used to train the Naive Bayes classifier. It should contain a large number of items, on the order of 1000 documents. The corpus should be split into the different classes, with each class occurring in roughly the proportion one expects the documents to occur in real life. Ci is a class in C. P(Ci) = nc / n, where nc is the number of corpus documents that belong to Ci and n is the total number of documents in the corpus.
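Here is a minimal sketch of this step in Python. The toy corpus, the variable names, and the 50/50 class split are mine for illustration only; a real corpus would be on the order of 1000 documents as mentioned above.

from collections import Counter

corpus = [
    ("cheap meds buy now", "spam"),
    ("meeting moved to friday", "ham"),
    ("win a free prize now", "spam"),
    ("lunch with the team", "ham"),
]

# P(Ci) = nc / n: documents labeled Ci over total documents.
doc_counts = Counter(label for _, label in corpus)
n = len(corpus)
priors = {ci: nc / n for ci, nc in doc_counts.items()}
print(priors)   # {'spam': 0.5, 'ham': 0.5}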

(2) For each class, calculate P(word|class)
For each class, there will be a collection of words associated with that class. One must calculate the probability that a given word occurs in a particular class; the counts below define it, and a sketch in code follows at the end of this step.

ni = total number of words across all documents in Ci
wi = a word associated with Ci
wij = number of times wi occurs across all Ci documents
P(wi|Ci) = wij / ni

For each class, if a word from the corpus never occurs in the documents of class Ci, its conditional probability P(wi|Ci) comes out as zero, and a single zero wipes out the entire product. For a conditional probability that would be zero, assign it the value eta/ni instead, where eta is a small tunable constant.

For each class, choose the highest-frequency words as the words used to classify a document. Ideally, each chosen word would occur in every Ci in C.
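A sketch of this step, continuing with the toy corpus and priors from step (1). The whitespace tokenizer, the eta value, and the top_n cutoff are illustrative choices of mine, not something prescribed by the original slides.

from collections import Counter, defaultdict

eta = 1.0      # tunable constant for the eta/ni fallback
top_n = 1000   # how many top-frequency words to keep per class

word_counts = defaultdict(Counter)   # class -> word -> wij
total_words = Counter()              # class -> ni (total words in Ci documents)

for text, label in corpus:
    words = text.split()             # naive whitespace tokenizer
    word_counts[label].update(words)
    total_words[label] += len(words)

def p_word_given_class(word, ci):
    """P(wi|Ci) = wij / ni, falling back to eta/ni when wij is zero."""
    ni = total_words[ci]
    wij = word_counts[ci][word]
    return (wij / ni) if wij > 0 else (eta / ni)

# Keep only the top_n most frequent words per class as the classifying words.
vocabulary = {ci: {w for w, _ in counts.most_common(top_n)}
              for ci, counts in word_counts.items()}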

(3) After (1) and (2) have been performed, the NBC has been trained. Let d be a new document that is unclassified.

Take document d and find all of its words that also occur in the training corpus.

For each Ci, calculate P(Ci|Effects): take P(Ci) and multiply it by P(wi|Ci) for every word wi found in document d.

The Ci with the largest P(Ci|Effects) is the matched class for document d.
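Finally, a sketch of this step, classifying a new document d with the priors, counts, and vocabulary built in the sketches above. Summing log-probabilities instead of multiplying raw probabilities is my own implementation choice to avoid numerical underflow; it picks the same winning class.

import math

def classify(d):
    # Keep only the words of d that appear in the training vocabulary.
    words = [w for w in d.split()
             if any(w in vocabulary[ci] for ci in vocabulary)]
    scores = {}
    for ci in priors:
        # log P(Ci) + sum of log P(wi|Ci) for every kept word.
        score = math.log(priors[ci])
        for w in words:
            score += math.log(p_word_given_class(w, ci))
        scores[ci] = score
    # The class with the largest score is the matched class for d.
    return max(scores, key=scores.get)

print(classify("win cheap meds now"))   # 'spam' with the toy corpus above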
