Wednesday, August 26, 2009

Naive Bayes Classifiers in SpamAssassin

Spam classifiers like SpamAssassin are widely used to split email into ham and spam. How well would SpamAssassin's naive Bayes classifier (nbc) perform if there were more than 2 categories? I have an idea for using a spam filter's nbc to sort emails into more than 2 categories.
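To make the idea concrete, here is a minimal sketch of a naive Bayes classifier that handles any number of categories. It is not SpamAssassin's implementation; the class name, the whitespace tokenizer, and the add-one smoothing are my own assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Toy multinomial naive Bayes over whitespace-separated tokens (illustrative only)."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # token counts per category
        self.vocab = set()                       # every token seen in training

    def train(self, category, text):
        tokens = text.lower().split()
        self.doc_counts[category] += 1
        self.word_counts[category].update(tokens)
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            # Log prior for the category.
            score = math.log(n_docs / total_docs)
            # Add-one (Laplace) smoothing so unseen tokens do not zero out the score.
            denom = sum(self.word_counts[category].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[category][tok] + 1) / denom)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


nb = NaiveBayes()
nb.train("work", "meeting agenda project deadline")
nb.train("family", "dinner birthday weekend kids")
nb.train("spam", "cheap pills winner prize")
print(nb.classify("project meeting tomorrow"))
```

Nothing in the scoring loop depends on there being exactly two categories, which is what makes me think a spam filter's nbc could be stretched to more.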

To start this investigation, I've decided to look at some OSS spam filters based on a Naive Bayes Classifier, beginning with SpamAssassin. For the purposes of this experiment, I will be using SpamAssassin as a command-line tool.

spamassassin is a Perl front-end that classifies an email stored in a text file, one email per file.

sa-learn is a tool in the SpamAssassin suite that trains the nbc.
sa-learn --ham /path/to/directory/containing/ham loads the nbc with ham.
sa-learn --spam /path/to/directory/containing/spam loads the nbc with spam.
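For scripting the training step, the two sa-learn commands above could be wrapped roughly like this. It is a sketch that assumes SpamAssassin is installed and sa-learn is on the PATH; the helper names build_sa_learn_command and train are hypothetical, and only the --ham and --spam flags shown above are used.

```python
import subprocess


def build_sa_learn_command(kind, directory):
    """Build the sa-learn command line for a directory of ham or spam."""
    if kind not in ("ham", "spam"):
        raise ValueError("kind must be 'ham' or 'spam'")
    return ["sa-learn", f"--{kind}", directory]


def train(kind, directory):
    """Run sa-learn on the directory; raises CalledProcessError if it fails."""
    # Assumes sa-learn is installed and reachable on PATH.
    return subprocess.run(build_sa_learn_command(kind, directory), check=True)


print(build_sa_learn_command("ham", "/path/to/directory/containing/ham"))
```

Calling train("ham", ...) and then train("spam", ...) would repeat the two manual commands in order.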

So far I've only acquired a corpus of a few thousand ham and spam emails. For what I need, I would like a corpus of up to a million documents that could be split into about 9 categories, so I'm still looking for a much larger one.

I've also noticed that nbc's that process text appear to place no restriction on email size. By comparison, nbc's used with images require all the images in the corpus to be the same size. I wonder if this is really necessary. I will have to check that out with the face recognizer work currently in progress in my OpenCV project.
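One possible explanation, sketched below under my own assumptions: a text nbc typically counts tokens against a vocabulary, so an email of any length reduces to counts over the same set of words. A per-pixel image classifier would instead have one feature per pixel, which ties the feature count to the image size. The function name to_count_vector and the tiny vocabulary are made up for illustration.

```python
from collections import Counter


def to_count_vector(text, vocabulary):
    """Map text of any length to one count per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = ["meeting", "prize", "dinner"]  # hypothetical fixed vocabulary

short_email = "Meeting today"
long_email = "Dinner tonight then a meeting and another meeting about the prize " * 50

short_vec = to_count_vector(short_email, vocabulary)
long_vec = to_count_vector(long_email, vocabulary)

# Both vectors have len(vocabulary) entries, regardless of email length.
print(len(short_vec), len(long_vec))
```

Whether images really need a fixed size then depends on how the image features are extracted, which is the part I still have to check.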
