Intelligent Spam Filtering

Jeremy Buchmann

Introduction

Most people can instantly recognize spam when it comes into their mailbox. It's annoying and sometimes offensive. Getting a computer to recognize spam is an increasingly popular goal, and hopefully one that will result in spam filters that are easy to use, free, and very effective. If spam filters become prevalent enough and accurate enough, spam will become an ineffective marketing tool.

Quick Index:

How Spam Filtering Works
What I'm Doing
Links - Links to publications and other intelligent anti-spam resources.

How Spam Filtering Works

Naive Bayes

Content Coming Soon

Neural Networks

Content Coming Soon

What I'm Doing

Spam Feature Selection Using a Genetic Algorithm

Word-based spam filters generally either use every word in an email for classification or apply a crude set of rules to obtain a smaller set of words. For example, a filter may use a rule such as "Words are unbroken strings of letters greater than 3 characters and less than 15 characters." Rules like this are usually good enough, but they miss some helpful words. This rule, for instance, would eliminate the string "HGH" because it has only three characters. HGH (Human Growth Hormone) is a common term in the spam I receive, and by eliminating it from classification, my filter would lose a lot of classification power. In CS790K this semester, I am working on using a genetic algorithm to determine which words are important (or, conversely, which words are not important) for classification. I expect this technique to improve the performance of word-based classifiers, but not by a tremendous amount.
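To make the idea concrete, here is a minimal sketch in Python. It is not the code used for this project. It first shows how the length rule quoted above throws away "HGH", then runs a small genetic algorithm over bit masks, where bit i says whether word i of the vocabulary is kept. The toy messages, the simple word-vote scorer that stands in for a real classifier, the per-word penalty, and all names and parameters (rule_words, fitness, evolve, population size, mutation rate) are assumptions made up for illustration.

```python
import random
import re

def all_words(text):
    """Every unbroken run of letters, lowercased (no length rule)."""
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]

def rule_words(text):
    """The length rule from the text: more than 3 and fewer than 15 letters."""
    return [w for w in all_words(text) if 3 < len(w) < 15]

# Toy labelled messages, invented for this sketch; True means spam.
MESSAGES = [
    ("buy HGH now", True),
    ("HGH pills for you", True),
    ("cheap HGH for sale", True),
    ("see you at the lab now", False),
    ("notes for the lab", False),
    ("are you at home", False),
]
VOCAB = sorted({w for text, _ in MESSAGES for w in all_words(text)})

# Stand-in classifier: each word votes +1 per spam message and -1 per
# non-spam message it appears in.
WEIGHT = {w: 0 for w in VOCAB}
for text, is_spam in MESSAGES:
    for w in set(all_words(text)):
        WEIGHT[w] += 1 if is_spam else -1

def fitness(mask):
    """Accuracy using only the kept words, minus a small cost per kept word."""
    kept = {w for w, bit in zip(VOCAB, mask) if bit}
    correct = 0
    for text, is_spam in MESSAGES:
        score = sum(WEIGHT[w] for w in set(all_words(text)) if w in kept)
        correct += (score > 0) == is_spam
    return correct / len(MESSAGES) - 0.01 * len(kept)

def evolve(generations=40, pop_size=20, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    n = len(VOCAB)
    # Start from the "use every word" baseline plus random masks.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]      # truncation selection
        children = [pop[0][:]]              # elitism: best mask survives as-is
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)       # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]      # bit-flip mutation
            children.append(child)
        pop = children
    return max(pop, key=fitness)

print(rule_words("cheap HGH for sale"))     # the rule drops "hgh" and "for"
best = evolve()
print([w for w, bit in zip(VOCAB, best) if bit], round(fitness(best), 2))
```

Because the all-words mask is in the starting population and the best mask is always carried forward, the result can never score worse than the use-everything baseline under this fitness function. A real experiment would measure fitness on held-out email rather than on the training messages.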

Here is my paper which describes the results I obtained (gzipped PostScript).

Progress:
12/13/2002 - I'm done with my experiments and the paper is complete. Read it to learn the results.
11/25/2002 - I've been working on choosing a smaller subset of emails to use to train the neural net to get the training time down. I've got it down to 6 minutes (it was about 40).
11/18/2002 - I wrote the neural net trainer and recognizer programs. It's taking a long time to train, which worries me.
11/13/2002 - I have written the Perl/shell scripts to convert an email file into an ordered wordlist and individual email files.
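
The conversion step in the 11/13 entry was done with Perl and shell scripts that are not reproduced here. As a rough illustration only, the Python sketch below shows one way such a pipeline could look. It assumes the email file is in mbox format (each message begins with a line starting with "From "), that "ordered" means sorted by how many messages contain a word, and that each email is then turned into a 0/1 vector suitable as classifier input. All function names and the sample text are hypothetical.

```python
import re
from collections import Counter

def split_mbox(text):
    """Split mbox-style text into message bodies, dropping the 'From ' lines."""
    messages = []
    for line in text.splitlines():
        if line.startswith("From "):
            messages.append([])             # separator line starts a new message
        elif messages:
            messages[-1].append(line)
    return ["\n".join(lines).strip() for lines in messages]

def words(message):
    """Lowercased runs of letters in one message."""
    return re.findall(r"[a-z]+", message.lower())

def ordered_wordlist(messages):
    """Words ordered by number of messages containing them, ties alphabetical."""
    counts = Counter(w for m in messages for w in set(words(m)))
    return sorted(counts, key=lambda w: (-counts[w], w))

def to_vector(message, wordlist):
    """1 if the word occurs in the message, else 0, in wordlist order."""
    present = set(words(message))
    return [1 if w in present else 0 for w in wordlist]

# Invented sample input for the sketch.
SAMPLE = """From alice@example.com Mon Nov 11 2002
lunch today

From bob@example.com Tue Nov 12 2002
cheap pills today
"""

emails = split_mbox(SAMPLE)
wordlist = ordered_wordlist(emails)
for email in emails:
    print(to_vector(email, wordlist), email)
```

Keeping the wordlist in a fixed order matters because position i of every vector must always refer to the same word.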

Spam Image Detection

Click here to get my paper on this technique.

As more and more email programs are released with anti-spam features, spammers are trying to find ways around them. One common technique is to send the whole email as one or more image files, with the text they want you to read as part of the image. In CS791Y this semester, I am working on a way to determine whether images attached to emails can be classified as spam or non-spam. Ideally, my program would be able to recognize the difference between images sent in spam and images sent for legitimate purposes, e.g. photos sent by friends and family. I will be using a PCA-like approach (PCA, principal component analysis, is commonly used in computer vision). My tentative hypothesis is that it is possible to classify spam images, but that the classifier may also flag some non-spam images as spam.

To illustrate what I mean, look at the following images:

All of these images have been pulled from various spam emails I have received. It's obvious, even without the rest of the email, that these images come from spam.

Now look at these images:

These are images I have received from family or friends (I have scaled them down to save space). They are all photographs and are distinctly different from the spam images. I am hoping that my program will be able to make the distinction between these two types of images as quickly and easily as humans can.
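
The details of the PCA-like approach are in the paper, not on this page. As a hedged illustration of the general idea only, the Python sketch below finds the leading principal component of a set of training images by power iteration, projects each image onto it, and labels a new image by the nearest class average in that projected space. The four-pixel "images", the labels, and the function names are invented for the example, and real images would need many more pixels and more than one component.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_component(centered, iterations=50):
    """Leading principal direction via power iteration on the covariance."""
    # Arbitrary start vector; assumed not orthogonal to the true direction.
    v = [1.0] * len(centered[0])
    for _ in range(iterations):
        # Covariance times v, computed as sum_i (x_i . v) x_i
        # without building the covariance matrix.
        w = [0.0] * len(v)
        for x in centered:
            s = dot(x, v)
            w = [wi + s * xi for wi, xi in zip(w, x)]
        norm = math.sqrt(dot(w, w))
        v = [wi / norm for wi in w]
    return v

def train(images, labels):
    d = len(images[0])
    mean = [sum(img[j] for img in images) / len(images) for j in range(d)]
    centered = [[p - m for p, m in zip(img, mean)] for img in images]
    axis = top_component(centered)
    centroids = {}
    for label in set(labels):
        proj = [dot(x, axis) for x, l in zip(centered, labels) if l == label]
        centroids[label] = sum(proj) / len(proj)   # average projection per class
    return mean, axis, centroids

def classify(image, model):
    mean, axis, centroids = model
    z = dot([p - m for p, m in zip(image, mean)], axis)
    return min(centroids, key=lambda label: abs(centroids[label] - z))

# Toy grayscale "images" as flat pixel lists (invented data):
# high-contrast graphics versus smoother photographs.
IMAGES = [
    [0, 255, 0, 255],
    [10, 250, 5, 245],
    [100, 120, 110, 130],
    [90, 110, 115, 125],
]
LABELS = ["spam", "spam", "photo", "photo"]

model = train(IMAGES, LABELS)
print(classify([5, 240, 10, 250], model))
print(classify([105, 115, 120, 125], model))
```

The sign of the principal direction is arbitrary, which is why the classifier compares distances to the class averages instead of testing whether a projection is positive or negative.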