
Outlook 2003 comes with an integrated junk-mail filter that helps block spam and other junk. I've spread my email addresses too widely over newsgroups, forums, blogs, etc., and I receive between 200 and 300 emails a day. Most of them are spam, and almost all of that spam gets past the Outlook 2003 junk-mail filter.

For this reason I started looking into mail filtering to detect spam. I came across an article by Paul Graham called "A Plan for Spam", written in the early 2000s, that describes Bayesian algorithms for spam detection. A Bayesian filter uses what it already knows (mail previously classified as spam or not spam) to assign a probability to each new item that reaches it. I've created my own implementation of this kind of algorithm and optimized lots of things to improve the spam detection. Finally, I created an Outlook 2003 add-in and plugged everything in to test it with my incoming mail. The results were amazing once my filter was trained: almost 99.5% of my spam is stopped now.
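
To make the idea concrete, here is a minimal sketch of Graham-style scoring. It is written in Python for brevity (the add-in itself is C#) and it is not Matador's actual code: the function names are hypothetical, and the constants (doubling the good counts, ignoring tokens seen fewer than 5 times, clamping to 0.01 and 0.99, 0.4 for unknown words, the 15 most telling tokens) follow Graham's essay rather than my own optimizations.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def token_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token) from the two corpora (constants per Graham's essay)."""
    bad = spam_counts.get(token, 0)
    good = 2 * ham_counts.get(token, 0)  # good counts are doubled to avoid false positives
    if bad + good < 5:
        return 0.4  # too rare to judge: treat as mildly innocent
    spam_freq = min(1.0, bad / n_spam)
    ham_freq = min(1.0, good / n_ham)
    p = spam_freq / (spam_freq + ham_freq)
    return max(0.01, min(0.99, p))  # never fully certain either way


def spam_score(text, spam_counts, ham_counts, n_spam, n_ham, top_n=15):
    """Combine the most telling token probabilities into one message score."""
    probs = [
        token_probability(t, spam_counts, ham_counts, n_spam, n_ham)
        for t in set(tokenize(text))
    ]
    # Keep the tokens whose probability is farthest from a neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    probs = probs[:top_n]
    prod_spam = math.prod(probs)
    prod_ham = math.prod(1.0 - p for p in probs)
    return prod_spam / (prod_spam + prod_ham)


# Hypothetical token counts, as if collected from 20 spam and 20 valid messages.
spam_counts = Counter({"viagra": 20, "free": 15})
ham_counts = Counter({"meeting": 20, "free": 2})

print(spam_score("Free viagra", spam_counts, ham_counts, 20, 20))
print(spam_score("Meeting tomorrow", spam_counts, ham_counts, 20, 20))
```

With these made-up counts the first message scores close to 1 and the second close to 0. A filter would then treat anything above some cutoff (Graham suggests 0.9) as spam.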

Of course, I want to share this piece of code with the whole community to help stop spam for everyone. It's totally free, and it's fully developed in C#.

Here is what my solution includes:

  • Bayesian mail filtering for Outlook 2003 with a self-learning mechanism
  • Creation of the spam corpus from the Junk Mail folder and the no-spam corpus from the Inbox folder
  • Check of every incoming message through the filter
  • Prebuilt spam corpus with the most common spam words and the most common messages
  • Statistics of spam received and stopped by the filter
  • Actions on the Outlook 2003 toolbar to mark already-processed messages as spam or no-spam
  • Cleaning of folders through the filter, which lets you process a full folder and clean it of spam
  • Auto-update of the filter over the web
  • Logging of every action performed and every exception raised by the filter
  • Every detected spam is moved to the Outlook Junk Mail folder; no mail is deleted, just moved
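
The folder cleaning and the "moved, never deleted" rule can be pictured with a small sketch. Again this is Python rather than the add-in's C#, it does not touch the Outlook object model, and everything in it is an assumption for illustration: folders are plain lists, and `clean_folder` and `is_spam` are hypothetical names.

```python
def clean_folder(folder, junk, is_spam):
    """Move every message flagged by is_spam from folder to junk; delete nothing."""
    kept = []
    moved = 0
    for message in folder:
        if is_spam(message):
            junk.append(message)  # moved to Junk Mail, still recoverable
            moved += 1
        else:
            kept.append(message)
    checked = len(folder)
    folder[:] = kept  # update the folder in place
    return {"checked": checked, "moved": moved}


# Toy folders and a toy classifier standing in for the Bayesian filter.
inbox = ["hi bob", "cheap pills", "lunch?"]
junk_mail = []
stats = clean_folder(inbox, junk_mail, lambda m: "pills" in m)
print(stats, inbox, junk_mail)
```

The returned counters are the kind of figures the statistics feature would accumulate.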

It requires Outlook 2003 and .NET Framework 1.1 (I'm adapting the filter to work in Outlook 2000 and 2002). This is a beta version of the filter, and more features are being added at this time. I'm open to your suggestions: tell me if you find it interesting or useful, if you find any bugs, or if there is anything else you would like to comment on.

Some Tips:

  • The filter is based on analyzing the probabilities in your own email (because everyone's email is different), so it's best to fill both the spam and no-spam corpus with your own mail (done automatically by the add-in), so please say Yes when the filter asks you to do so
  • If your Junk Mail folder already holds a lot of the spam you usually receive, delete the spam file that comes with the filter before running Outlook and importing yours; this will improve your spam detection. The file is called corpus1.dat and is located in the Data subfolder of the filter's main folder (by default c:\program files\matador for outlook)
  • Once your databases are populated, keep training the filter: the more training you provide, the more accurate it will be. To train the filter, use the Is Spam and Not Spam buttons whenever it makes an error (spam not moved to Junk Mail, or valid mail moved to Junk Mail)
  • Deactivate the Outlook Junk Mail filter from Tools -> Junk Mail -> Options and let Matador Spam Fighter process all incoming mail
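
As a rough picture of what the Is Spam and Not Spam buttons do to the corpus, here is a hedged Python sketch (not the add-in's C# code; the `Corpus` class and its method names are hypothetical). Correcting a mistake takes the message's words out of the corpus they were wrongly learned into and adds them to the other one.

```python
from collections import Counter


class Corpus:
    """Token and message counts for the spam and no-spam ("ham") sides."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def learn(self, words, label):
        self.tokens[label].update(words)
        self.messages[label] += 1

    def unlearn(self, words, label):
        self.tokens[label].subtract(words)
        # Unary plus drops tokens whose count fell to zero or below.
        self.tokens[label] = +self.tokens[label]
        self.messages[label] = max(0, self.messages[label] - 1)

    def correct(self, words, wrong_label):
        """Undo a wrong classification and relearn the message on the other side."""
        right_label = "ham" if wrong_label == "spam" else "spam"
        self.unlearn(words, wrong_label)
        self.learn(words, right_label)


corpus = Corpus()
words = ["cheap", "pills"]
corpus.learn(words, "ham")      # the filter let a spam message through
corpus.correct(words, "ham")    # the user clicks Is Spam
print(corpus.tokens, corpus.messages)
```

Each correction shifts the per-word probabilities a little, which is why the filter keeps improving the more it is trained.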

 

Ok... The filter is called Matador Spam Fighter and you can download it here ... Test it and give me your feedback.

posted on Thursday, September 16, 2004 1:24 PM