Bayesian Filtering 101
It is easy to get caught up in the day-to-day acronyms, features, and usage of the various tech tools we use every day. Sometimes a solution has so many components that we look at it as a whole, rather than learning every single part. An example of this is the Bayesian filter component of our SpamaGator E-mail Gateway appliance. Because it is just one piece of the anti-spam feature set, it is easier to explain what it does than how it does it!
So, I spent some time getting versed in how Bayesian filtering works. Rev. Thomas Bayes studied probability theory. After his death, his friend Richard Price presented his work, An Essay towards solving a Problem in the Doctrine of Chances, which appeared in print in 1764. Simply put, the theory says that the probability of a hypothesis, given some evidence, is equal to the prior probability of the hypothesis multiplied by the likelihood of the evidence under that hypothesis, divided by the overall probability of the evidence. I won’t go into the mathematical formulas behind all of this, but if that’s your thing, they’re easy to find.
The interesting thing I found was that Bayes’ theorem can be applied to many disciplines across science and mathematics. So how does this apply to our fight against junk e-mail? It works like this: a known spam word such as Viagra is generally part of spam e-mail, and generally not in legitimate e-mail. The Bayesian filter doesn’t know this initially, and must be trained. As the user trains the filter, they indicate whether each e-mail is spam or not. This works because valid e-mail is generally quite different from spam (for example, a spouse’s name may appear in good e-mail but rarely in spam). After training, the filter has learned the probabilities that certain words, and combinations of words, should be categorized as spam or good e-mail.
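To make the training step a little more concrete, here is a minimal Python sketch of how a filter might learn per-word spam probabilities from messages a user has labeled. It is only an illustration under stated assumptions: the class name TinyBayesTrainer, the crude whitespace tokenizer, the add-one smoothing, and the four sample messages are all made up for this example, and none of it is taken from the appliance itself.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class TinyBayesTrainer:
    """Hypothetical toy trainer: counts how many spam/ham messages contain each word."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.spam_messages = 0
        self.ham_messages = 0

    def train(self, text, is_spam):
        # Count each distinct word once per message.
        words = set(tokenize(text))
        if is_spam:
            self.spam_counts.update(words)
            self.spam_messages += 1
        else:
            self.ham_counts.update(words)
            self.ham_messages += 1

    def word_spam_probability(self, word):
        # Share of spam (and ham) messages containing the word, with add-one
        # smoothing so unseen words do not produce a probability of exactly 0 or 1.
        p_word_given_spam = (self.spam_counts[word] + 1) / (self.spam_messages + 2)
        p_word_given_ham = (self.ham_counts[word] + 1) / (self.ham_messages + 2)
        # Bayes' theorem, assuming spam and ham are equally likely beforehand.
        return p_word_given_spam / (p_word_given_spam + p_word_given_ham)


# Made-up training messages, labeled the way a user would label them.
trainer = TinyBayesTrainer()
trainer.train("cheap viagra online", is_spam=True)
trainer.train("buy viagra now", is_spam=True)
trainer.train("lunch with alice tomorrow", is_spam=False)
trainer.train("quarterly report from alice", is_spam=False)

for word in ("viagra", "alice", "hello"):
    print(word, round(trainer.word_spam_probability(word), 2))
```

With this toy data, a word seen only in spam ends up with a high spam probability, a word seen only in good e-mail ends up with a low one, and a word the filter has never seen sits in the middle.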
Because the filter learns, it recognizes that good e-mail contains things such as names, places, or other common words that a user would normally receive, say, in the conduct of business. So even if the word Viagra were to appear in the body of an e-mail, the probability that the message is flagged as spam would be lower if the message also contains more good e-mail indicators.
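Here is a small Python sketch of how that weighing might work. It uses a combination rule commonly seen in naive Bayesian spam filters: multiply the per-word spam probabilities together and compare the result with the product of their complements. The function name and the per-word probabilities below are hypothetical values chosen for illustration, and this is not necessarily the exact rule our appliance uses.

```python
import math


def combined_spam_probability(word_probabilities):
    """Combine per-word spam probabilities into one score for the whole message.

    Assumes every probability is strictly between 0 and 1 and that the words
    are treated as independent pieces of evidence (the "naive" assumption).
    """
    spam_evidence = math.prod(word_probabilities)
    ham_evidence = math.prod(1.0 - p for p in word_probabilities)
    return spam_evidence / (spam_evidence + ham_evidence)


# Hypothetical probabilities a trained filter might have learned.
learned = {"viagra": 0.95, "alice": 0.05, "invoice": 0.10, "meeting": 0.10}

# A message containing only the spam word.
spam_like = combined_spam_probability([learned["viagra"]])

# A business message that happens to mention the same word.
business_words = ("viagra", "alice", "invoice", "meeting")
business_like = combined_spam_probability([learned[w] for w in business_words])

print(f"spam word alone:          {spam_like:.3f}")
print(f"spam word in business mail: {business_like:.3f}")
```

The single spam word scores high on its own, but the same word surrounded by words that usually show up in good e-mail produces a much lower overall score.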
Spammers are aware of how the filter works and commonly use a technique called “Poisoning” to trip up, and potentially damage, the Bayesian database. It works by surrounding a spam word with groups of non-spam words. Because the non-spam words lower the overall spam probability, the filter may be fooled into treating the e-mail as legitimate. Two techniques are used to prevent Poisoning.
The first technique is to employ a No Processing list. This type of list contains criteria that will prevent the blocking tools from running during an SMTP session, provided all of the criteria are met.
The other technique is called Redlisting. A redlist prevents additions to your ham/spam or whitelist databases during an SMTP session.
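To see why protecting the database matters, here is a toy Python sketch of how poisoned messages could damage what a filter has learned. If spam stuffed with innocent words is recorded as spam, those innocent words start to look spammy. The database layout, function names, smoothing, and sample messages are all assumptions made for this illustration. It does not show how the No Processing list or Redlisting are implemented.

```python
from collections import Counter


def new_database():
    """Hypothetical in-memory word database: per-label counts and message totals."""
    return {"spam": Counter(), "ham": Counter(), "spam_total": 0, "ham_total": 0}


def train(db, text, label):
    """Record each distinct word of a message under the label 'spam' or 'ham'."""
    db[label].update(set(text.lower().split()))
    db[label + "_total"] += 1


def word_spam_probability(db, word):
    """Estimate how spammy a word looks, using add-one smoothing and equal priors."""
    p_spam = (db[word_label("spam")][word] + 1) / (db["spam_total"] + 2)
    p_ham = (db[word_label("ham")][word] + 1) / (db["ham_total"] + 2)
    return p_spam / (p_spam + p_ham)


def word_label(label):
    """Return the key under which word counts for a label are stored."""
    return label


# A clean, made-up training history.
db = new_database()
train(db, "cheap viagra online", "spam")
train(db, "buy viagra now", "spam")
train(db, "lunch with alice tomorrow", "ham")
train(db, "quarterly report from alice", "ham")
before = word_spam_probability(db, "alice")

# Poisoned spam: the spam word is padded with words from legitimate mail,
# and each message ends up recorded in the database as spam.
for _ in range(3):
    train(db, "viagra alice report lunch", "spam")
after = word_spam_probability(db, "alice")

print(f"'alice' before poisoning: {before:.2f}")
print(f"'alice' after poisoning:  {after:.2f}")
```

In this sketch, a name that used to point clearly toward good e-mail drifts toward the spam side after only a few poisoned messages, which is the kind of damage that controlling additions to the database is meant to limit.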
Well, after this drawn-out explanation, what is truly important to our customers is that the filter is easy to set up and that it works well. We have found that the SpamAssassin component of our appliance does a great job and is easily trained. Bayesian filtering is a key filter and, for many, one of the most mysterious.