Published on 2005-09-07 05:13:46
Bayes' Theorem rests on the definition of conditional probability:
The probability of H conditional on E is defined as P_E(H) = P(H & E)/P(E), provided that both terms of this ratio exist and P(E) > 0.
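Applying this definition a second time with the roles of H and E swapped, and eliminating P(H & E), gives the form of the theorem that spam filters usually rely on. As an illustrative reading (my assumption, not something stated above), take H as "this post is a splog" and E as "this post contains a given keyword":

```latex
P_E(H) = \frac{P(H \,\&\, E)}{P(E)}, \qquad
P_H(E) = \frac{P(H \,\&\, E)}{P(H)}
\quad\Longrightarrow\quad
P_E(H) = \frac{P_H(E)\, P(H)}{P(E)}, \qquad P(E) > 0,\; P(H) > 0 .
```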
Why Bayesian filters can't stop splogs
The first thing I noticed is that it is very easy to bypass a Bayesian filter, and at the same time very easy for normal content to be flagged as spam. For example, a splog like this one, http://mydordrecht.blogspot.com/, will increase the spam score of the keywords Real Estate, Austin, Texas, Apartment, ... These are common keywords, so any legitimate blog that uses them could easily be tagged as a splog.
Escaping a Bayesian filter is just as easy. Look at this blog, http://web-traffic-tips.blogspot.com/: all the content looks carefully chosen, and it is about web traffic tips, so you can guess why!
You can also try this with the many blogging packages that offer Bayesian antispam solutions for comments, and you will see how many problems turn up.
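Both failure modes described above can be reproduced with a toy filter. The following is only a minimal sketch, assuming a naive Bayes scorer with add-one smoothing and equal priors; the training texts, the function names and the sample posts are all invented for illustration and are not the filter or the data used in the experiment.

```python
from collections import Counter
import math


def tokenize(text):
    """Split text into lowercase whitespace-separated tokens."""
    return text.lower().split()


def train(docs):
    """Count, for each token, how many documents contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return counts


def spam_probability(text, spam_counts, ham_counts, n_spam, n_ham):
    """Naive Bayes estimate of P(spam | tokens), equal priors, add-one smoothing."""
    log_spam = math.log(0.5)
    log_ham = math.log(0.5)
    for tok in set(tokenize(text)):
        log_spam += math.log((spam_counts[tok] + 1) / (n_spam + 2))
        log_ham += math.log((ham_counts[tok] + 1) / (n_ham + 2))
    # Turn the two log scores into a normalized probability
    return 1 / (1 + math.exp(log_ham - log_spam))


# Hypothetical training data, invented for this sketch
splogs = [
    "austin texas real estate apartment cheap",
    "real estate apartment austin texas deals",
    "texas apartment real estate listings austin",
]
normal = [
    "my php script parses xml feeds",
    "notes on mysql indexes and php sessions",
    "weekend hiking photos and recipes",
]

spam_counts = train(splogs)
ham_counts = train(normal)

# A legitimate post that happens to share keywords with the splogs
legit_post = "we just moved into a new apartment in austin texas"
# A hypothetical splog whose words were chosen to look like normal content
evasive_splog = "notes on php sessions and xml feeds"

legit_score = spam_probability(legit_post, spam_counts, ham_counts, len(splogs), len(normal))
evasive_score = spam_probability(evasive_splog, spam_counts, ham_counts, len(splogs), len(normal))

print(f"legitimate post: {legit_score:.3f}")
print(f"evasive splog:   {evasive_score:.3f}")
```

With this tiny training set, the legitimate post scores well above 0.9 because it contains "apartment", "austin" and "texas", while the carefully worded splog scores close to zero. A real filter trained on far more data would behave less dramatically, but the mechanism is the same.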
When could Bayesian filters be useful?
The experiment I ran doesn't reflect real-world conditions. I could give more accurate results if I were testing on the Google index, since a large number of websites would yield more reliable probabilities. But imagine the number of calculations needed just to find a splog, and the size of the database of Bayesian tokens! No thank you, I don't have all that to fight small splogs.
Conclusion
Applying Bayes' Theorem to fight spam could in theory be an excellent solution, but in practice many tricks can be used to bypass these implementations and make them fail. What I was looking for was a fast solution that could help study the behaviour of "splogs" compared to a "normal website".
Member of the PHP Magazine Network, Copyright (C) 2005-2008 phpmagazine.net All Rights Reserved