Tue, 12 Dec 2006
Statistical SpamAssassin
The increased amount of spam in the last few months (this month I had over 100 000 spams) made me reconsider my antispam strategy. I use an approach similar to the one described by Milan Zamazal in his article at LinuxZone: two statistical filters (in my case BogoFilter and CRM114), each of them trained by a different method. However, with spams constructed as an excerpt from the FreeBSD mailing list, accompanied by an image with the (OCR-obfuscated) text of the spam message, this is sometimes not sufficient.
I have discovered that I can classify those "unsure" messages just by looking at their line in mutt's inbox view: they have a different color, telling me that my statistical filters did not agree with each other, and from the color, the sender's name, and the Subject line, I can tell spam from ham. So I had an idea: add a third statistical filter, trained on the From: and Subject: lines, MIME-decoded and converted to UTF-8.
So far it seems that all messages the previous filters classified as "unsure" were classified correctly by my Subject+From filter.
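As a rough sketch, this is how the Subject+From extraction could look in Python (the training itself, i.e. feeding the decoded text to a statistical filter, is omitted, and so is error handling for exotic charsets):

    import email
    import email.header
    import sys

    def decoded_header(msg, name):
        """Return a header MIME-decoded into a single UTF-8 string."""
        parts = []
        for text, charset in email.header.decode_header(msg.get(name, "")):
            if isinstance(text, bytes):
                text = text.decode(charset or "ascii", errors="replace")
            parts.append(text)
        return " ".join(parts)

    # Read a message on stdin and print the text the header filter trains on.
    msg = email.message_from_binary_file(sys.stdin.buffer)
    print(decoded_header(msg, "From"))
    print(decoded_header(msg, "Subject"))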
However, there is a problem: I need to change the whole system of how my statistical filters work. I now have three filters instead of two, and I have to compute the final verdict from all of them (and probably automatically re-train the one which disagreed with the rest). And I can imagine adding more and more heuristics in the future, the same way I have added my Subject+From filter: filtering based on the first few lines of the message only, or a Bayesian filter with a different tokenization, making a token from two adjacent words instead of one to defend against text composed of randomly chosen words (see the sketch below). So how do I add more and more heuristics, some of them trainable?
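For illustration, a minimal sketch of the two-adjacent-words tokenization mentioned above, assuming plain whitespace splitting (a real tokenizer would also strip punctuation and markup):

    def bigram_tokens(text):
        """Tokenize into pairs of adjacent words, so that a message
        composed of randomly chosen dictionary words produces bigrams
        the filter has rarely seen in ham."""
        words = text.split()
        return [" ".join(pair) for pair in zip(words, words[1:])]

    # "buy cheap pills now" -> ['buy cheap', 'cheap pills', 'pills now']
    print(bigram_tokens("buy cheap pills now"))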
SpamAssassin would probably be the first choice. However, manually adjusting the weights of SA's rules is not easy, and the result is not immediately visible. Also, I have no idea how good my statistical filters actually are. And the last problem: there surely is a difference between Bogofilter saying "this message has a 47.9% probability of being spam" and it reporting a probability of 0.01%. So the best way would probably be to re-train the particular statistical filters, and to adjust the total weights of those statistical filters against each other. SpamAssassin is not good for this, because it allows binary rules only, and it allows rules with an insane amount of weight (such as blacklists/whitelists).
Maybe I should use some kind of likelihood-ratio function, as in the Naive Bayes classifier, to evaluate the different heuristics or statistical methods together. [ The log-likelihood ratio is a function with values from −∞ to +∞, where zero means "unsure". These values can be summed to compute the total likelihood, and each one can be multiplied by a constant to give it a different weight. ] Using this would allow me to "meta-train" the whole filter, adjusting the weights of the different heuristics.
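For illustration, a minimal sketch of that combination in Python, assuming every filter can be made to emit a spam probability (the probabilities and per-filter weights below are hypothetical):

    import math

    def log_likelihood_ratio(p_spam, eps=1e-6):
        """Map a filter's spam probability to a log-likelihood ratio:
        0.5 maps to 0 ("unsure"), values near 1 towards +infinity and
        values near 0 towards -infinity. The probability is clamped
        away from the extremes to keep the logarithm finite."""
        p = min(max(p_spam, eps), 1.0 - eps)
        return math.log(p / (1.0 - p))

    def combined_score(probabilities, weights):
        """Weighted sum of per-filter log-likelihood ratios; a positive
        total means spam, a negative one means ham."""
        return sum(w * log_likelihood_ratio(p)
                   for p, w in zip(probabilities, weights))

    # Hypothetical outputs of BogoFilter, CRM114 and the Subject+From
    # filter, with hypothetical weights:
    print(combined_score([0.479, 0.98, 0.92], [1.0, 1.0, 0.5]))

"Meta-training" would then mean nothing more than adjusting the weights according to messages whose correct classification is known.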
What do you think about these two ideas (the Subject+From filter, and the likelihood-based evaluation of different heuristics or statistical filters), my dear lazyweb? How should future spam detection software work?