It’s time to do something about foreign spam

Currently the only spam that makes it into my inbox (aside from the odd 419 here and there) is foreign spam.

These spams don’t seem to end up in SURBL/URIBL or in the Spamhaus SBL, possibly because those lists see very little of this kind of mail.

I imagine that creating some sort of general rule for these is going to be pretty hard. They all tend to be in fairly normal character sets (ISO-8859-1 or one of the Windows-* types), so the charset alone won’t give them away and I’m going to have to do some level of language analysis.

One thing that seems fairly consistent in Spanish/Portuguese spams is the use of “e” as a word on its own. However, given the number of geek mailing lists I’m on, that might also come up as a variable name (or the mathematical constant), so that alone won’t be enough.
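One way to keep a lone “e” from tripping the rule is to count it alongside other common short words and look at the overall ratio. The following is only a rough sketch in Python: the word list, the threshold and the function names are all my own guesses for illustration, not tuned against real mail.

```python
import re

# Hypothetical list of common Portuguese/Spanish function words.
# Words that are also everyday English ("a", "as", "do", "no") are
# deliberately left out to cut down on false positives.
PT_ES_STOPWORDS = {
    "e", "de", "que", "para", "com", "não", "uma", "um", "os", "em",
    "por", "los", "las", "del", "con", "una", "el", "la", "en", "y",
}

# Runs of letters only (Unicode-aware, so "não" stays one word).
WORD_RE = re.compile(r"[^\W\d_]+")


def stopword_ratio(text, stopwords=PT_ES_STOPWORDS):
    """Fraction of the words in text that appear in the stopword list."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in stopwords)
    return hits / len(words)


def looks_iberian(text, threshold=0.25, min_words=8):
    """Guess whether text is Portuguese or Spanish.

    threshold and min_words are assumed values; very short texts are
    never flagged because the ratio is meaningless for them.
    """
    if len(WORD_RE.findall(text)) < min_words:
        return False
    return stopword_ratio(text) >= threshold
```

With this, a mailing list post that mentions the constant e a couple of times scores a low ratio, while a sentence of Portuguese ad copy scores much higher because “e” turns up together with “de”, “um”, “por” and friends.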

Another option is some sort of heuristic language detection like “TextCat”, but that won’t work terribly well for these as a lot of them are mostly images.
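For reference, the n-gram approach that tools like TextCat are built on can be sketched in a few lines of Python. This is a simplified illustration rather than TextCat's actual code: the function names, the n-gram length and the profile size are assumptions, and it can only work on whatever text part a message actually has, so it does nothing for the image-only ones.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Map the most frequent character n-grams in text to their rank (0 = most frequent)."""
    # Collapse whitespace and mark word boundaries with underscores.
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed maximum penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def guess_language(text, lang_profiles):
    """Return the key of the language profile closest to text."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

The language profiles would be built once with ngram_profile() from a decent amount of sample text per language, then passed to guess_language() as a dict such as {"pt": ..., "es": ..., "en": ...}.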

Any suggestions here would be most welcome.


