Language Detection in Haraka

English emails only please!

A while ago someone (somewhere – maybe the Node.js mailing list, I forget) mentioned that they had ported PEAR’s language detection module to Node. This struck me as a great thing to have in Haraka, as I’m continually bombarded with foreign spam which slips through the filters (mostly mainsleaze, and the SBL clearly misses it). There’s only me on my domain, so blocking anything not in English is a reasonable thing to do.

The first step, though, had to be decoding mail body text to UTF-8 properly. There’s been a TODO line in Haraka forever to do this. I just never got around to it.

So after work today I got that done. It took a fair bit of hackery: previously I was decoding base64 line by line, but that can easily break the charset conversion, because a multi-byte character can end up split across two chunks. So I changed it to decode the entire thing into a single buffer, and then it was just a matter of using node-iconv to convert to UTF-8 (and cope with errors).
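To illustrate the partial-character problem, here is a minimal sketch using nothing but Node's built-in Buffer. It is not the Haraka code: the function names are made up for the example, and the base64 is wrapped at 4 characters (real mail usually wraps at 76) purely to force a split in the middle of a character.

```javascript
// Sketch only: plain Node.js Buffers, no Haraka or node-iconv involved.
// All function names here are hypothetical, chosen for the illustration.

const text = 'héllo wörld, ça va?';
const encoded = Buffer.from(text, 'utf8').toString('base64');

// Split a base64 string into fixed-width lines, as a mail body would be.
// A width of 4 chars (= 3 bytes) is unrealistically short, but it
// guarantees that some multi-byte characters straddle a line boundary.
function wrap(b64, width) {
  const lines = [];
  for (let i = 0; i < b64.length; i += width) {
    lines.push(b64.slice(i, i + width));
  }
  return lines;
}

// The fragile approach: decode and convert each line on its own.
// A character split across two lines turns into U+FFFD replacements.
function decodeLineByLine(lines) {
  return lines
    .map((line) => Buffer.from(line, 'base64').toString('utf8'))
    .join('');
}

// The safer approach: collect all the bytes first, convert once at the end.
function decodeWhole(lines) {
  const bytes = Buffer.concat(lines.map((line) => Buffer.from(line, 'base64')));
  return bytes.toString('utf8');
}

const lines = wrap(encoded, 4);
console.log(decodeLineByLine(lines) === text); // false
console.log(decodeWhole(lines) === text);      // true
```

The same reasoning applies when the final conversion is done by iconv rather than `toString('utf8')`: the converter needs to see whole characters, so it is simplest to hand it the complete buffer.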

Then language detection was just a matter of finding the right body part. An email can have both a text/plain part and an HTML part saying the same thing (or sometimes different things); in that case I prefer the HTML part, as it’s more likely to contain relevant text. Then I strip the HTML and pass the result to node-languagedetect (available on npm). This returns a list of candidate languages with confidence scores, and I just pick the top one. Bingo.
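Roughly, the steps look like the sketch below. This is not the 66 lines mentioned in this post: the `{ contentType, text }` shape for a body part, the function names and the regex-based tag stripping are all assumptions made for the example. The detector is passed in as a function so the sketch runs without any npm packages; it is assumed to return `[language, score]` pairs sorted best-first, which is the shape I believe languagedetect produces.

```javascript
// Sketch only. Body parts are assumed to look like
// { contentType: 'text/html; charset=utf-8', text: '...already decoded...' }.
// That shape is hypothetical, not Haraka's real body API.

// Prefer the HTML part, fall back to text/plain.
function pickPart(parts) {
  const html = parts.find((p) => /^text\/html/i.test(p.contentType));
  const plain = parts.find((p) => /^text\/plain/i.test(p.contentType));
  return html || plain || null;
}

// Very naive tag stripper: fine for feeding a language detector,
// not a substitute for a real HTML parser.
function stripHtml(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ') // drop code and CSS
    .replace(/<[^>]+>/g, ' ')                                // drop remaining tags
    .replace(/&nbsp;/gi, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// `detect` is any function that takes text and returns an array of
// [language, score] pairs, best match first (an assumption, see above).
function detectLanguage(parts, detect) {
  const part = pickPart(parts);
  if (!part) return null;
  const isHtml = /^text\/html/i.test(part.contentType);
  const text = isHtml ? stripHtml(part.text) : part.text;
  const results = detect(text);
  return results.length > 0 ? results[0][0] : null; // just take the top one
}
```

With the real module, `detect` would presumably be something like `(text) => new LanguageDetect().detect(text)` after `require('languagedetect')`, and a plugin would reject the mail when the result is anything other than `'english'`.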

66 lines of code. It’s pretty horrible so I won’t put it into the distro. But come the next release (when the iconv stuff is there) I’ll be putting it to the test live on my server.


