I thought I had spam pretty much under control, with only about one getting through every few days. And then came image spam. No suspicious words, just a load of Bayes poison and an image to carry the actual message. Half of my antispam arsenal was suddenly rendered useless. I was back to suffering one or two spam messages every day.

“The level of image spam has increased dramatically this year,” says Carole Theriault, a senior consultant at Sophos cited by New Scientist. According to New Scientist, Sophos estimates that, at the beginning of the year, image spam accounted for only 18% of unsolicited mail but that this has since risen to 40%.

Less impressive but much more useful than statistical FUD from a biased source were the few articles about using optical character recognition to fight image spam, from Debian Administration and Linux Weekly News among others.

But with a production server on my hands and precious little time to maintain it, I wished to stick to packages distributed by Debian, so I waited a little longer for packaging while my users suffered. To my great relief, Christmas came a few days early this year: FuzzyOcr hit Debian unstable yesterday! Mail server administrators, rejoice! Somebody must have been even more pissed off than me about image spam and decided to make the Debian packaging work…

So here is the FuzzyOcr Debian package blurb, straight from the horse’s mouth:

This Spamassassin plugin checks for specific keywords in image/gif, image/jpeg or image/png attachments, using gocr (an optical character recognition program). This plugin can be used to detect spam that puts all the real spam content in an attached image, while the mail itself is only random text and random html, without any URL’s or identifiable information. Additionally to the normal OcrPlugin, it can do approximate matches on words, so errors in recognition or attempts to obfuscate the text inside the image will not cause the detection to fail.

But a debug log is worth a thousand words, so here is some choice output:

[2006-12-19 12:03:39] Debug mode: Starting FuzzyOcr...
[2006-12-19 12:03:39] Debug mode: Attempting to load personal wordlist...
[2006-12-19 12:03:39] Debug mode: No personal wordlist found, skipping...
[2006-12-19 12:03:39] Debug mode: Analyzing file with content-type "image/gif"
[2006-12-19 12:03:39] Debug mode: Image is single non-interlaced...
[2006-12-19 12:03:39] Debug mode: Recognized file type: 1
[2006-12-19 12:03:39] Debug mode: Image hashing disabled in configuration, skipping...
[2006-12-19 12:03:40] Debug mode: Found word "price" in line
"lmatpriceguaranteeftdenrey"
with fuzz of 0 scanned with scanset /usr/bin/gocr -i -
[2006-12-19 12:03:40] Debug mode: Found word "price" in line
"lmatpriceguaranteeftdenrey"
with fuzz of 0 scanned with scanset /usr/bin/gocr -l 180 -d 2 -i -
[2006-12-19 12:03:40] Debug mode: Found word "viagra" in line
"viaraloomgaooomaoo"
with fuzz of 0.166666666666667 scanned with scanset /usr/bin/gocr -i -
[2006-12-19 12:03:40] Debug mode: Found word "viagra" in line
"viaqrastloomaaomaa"
with fuzz of 0.166666666666667 scanned with scanset /usr/bin/gocr -i -
[2006-12-19 12:03:40] Debug mode: Found word "viagra" in line
"viaqrastloomaarialisnomaa"
with fuzz of 0.166666666666667 scanned with scanset /usr/bin/gocr -l 180 -d 2 -i -
[2006-12-19 12:03:40] Debug mode: Found word "cialis" in line
"viaqrastloomaarialisnomaa"
with fuzz of 0.166666666666667 scanned with scanset /usr/bin/gocr -l 180 -d 2 -i -
[2006-12-19 12:03:40] Debug mode: Found word "valium" in line
"valiumlomgaooantivanmgalo"
with fuzz of 0 scanned with scanset /usr/bin/gocr -l 180 -d 2 -i -
[2006-12-19 12:03:40] Debug mode: Found word "legal" in line
"vlaraloomgaoorlalisomaoo"
with fuzz of 0.2 scanned with scanset /usr/bin/gocr -l 180 -d 2 -i -
[2006-12-19 12:03:40] Debug mode: Starting FuzzyOcr...
[2006-12-19 12:03:40] Debug mode: Attempting to load personal wordlist...
[2006-12-19 12:03:40] Debug mode: No personal wordlist found, skipping...
[2006-12-19 12:03:40] Debug mode: FuzzyOcr ending successfully...
[2006-12-19 12:03:40] Debug mode: Message is spam (score 10)...
[2006-12-19 12:03:40] Debug mode: Words found:
"price" in 1 lines
"viagra" in 2 lines
"cialis" in 1 lines
"valium" in 1 lines
"legal" in 1 lines
(6 word occurrences found)
[2006-12-19 12:03:40] Debug mode: FuzzyOcr ending successfully...
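
The "fuzz" figures in the log are easier to read with a small example. What follows is only a sketch in Python, not FuzzyOcr's actual matcher: it assumes that fuzz is the smallest edit distance between the keyword and any substring of the OCR line, divided by the keyword length. The function names and the 0.2 threshold are my own inventions for illustration, and the real plugin need not agree with this sketch on every line.

```python
def best_fuzz(word, line):
    """Smallest edit distance from `word` to any substring of `line`,
    divided by len(word). ASSUMPTION: this is a guess at the metric,
    not FuzzyOcr's own code."""
    if not word:
        raise ValueError("word must not be empty")
    # prev[j]: best distance between the word prefix seen so far and
    # some substring of `line` ending at position j. The first row is
    # all zeros because a match may start anywhere in the line.
    prev = [0] * (len(line) + 1)
    for i, wc in enumerate(word, start=1):
        cur = [i] + [0] * len(line)
        for j, lc in enumerate(line, start=1):
            cost = 0 if wc == lc else 1
            cur[j] = min(prev[j] + 1,         # drop a character of word
                         cur[j - 1] + 1,      # extra character in line
                         prev[j - 1] + cost)  # match or substitution
        prev = cur
    return min(prev) / len(word)


def looks_like(word, line, threshold=0.2):
    """Hypothetical helper: True when the best fuzz is within threshold."""
    return best_fuzz(word, line) <= threshold


print(best_fuzz("price", "lmatpriceguaranteeftdenrey"))
print(best_fuzz("viagra", "viaqrastloomaaomaa"))
```

With this metric an exact hit such as "price" scores 0, while "viaqra" is one substitution away from the six-letter "viagra" and scores 1/6, which is how the sketch reads the 0.1666… values above.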

Sweet, isn’t it? And that antispam OCR goodness is just an ‘apt-get install fuzzyocr’ away!

The only parameters I changed in /etc/FuzzyOcr.cf are the following:

focr_verbose 2.0
focr_logfile /var/log/fuzzyocr.log
focr_timeout 16

The first two are self-explanatory: I just want to know what is going on. The original timeout was 12 seconds and I found that it was often too short for my puny server; apparently 16 seconds is more than enough. I restarted Amavisd-new, which handles calling SpamAssassin, and I was done!
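
Once the log file fills up, a few lines of Python are enough to see which keywords fire most often. This is a rough sketch that assumes the log lines look exactly like the debug output quoted earlier; the function name and the regular expression are mine, not part of FuzzyOcr. It counts every reported match, so a word found by two scansets in the same image is counted twice.

```python
import re
from collections import Counter

# ASSUMPTION: lines follow the format of the debug output quoted above.
FOUND_WORD = re.compile(r'Debug mode: Found word "([^"]+)" in line')


def tally_found_words(log_lines):
    """Count how many times each keyword is reported in the log lines."""
    counts = Counter()
    for line in log_lines:
        match = FOUND_WORD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    '[2006-12-19 12:03:40] Debug mode: Found word "price" in line',
    '"lmatpriceguaranteeftdenrey"',
    '[2006-12-19 12:03:40] Debug mode: Found word "viagra" in line',
    '[2006-12-19 12:03:40] Debug mode: Found word "viagra" in line',
    '[2006-12-19 12:03:40] Debug mode: FuzzyOcr ending successfully...',
]
for word, count in tally_found_words(sample).most_common():
    print(word, count)
```

To run it against the real log, pass an open file instead of the sample list, for example `tally_found_words(open("/var/log/fuzzyocr.log"))`.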

I was afraid that FuzzyOcr would load my host too much, but my fears proved unfounded: FuzzyOcr only scans messages that other SpamAssassin rules or plugins have not already recognized as ham or spam. So the additional load was not noticeable among all the heavy antispam and antiviral machinery already in operation. FuzzyOcr is full of nice surprises!

In conclusion, I must say that, on our mail server, FuzzyOcr is a complete success. I recommend that you install it as soon as possible!