Wednesday, April 1, 2009

All About Spam: The Case of the Productless Spam

All About Spam is a series of blog posts about common spammer techniques. Have a question about a type of spam that you'd like to see in a future blog post? Leave a comment, or send an email to pobox@pobox.com!

The classic spam is a smoking gun, easy to spot. Viagra. University diplomas. My new favorite, the acai berry. But some messages have a twist; they don't appear to be selling anything at all! I received the following email today:
From: hfkunm@winartproje.com
Subject: NYC judge denounces woman's self-styled sting

Militants Attack NATO Terminal In Pakistan

hfkunm and I are not best buds. That's the whole message; it doesn't even have a link in it. Aren't spammers supposed to be selling me something? So, why did a spammer bother sending me this message?

Elementary, my dear readers! The first reason is simple: they could be probing for valid email addresses. A message that doesn't bounce suggests the address is live and worth targeting later.

The second reason: they're trying to beat the system. In 2002, Paul Graham popularized a plan for filtering spam, known as Bayesian filtering: use all your spam and all your ham (legitimate mail) to generate a giant word list. Each word would be given a score, based on how frequently it appeared in spam vs. ham. The idea had two key points:
  • it would learn about new spam words as they were introduced
  • "good" words could offset "bad" words
Good words are words that appear, proportionally, way less often in spam. For example, spammers rarely talk about themselves in the first person, so "I" or "I'm" has a negative spam score. Spammers do want you to click on links, so the word "click" has a positive spam score.
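
To make the scoring concrete, here is a minimal sketch in Python. It is not Graham's exact formula: it assumes a simple log-odds score with add-one smoothing, and the function name word_scores and the tiny sample messages are made up for illustration. A score above zero leans spam, and a score below zero leans ham.

```python
import math
from collections import Counter


def word_scores(spam_messages, ham_messages):
    """Return a log-odds score per word: > 0 leans spam, < 0 leans ham."""
    spam_counts = Counter(w for m in spam_messages for w in m.lower().split())
    ham_counts = Counter(w for m in ham_messages for w in m.lower().split())
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    vocab = set(spam_counts) | set(ham_counts)

    scores = {}
    for word in vocab:
        # Add-one smoothing (an assumption of this sketch) so that a word
        # seen in only one pile never produces log(0) or division by zero.
        p_spam = (spam_counts[word] + 1) / (spam_total + len(vocab))
        p_ham = (ham_counts[word] + 1) / (ham_total + len(vocab))
        scores[word] = math.log(p_spam / p_ham)
    return scores


# Made-up training mail, just enough to show the direction of the scores.
spam = ["click here for cheap viagra", "click now to buy viagra"]
ham = ["i think i will be late", "i'm sure i saw joe today"]

scores = word_scores(spam, ham)
for word in ("click", "viagra", "i"):
    print(word, round(scores[word], 2))
```

Because the scores are rebuilt from whatever mail you feed in, a new spam word picks up a positive score as soon as it shows up in the spam pile, which is the "learning" half of the idea.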

So, how does it all work? Well, let's take that most popular of all spam words, Viagra. Your gossipy friend sends you a message all about herself, and it happens to include "I hear Joe started taking viagra!" A keyword-based spam filter will block any message that contains "viagra", so out it goes. A Bayesian filter would say that all these "I"s outweigh the one "viagra", and let it through.
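
The gossipy-friend case can be sketched side by side. The scores below are hand-picked for illustration (a real Bayesian filter would learn them from your mail), and the names BLOCKLIST, SCORES, keyword_filter_blocks and bayesian_style_score are hypothetical, as is the rule of simply summing the word scores.

```python
# Hypothetical, hand-picked scores: positive leans spam, negative leans ham.
SCORES = {"viagra": 3.0, "click": 1.5, "i": -1.0, "i'm": -1.0}
BLOCKLIST = {"viagra"}


def tokenize(message):
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    return [w.strip('.,!?"') for w in message.lower().split()]


def keyword_filter_blocks(message):
    """Keyword filter: block if any blocklisted word appears at all."""
    return any(word in BLOCKLIST for word in tokenize(message))


def bayesian_style_score(message):
    """Sum the per-word scores; unknown words count as neutral (0)."""
    return sum(SCORES.get(word, 0.0) for word in tokenize(message))


gossip = ("I went out last night and I think I saw Joe. "
          "I hear Joe started taking viagra!")

print(keyword_filter_blocks(gossip))   # the keyword filter blocks it
print(bayesian_style_score(gossip))    # four "I"s outweigh one "viagra"
```

With these made-up numbers the four "I"s contribute -4.0 and the single "viagra" contributes 3.0, so the total stays below zero and the message gets through.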

For a short while, Bayesian filters were all the rage, and very effective, because they were trained per user. Spammers never let a good plan get them down, though, and came up with a simple, ingenious solution: start sending random content. In the early days, it was snippets from great books (read David Copperfield one paragraph at a time!). They've since moved on to simple randomized phrases, and headlines like today's. All these red herrings have certainly degraded the accuracy of Bayesian filters, but like a good detective, spam filters try all the tools in their arsenal, hoping to find the one that closes the case.
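
Here is a rough sketch of why padding a message with book text can work against a word-score filter. Everything in it is an assumption for illustration: the hand-picked SCORES, the THRESHOLD, and the rule of summing scores. The padding is the opening line of David Copperfield.

```python
# Hypothetical, hand-picked scores: positive leans spam, negative leans ham.
SCORES = {
    "viagra": 3.0, "click": 1.5,
    "i": -1.0, "my": -0.5, "whether": -0.5,
}
THRESHOLD = 2.5  # assumed cutoff: totals at or above this count as spam


def score(message):
    """Sum per-word scores; unknown words count as neutral (0)."""
    words = [w.strip('.,;!?"') for w in message.lower().split()]
    return sum(SCORES.get(w, 0.0) for w in words)


def is_spam(message):
    return score(message) >= THRESHOLD


pitch = "Click for viagra"
padding = ("Whether I shall turn out to be the hero of my own life, "
           "or whether that station will be held by anybody else, "
           "these pages must show.")
padded = pitch + " " + padding

print(score(pitch), is_spam(pitch))     # the bare pitch is caught
print(score(padded), is_spam(padded))   # the padded pitch slips under
```

The ordinary words in the padding carry ham-leaning scores, so they drag the total under the cutoff even though the pitch itself hasn't changed.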


1 comment:

  1. One of the very good things you've put in place is to whitelist recipients we email thru your SMTP.
    Another good thing is greylisting of mail servers: please come back in 5 minutes, and if they do come, you whitelist them.

    Christophe
