How to stop spam – overview

Nobody wants to deal with spam, so I see more and more ineffective spam-blocking techniques, especially on web sites, that end up being more of an inconvenience to the user than anything else.

So, I thought a guide on how to effectively block spam (I receive almost none now) while keeping your applications usable was in order.

Let’s dive right in (we’re trying to cut down on useless stuff to read, after all).

There are 3 steps to preventing spam:
1) Define “spam”
2) Define the criteria that identify spam according to your definition.
3) Implement a system to enforce those criteria.

1) Define “spam”

The biggest problem most poorly implemented spam systems have is that they haven’t clearly defined what they consider “spam” to be. Sure, you can rattle off catch phrases like “UCE”, but unsolicited email that’s selling you stuff isn’t really an effective definition, since it doesn’t include random book clippings and it does include an ad offering you a 90% discount on your favorite software from a new computer store that opened near you last week.

The true definition of “spam” will vary based on your application. We’ll look at some examples below, but the fundamental point is: you need to spend some time thinking about what spam actually is in your particular situation.

2) Define the criteria that identify spam according to your definition

In order to filter out the spam, you need to have specific criteria to identify what stays and what goes. Think of this as a manual process first. If you’re reading an email, what defines what you delete versus what you keep? If you’re setting up a web app, what users would you delete versus keep?

3) Implement a system to enforce those criteria

This is probably an automated system, but could even include manual filtering if appropriate (not that I can think of such a situation offhand). You’ve already done this somewhere, such as turning on your email client’s junk mail filter. But without having done steps 1 and 2, you may be filtering ineffectively, inconveniencing yourself or your users needlessly, and possibly missing spam and deleting “ham” (legitimate messages).

So where you’re filtering spam in any form, analyze it through all 3 steps, and see how it changes your spam handling.
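
To make the three steps concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the Item type, the build_filter helper, the KNOWN_SENDERS set, and the example definition of spam itself ("anything from a stranger that contains a link"). Your own definition from step 1 will be different, and that is the point.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    """Anything you might filter: an email, a comment, a signup."""
    sender: str
    text: str


# Step 2: each criterion is a plain yes/no question you would ask by hand.
Criterion = Callable[[Item], bool]


def build_filter(criteria: List[Criterion]) -> Callable[[Item], bool]:
    """Step 3: enforce the criteria. An item is spam if any criterion matches."""
    def is_spam(item: Item) -> bool:
        return any(criterion(item) for criterion in criteria)
    return is_spam


# Step 1 (hypothetical definition, for illustration only):
# "spam is anything from a stranger that contains a link".
KNOWN_SENDERS = {"fan@example.com"}


def stranger_with_link(item: Item) -> bool:
    return item.sender not in KNOWN_SENDERS and "http" in item.text.lower()


is_spam = build_filter([stranger_with_link])
print(is_spam(Item("fan@example.com", "See http://example.com/tour")))  # False
print(is_spam(Item("bot@example.net", "Buy now http://example.net")))   # True
```

Keeping the definition (the criterion functions) separate from the enforcement (build_filter) means you can change what counts as spam without touching the system that acts on it.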

Putting it into practice

Theory is nice, practical application is better, so let’s look at a few examples:

Email

I’ll start here since it’s obviously the biggest, and original, problem area for spam.

What defines a “spam” email message? Before you start trying to answer, consider the scope of the definition. To you, messages about “enlarging your pen1s” and “v1agra” (or even “Viagra”) may be spam, but to the older guy with the subscription from an actual Canadian pharmacy, or to the SNL comedian writing a “top 10 spammers” sketch, they’re not.

So, how do we block what we can’t define? We don’t. We figure out who needs to define it and we automate that definition.

Defining spam in email must be done at the user account level.

This has several important implications:

  1. It limits ISP-level spam prevention.
    Filtering all incoming email doesn’t really work at the ISP level. You can block known spam relays and other bad emailing practices, but you can’t really apply a system-wide Bayesian filter.
  2. Your email client’s filter may actually be the best way for you to stop spam.

The best spam prevention you could use for email would be a server-side Bayesian filter that was trained by each user. Ideally, that filter would be accessible from every email account the user had (or they’ll have to re-train it for every email account). The filtering system should also be aware of the user’s address book so it can use it as a whitelist, and be able to use DNS block lists and SPF to handle misbehaving systems in general.
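
Here is a rough sketch of what "trained by each user" means, assuming a toy naive Bayes classifier with add-one smoothing. The UserSpamFilter class, its method names, and the 0.9 threshold are all made up for this example. A real filter (SpamAssassin, SpamSieve and friends) does far more than this; the sketch only shows that the same message can get opposite verdicts for two users, and that the address book whitelist wins over the classifier.

```python
import math
import re
from collections import Counter


class UserSpamFilter:
    """Toy per-user naive Bayes filter (illustrative, not production-grade)."""

    def __init__(self, address_book=()):
        self.whitelist = {addr.lower() for addr in address_book}
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """label must be "spam" or "ham"; this is the user's own judgment."""
        self.word_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        total = sum(self.message_counts.values())
        if total == 0:
            return 0.5  # untrained: no opinion either way
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed prior, then smoothed per-word likelihoods.
            score = math.log((self.message_counts[label] + 1) / (total + 2))
            n_words = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                seen = self.word_counts[label][word]
                score += math.log((seen + 1) / (n_words + len(vocab) + 1))
            log_score[label] = score
        # Turn the two log scores into P(spam); clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))

    def is_spam(self, sender, text, threshold=0.9):
        if sender.lower() in self.whitelist:
            return False  # the address book overrides the classifier
        return self.spam_probability(text) >= threshold


# Two users, two definitions of spam for the very same message.
alice = UserSpamFilter(address_book=["doc@example.com"])
alice.train("cheap viagra pharmacy discount", "spam")
alice.train("lunch meeting tomorrow at noon", "ham")

bob = UserSpamFilter()
bob.train("cheap viagra pharmacy discount", "ham")  # his actual pharmacy
bob.train("win a free cruise now", "spam")

message = "pharmacy discount on viagra"
print(alice.spam_probability(message) > 0.5)  # True
print(bob.spam_probability(message) > 0.5)    # False
```

The design choice that matters is that the training data lives with the user, not with the server. That is why the same filter would ideally follow the user across all of their accounts.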

In practice, this ranges from tricky to impossible for the average user. However, you can come close by running SpamSieve on a Mac and having it filter all your IMAP mail. There are many tricks and implementation ideas available in SpamSieve’s help and on its web site. Using it, I get over 99% accuracy on my spam filtering, with no false positives. It’s not ideal, since it allows spam to get to my server, creating a race condition if I’m running multiple IMAP clients (which of course I do). But if you’re a one-client or POP type of user, this is a quick and effective method to accurately nuke your spam. Sysadmins with their own mail servers can run SpamAssassin, DNS blacklists, etc. I recommend syncing your address book as a whitelist as well, and *not* blocking IP address ranges unless your spam definition supports it.
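
For the sysadmins: a DNS blacklist lookup is just a DNS query for the sending IP address with its octets reversed, followed by the blacklist’s zone name. A listed address conventionally resolves to something in 127.0.0.0/8, and an unlisted one fails to resolve. Below is a small sketch of that, for IPv4 only. The function names are mine, and dnsbl.example.org is a placeholder zone, not a real blacklist; substitute whichever list fits your own spam definition. Note that it checks one address at a time, not a range.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets plus the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # raises ValueError if invalid
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone lists this address.

    Assumption: any lookup failure (including a timeout) is treated as
    "not listed", so a broken blacklist never blocks mail by itself.
    """
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except OSError:
        return False
    return answer.startswith("127.")


# "dnsbl.example.org" is a placeholder zone used only for illustration.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
# 99.2.0.192.dnsbl.example.org
```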

Web applications

(or “why CAPTCHAs don’t work and what to do instead”)

CAPTCHAs are my biggest pet peeve. CAPTCHAs are popping up on practically every web page out there, and 1 day of programming an open source distributed work flow program would render them ineffective (and I’ve been really tempted to write it).

“But it stops spam bots from accessing the site”, you say. No, it makes your users type in extra stuff that is frequently illegible and wastes their time. It also inconveniences spammers in the same way. So, you’re hoping that your users have more to gain than the spammers. (Hint: the spammers probably have more to gain than your users.) Also, a CAPTCHA is designed to “make sure a human, not a computer, is accessing a site”. Last I checked, humans can’t access a web site without a computer, so your goal is not really to make sure computers can’t access your site.

Instead of resorting to a CAPTCHA, take a moment to define what spam is for you. Let’s look at some examples:

New user accounts

If you were manually talking to or emailing with each user, how would you decide who was legit and who was a spammer? If you have a paid service, perhaps anyone who is paying is acceptable. If your profits are ad-based, you want to make sure people are seeing (or clicking) ads on your site. Did you notice that last one? It includes people who are blocking ads in their web browser. A CAPTCHA won’t stop them, but you probably *do* want to stop them because they’re just stealing your bandwidth! In that case, instead of using a CAPTCHA, refuse connections from systems not requesting your ads.

Think of criteria that, as you manually look through accounts, identify fake users on your service. Ask cross-referencing questions if you have to. Check with other sites. Do whatever you’d do manually and program it. Then deactivate those accounts.

Have your signup process include validation that meets your criteria. For example, validate email addresses, and require and validate credit card info for a free trial if appropriate. Don’t try to second-guess what someone will do to abuse your service. Think of what a valid user will do, and what your goal for the site is, and qualify the user accordingly. Remember the basic rule of good sales: qualify your customer. You don’t want everyone, you only want good customers. Screen the rest up front. It’ll help save you and them time, effort, and frustration. It’ll also screen out your spammers.
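
As a sketch of “qualify your customer” in code, assume a signup record with a few hypothetical fields (email_confirmed, wants_free_trial, card_verified) that your own confirmation email and payment provider would fill in. The Signup type, the qualify function and the simple email pattern are all invented for this example, and the pattern only checks the rough shape of an address, not whether it exists.

```python
import re
from dataclasses import dataclass
from typing import List

# Deliberately loose shape check: something@something.something
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass
class Signup:
    email: str
    email_confirmed: bool   # hypothetical: user clicked the confirmation link
    wants_free_trial: bool
    card_verified: bool     # hypothetical: payment provider accepted the card


def qualify(signup: Signup) -> List[str]:
    """Return the reasons this signup does not qualify (empty list = welcome)."""
    problems = []
    if not EMAIL_SHAPE.match(signup.email):
        problems.append("email address is malformed")
    elif not signup.email_confirmed:
        problems.append("email address is not confirmed")
    if signup.wants_free_trial and not signup.card_verified:
        problems.append("free trial requires a verified card")
    return problems


print(qualify(Signup("fan@example.com", True, True, True)))
# []
print(qualify(Signup("fan@example.com", False, True, False)))
# ['email address is not confirmed', 'free trial requires a verified card']
```

Notice that every check describes what a valid user does, not what a spammer might try. The list of reasons also lets you tell a legitimate user exactly what is missing, which a CAPTCHA never does.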

In-service messages and comments

This is a lot like email filtering, and you have even more control because you “know” the sender as well as the receiver, and can remove messages after they’ve been sent. Hopefully you’ve already screened your users too. Ideally, you could tap into the user’s centralized spam filtering system (the same dream system they set up for their email above). But realistically you’ll need to have users identify spam, apply user-specific Bayesian filters, decide what constitutes use you want to prevent (this one is tricky), and block that use. Or, define use you want to allow (how does it benefit you to let them send or read this message?) and block use that does not comply.

You may need to think about your service here more than just about your users. Do you really care if a user sends a message to every fan of a certain band inviting them to listen to their music, or telling them that the band has a concert coming up in their area soon? What if the user has 10 people logged in sending those messages manually and looking at your ads? What if they’re using a program, but the program displays the ads to people as it runs? What if the program clicks the ads? (Careful, that’ll make your advertisers mad, but think about it.)

Think of these things first without placing technical (or other) limits on them. Worry about how to implement them in step 3. That way you’re not stifling your ideas, and you will probably come up with things you didn’t think you could do, or wouldn’t have thought of doing, otherwise.
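
Here is one way the “define use you want to allow” approach could look, combined with removing messages after the fact. The allowed-use rule (you may only message people who follow you), the report limit, and all of the names (Note, Inbox, follows, report_spam) are assumptions for the sake of the example. Your service’s rule will come out of your own steps 1 and 2.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Note:
    note_id: int
    sender: str
    recipient: str
    body: str


@dataclass
class Inbox:
    # Hypothetical allowed-use rule: recipient -> senders they chose to follow.
    follows: Dict[str, Set[str]]
    report_limit: int = 3
    delivered: Dict[int, Note] = field(default_factory=dict)
    reporters: Dict[str, Set[str]] = field(default_factory=dict)

    def send(self, note: Note) -> bool:
        """Deliver only messages that match the allowed use; block the rest."""
        if note.sender not in self.follows.get(note.recipient, set()):
            return False
        self.delivered[note.note_id] = note
        return True

    def report_spam(self, note_id: int, reporter: str) -> int:
        """Let a recipient flag a message; return how many messages were removed.

        Once report_limit distinct recipients have flagged the same sender,
        every delivered message from that sender is removed.
        """
        note = self.delivered.get(note_id)
        if note is None or note.recipient != reporter:
            return 0
        flagged_by = self.reporters.setdefault(note.sender, set())
        flagged_by.add(reporter)
        if len(flagged_by) < self.report_limit:
            return 0
        doomed = [i for i, n in self.delivered.items() if n.sender == note.sender]
        for i in doomed:
            del self.delivered[i]
        return len(doomed)


inbox = Inbox(follows={"ann": {"band"}, "bob": {"band"}}, report_limit=2)
print(inbox.send(Note(1, "stranger", "ann", "buy stuff")))  # False
print(inbox.send(Note(2, "band", "ann", "gig on Friday")))  # True
```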

Summary

There are many possible applications of this technique – far too many to discuss in a single article. If you have an area in which you’re working on preventing spam, post a comment and I’ll try to write it up as an example scenario.