Thwarting comment spam

January 15, 2007

There are a variety of approaches to combatting blog spam. Each has its pros and cons, and none of them is perfect. Three common approaches that I've seen:

1. CAPTCHAs can be effective, but spammers are getting better at finding ways around them. Generally speaking, the better the CAPTCHA is at keeping out spammers, the harder it is for your users to decipher.

2. Bayesian algorithms, keyword filters, and other types of content analysis are good, but spammers are getting smarter. Blog spam is becoming increasingly on-topic, making it harder to filter it out and requiring more manual work.

3. Requiring a login or subscription to a service is a good approach in theory, but in practice spammers have no problem setting up bogus accounts. Then there's the fact that most people don't like creating accounts just to post comments.

Recently I decided to test out a new approach to dealing with comment spam. It occurred to me that most spammers don't necessarily crawl my blog at the same moment they spam it. It seems more likely that they have bots that crawl the web to find comment forms (or even subscribe to services that do this) and then systematically spam those forms some time later.

So I set up a simple PHP script that ties into Apache's mod_rewrite to manage the URLs for my blog. Basically, the URL for the comment form changes every day. The new URL contains an effectively random 32-character string, making it impossible to guess the correct URL without actually going to the blog. The end result has been an 80% reduction in comment spam. Combined with MT's junk filters, it has proven to be extremely effective.
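
To make the idea concrete, here is a minimal sketch in Python rather than PHP. It is not the actual script, just one way a daily-changing 32-character string could be derived and checked. The secret key, the URL layout, the function names, and the one-day grace period are all assumptions made for illustration.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical secret (an assumption): in practice, load a long random value
# from configuration that is not served by the web server.
SECRET_KEY = b"replace-with-a-long-random-secret"


def comment_token(day: date, secret: bytes = SECRET_KEY) -> str:
    """Return a 32-character hex token that changes once per calendar day."""
    digest = hmac.new(secret, day.isoformat().encode("ascii"), hashlib.sha256)
    # Truncate the 64-character SHA-256 hex digest to 32 characters.
    return digest.hexdigest()[:32]


def comment_form_path(day: date) -> str:
    """Build the comment-form URL path for a given day (layout is an assumption)."""
    return f"/comments/{comment_token(day)}/post"


def is_valid_token(token: str, today: date, grace_days: int = 1) -> bool:
    """Accept today's token and, by default, yesterday's.

    The grace period lets a reader who loaded the page shortly before
    midnight still submit a comment after the token has rotated.
    """
    for offset in range(grace_days + 1):
        expected = comment_token(today - timedelta(days=offset))
        # Constant-time comparison; encode so arbitrary input cannot raise.
        if hmac.compare_digest(token.encode("utf-8"), expected.encode("ascii")):
            return True
    return False
```

A rewrite rule would then only need to pass the string from the requested URL to a check like is_valid_token() before handing the request to the real comment script. Anything that fails the check can be rejected outright.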

The next step will be to adjust the frequency of the URL changes to see how that affects the remaining 20% — if it ends up making a big enough dent, I'll make the source public to give people an additional weapon in the war on spam.

Posted by jon at January 15, 2007 1:12 AM

Comments

You might be interested in my CPR plug-in for WordPress, which takes a similar approach to stopping comment spam. A little JavaScript adds a pseudo-random string of around 40 to 50 chars to the URL of the script that the POST request must be directed to. If the submitted POST request does not contain the same string in the URL, it gets blocked. Any included code (and it must be the same as found in the JS code) will be rewritten with mod_rewrite to the real script, with the code as a GET variable like: wp-comments-post.php?cpr_code=xxxxx.

xxxxx is a daily-changing, server-dependent hash of several pieces of blog-unique data. :) Luckily md5() or sha1() cannot be reversed, so no spambot will know e.g. a random string which you have entered and the last modification timestamp in Uni*-format of the script. :)

Hope you will stop by and rewrite it for your blog software. :)

Posted by: Quix0r at January 18, 2007 5:55 AM

Oh, I did not fully read your post. So if you like, you can implement some parts of my plug-in in yours.

Please note that JavaScript-generated cookies may be blocked in some browsers for privacy reasons. And HTTP cookies (in the header) are easily faked. :(

Posted by: Quix0r at January 18, 2007 5:58 AM

Uncle Jon, I don't understand a word you've said. Couldn't you put something for kids on your website?

Posted by: Zack Griffin at February 17, 2007 6:22 PM

Nice write-up. I think that one issue is defining what comment spam is, as this seems to mean different things to different people. To a spammer, using automated means or a completely irrelevant pre-made template is spamming, whereas a poorly written, sort-of-on-topic comment with a link, or 4, is not. On the other end you have people who consider most legitimate comments with a URL entered to be spam.

You need different means for different purposes. Putting the no-follow attribute up would stop many of the "manual spammers", whereas a CAPTCHA would stop most/virtually all automated blog postings.

Joey Jensen.

Posted by: Joey Jensen at August 5, 2008 3:09 PM