brainwagon

Fighting Referer Spam

December 21, 2004 | Blogging, General, My Projects | By: Mark VandeWettering

In the last couple of days, I’ve been targeted by referer spam bots. These dorks access pages on a weblog repeatedly in an attempt to get their referer URL listed on its home page. I’ve been trying to figure out how to combat this behavior, and can see two different ways of dealing with it:

1. Ping the referer back, and make sure it does link to my site. Probably slow and not scalable, particularly in the situation I have with asymmetric bandwidth.
2. Blacklist sites which generate bursts of referer traffic. If we get lots of referers to a particular url in a short period of time, put them in a database of blacklisted sites and keep them from ever appearing in the referer list.

The second seems easy, but I must admit: the query to find such bursts seems difficult to write. I’ll continue to think it over, but does anyone have any suggestions?
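
One possible way to approach the blacklisting idea is to skip the after-the-fact query and count hits per referring host in a sliding time window as they arrive. The sketch below is only an illustration under that assumption: the class name, the thresholds (20 hits in 10 minutes), and the in-memory storage are all hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class ReferrerBurstDetector:
    """Blacklist referring hosts that appear too often in a short window.

    Hypothetical sketch: names and default thresholds are illustrative only.
    """

    def __init__(self, max_hits=20, window_seconds=600):
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)  # host -> timestamps of recent hits
        self.blacklist = set()

    def record(self, referrer_url, timestamp):
        """Record one hit; return True if the referrer may still be listed."""
        host = (urlsplit(referrer_url).hostname or "").lower()
        if not host or host in self.blacklist:
            return False
        hits = self.recent[host]
        hits.append(timestamp)
        # Drop hits that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        if len(hits) > self.max_hits:
            # Too many hits in the window: blacklist the host for good.
            self.blacklist.add(host)
            del self.recent[host]
            return False
        return True
```

Calling record() once per logged request keeps the bookkeeping proportional to the number of recent hits, and the blacklist set could be persisted to whatever database the weblog already uses. A legitimate link from a very popular site would also trip the threshold, so the limits would need tuning against real traffic.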


Comments

Comment from Rich Ozer
Time 12/21/2004 at 5:16 pm

Mark,

Run them through spamassassin. The latest version has URI tests as well as SPF and DNSBL. SA will return a score that you can then use to either accept or block the address. This presumes that the URLs are also spammy.

Comment from Matt May
Time 12/21/2004 at 5:41 pm

I’ve been getting hammered with these, too. My inclination is to feed blank files to anyone with a referrer at *.info, since that’s where the majority of them appear to be coming from, but I agree, it’s a major pain in the ass. Perhaps somebody in the WordPress community is working on it?

Comment from Rich Ozer
Time 12/22/2004 at 1:02 pm

I guess I’m not the first person to think of the similarities to spamassassin… this might be worth checking out… came off the spamassassin users list today:

Hello all,

Considering the latest press on blog comment spam, I think that it’s
time that we organize a cross-platform project to address the problem.
There are a considerable number of plugins implemented for various blog
software with the intent of reducing blog spam but many are ineffective
or require a tremendous amount of work to maintain (Jay’s mt-blacklist
plugin is definitely the latter).

http://news.netcraft.com/archives/2004/12/17/hosts_disable_movable_type_as_comment_spam_slows_servers.html
http://it.slashdot.org/article.pl?sid=04/12/18/1827225&tid=111&tid=128
http://www.sixapart.com/log/2004/12/more_on_comment.shtml

I propose that we create a subproject of Apache SpamAssassin to
encourage collaborative research in the area of anti blog spam with the
goal of producing cross-platform standards and implementations of
workable comment spam solutions. SpamAssassin’s expertise of anti-spam
in the e-mail domain will complement the knowledge of the weblogging
community.

Here are some of the ideas that I would like to explore further and see
incorporated into standard installations of blogging software:

* Proof-of-work: A legitimate user will take several seconds to minutes
to create each unique comment while a comment spammer sends them out as
fast as possible. Consider a proof-of-work algorithm executed within
the browser (e.g. javascript, java, activex) added to comment submission
forms. The weblog software can safely reject all comment submissions
that lack valid proof of work. Legitimate users will not be
inconvenienced by a short delay as they submit their comment while
spammers will not be able to easily submit comments in large volumes.
For example, if a typical comment spammer sends 1000000 comments per day
and the proof of work requires 2 seconds of compute time then they will
need to dedicate 24 machines to proof-of-work computation to maintain
their rate of transmission. The cons of this method are that users
without advanced browsers or older, slow computers may not be able to
post comments.

There is a javascript implementation of Hashcash that can be combined
with SpamAssassin’s hashcash verification and duplicate detection
algorithms to quickly produce a prototype.

* Collaborative filtering: IronPort maintains a database of e-mail
server traffic volumes called SenderBase. Mail servers can use
SenderBase to find “traffic spikes” and potentially block e-mail from
those servers. Something similar could be done for weblogs. As
comments come in, weblogs could report the urls in the comments to a
central server. If a URL is sent in too rapidly, it can be added to a
list of probable spam urls and weblogs can quarantine or delete comments
containing that url.

* DNS-based URI Blocklists: SpamAssassin has had great success using
Jeff Chan’s Spam URI Realtime Blocklists. When an e-mail arrives,
SpamAssassin extracts the urls contained within and performs a few DNS
TXT queries to find whether the url has been reported in spam. These
blocklists can be used for weblogs too. Instead of Jay maintaining a
central blocklist that people download and install manually,
mt-blacklist could use a DNS-based blocklist that is effectively updated
in real time. This would significantly cut down on comment spam because
weblog owners would not need to actively maintain their blocklists. The
submission process could be streamlined so that it doesn’t consume so
much of any one person’s time.
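
The proof-of-work idea in the first bullet can be sketched roughly as follows. This is a simplified, hashcash-style illustration and not the actual Hashcash stamp format or SpamAssassin's verification code: the function names, the "challenge:nonce" string layout, and the 16-bit default difficulty are all assumptions made for the example. The last function reproduces the machine-count arithmetic from the same bullet.

```python
import hashlib
import math
from itertools import count


def leading_zero_bits(digest):
    """Count the leading zero bits of a bytes digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def mint(challenge, bits=16):
    """Client side: search for a nonce whose hash has `bits` leading zero bits."""
    for nonce in count():
        digest = hashlib.sha1(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= bits:
            return nonce


def verify(challenge, nonce, bits=16):
    """Server side: a single hash is enough to check the submitted nonce."""
    digest = hashlib.sha1(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= bits


def machines_needed(comments_per_day, seconds_per_proof):
    """Machines a spammer must dedicate to keep up a given comment rate."""
    return math.ceil(comments_per_day * seconds_per_proof / 86_400)
```

The cost is asymmetric: minting takes about 2**bits hash attempts on average, while verifying takes one. With the figures from the bullet, machines_needed(1_000_000, 2) gives 24. In a real form the challenge would have to be unique per comment form and rejected once used, otherwise a spammer could reuse a single valid proof.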

I’m very interested to hear any comments that you may have on this idea
and encourage you to pass this information on to your developer lists as
well as to other weblog software developers that I have missed.

I look forward to collaborating with you in the future.

Best regards,

Henry Stern
Committer, SpamAssassin

Comment from Rich Ozer
Time 12/22/2004 at 1:09 pm

One more effort in this area:

http://www.hjackson.org/blog/archives/2004/11/moveable_type_s.html
