Fighting Referer Spam

In the last couple of days, I’ve been targeted by referer spam bots. These dorks access pages on a weblog repeatedly in an attempt to get their referer tag listed on its home page. I’ve been trying to figure out how to combat this behavior, and can see two different ways of dealing with it:

  • Ping the referer back, and make sure it does link to my site. Probably slow and not scalable, particularly in the situation I have with asymmetric bandwidth.
  • Blacklist sites which generate bursts of referer traffic. If we get lots of referers to a particular url in a short period of time, put them in a database of blacklisted sites and keep them from ever appearing in the referer list.

The second seems easy, but I must admit: the query to find such bursts seems difficult to write. I’ll continue to think it over, but does anyone have any suggestions?
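
One way to sidestep the hard query is to count hits as they arrive instead of mining the log afterwards. What follows is only a minimal sketch of the second option, under some assumptions that are not part of the plan above: the class name, the thresholds (20 hits in 5 minutes), keying on the referer's host rather than the full URL, and keeping everything in memory instead of in a database are all made up for illustration.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class ReferrerBurstDetector:
    """Blacklist referer hosts that send too many hits in a short window.

    Hypothetical helper: the name and default thresholds are assumptions.
    """

    def __init__(self, max_hits=20, window_seconds=300):
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)  # host -> timestamps of recent hits
        self.blacklist = set()

    def record(self, referrer, timestamp):
        """Record one hit; return True if this referer may still be listed."""
        host = (urlsplit(referrer).hostname or "").lower()
        if not host or host in self.blacklist:
            return False
        hits = self.recent[host]
        hits.append(timestamp)
        # Drop hits that have fallen out of the sliding window.
        while hits and hits[0] <= timestamp - self.window_seconds:
            hits.popleft()
        if len(hits) > self.max_hits:
            # Too many hits inside the window: blacklist the host for good.
            self.blacklist.add(host)
            del self.recent[host]
            return False
        return True
```

Calling `record(referrer, time.time())` on each page view would keep the blacklist current; a real install would presumably persist `blacklist` in the same database as the referer list.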

4 thoughts on “Fighting Referer Spam”

  1. Rich Ozer

    Mark,

    Run them through spamassassin. The latest version has URI tests as well as SPF and DNSBL. SA will return a score that you can then use to either accept or block the address. This presumes that the URLs are also spammy.

  2. Matt May

    I’ve been getting hammered with these, too. My inclination is to feed blank files to anyone with a referrer at *.info, since that’s where the majority of them appear to be coming from, but I agree, it’s a major pain in the ass. Perhaps somebody in the WordPress community is working on it?

  3. Rich Ozer

    I guess I’m not the first person to think of the similarities to spamassassin… this might be worth checking out… came off the spamassassin users list today:

    Hello all,

    Considering the latest press on blog comment spam, I think that it’s
    time that we organize a cross-platform project to address the problem.
    There are a considerable number of plugins implemented for various blog
    software with the intent of reducing blog spam but many are ineffective
    or require a tremendous amount of work to maintain (Jay’s mt-blacklist
    plugin is definitely the latter).

    http://news.netcraft.com/archives/2004/12/17/hosts_disable_movable_type_as_comment_spam_slows_servers.html
    http://it.slashdot.org/article.pl?sid=04/12/18/1827225&tid=111&tid=128
    http://www.sixapart.com/log/2004/12/more_on_comment.shtml

    I propose that we create a subproject of Apache SpamAssassin to
    encourage collaborative research in the area of anti blog spam with the
    goal of producing cross-platform standards and implementations of
    workable comment spam solutions. SpamAssassin’s expertise of anti-spam
    in the e-mail domain will complement the knowledge of the weblogging
    community.

    Here are some of the ideas that I would like to explore further and see
    incorporated into standard installations of blogging software:

    * Proof-of-work: A legitimate user will take several seconds to minutes
    to create each unique comment while a comment spammer sends them out as
    fast as possible. Consider a proof-of-work algorithm executed within
    the browser (e.g. javascript, java, activex) added to comment submission
    forms. The weblog software can safely reject all comment submissions
    that lack valid proof of work. Legitimate users will not be
    inconvenienced by a short delay as they submit their comment while
    spammers will not be able to easily submit comments in large volumes.
    For example, if a typical comment spammer sends 1000000 comments per day
    and the proof of work requires 2 seconds of compute time then they will
    need to dedicate 24 machines to proof-of-work computation to maintain
    their rate of transmission. The cons of this method are that users
    without advanced browsers or older, slow computers may not be able to
    post comments.

    There is a javascript implementation of Hashcash that can be combined
    with SpamAssassin’s hashcash verification and duplicate detection
    algorithms to quickly produce a prototype.

    * Collaborative filtering: IronPort maintains a database of e-mail
    server traffic volumes called SenderBase. Mail servers can use
    SenderBase to find “traffic spikes” and potentially block e-mail from
    those servers. Something similar could be done for weblogs. As
    comments come in, weblogs could report the urls in the comments to a
    central server. If a URL is sent in too rapidly, it can be added to a
    list of probable spam urls and weblogs can quarantine or delete comments
    containing that url.

    * DNS-based URI Blocklists: SpamAssassin has had great success using
    Jeff Chan’s Spam URI Realtime Blocklists. When an e-mail arrives,
    SpamAssassin extracts the urls contained within and performs a few DNS
    TXT queries to find whether the url has been reported in spam. These
    blocklists can be used for weblogs too. Instead of Jay maintaining a
    central blocklist that people download and install manually,
    mt-blacklist could use a DNS-based blocklist that is effectively updated
    in real time. This would significantly cut down on comment spam because
    weblog owners would not need to actively maintain their blocklists. The
    submission process could be streamlined so that it doesn’t consume so
    much of any one person’s time.
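
The lookup step behind a DNS-based blocklist can be sketched roughly as below. This is an illustration under stated assumptions, not SpamAssassin's or mt-blacklist's code: the zone name `uribl.example.org` is a placeholder, both function names are hypothetical, the full host name is queried without reducing it to a registered domain, and the sketch does an ordinary address lookup through Python's standard library (which has no TXT lookup) rather than the TXT queries the message mentions.

```python
import socket
from urllib.parse import urlsplit

# Assumption: placeholder zone name, not a real blocklist.
BLOCKLIST_ZONE = "uribl.example.org"


def blocklist_query_name(url, zone=BLOCKLIST_ZONE):
    """Build the DNS name to look up for the host in `url`."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        raise ValueError(f"no host in URL: {url!r}")
    return f"{host}.{zone}"


def is_listed(url, resolve=socket.gethostbyname, zone=BLOCKLIST_ZONE):
    """Return True if the blocklist zone has an address record for the host.

    `resolve` is injectable so the check can be exercised without network
    access; a name that does not resolve is treated as "not listed".
    """
    try:
        resolve(blocklist_query_name(url, zone))
    except socket.gaierror:
        return False
    return True
```

Because the list lives in DNS, a weblog would call something like `is_listed(url)` for each URL in an incoming comment and never download or maintain a local copy.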

    I’m very interested to hear any comments that you may have on this idea
    and encourage you to pass this information on to your developer lists as
    well as to other weblog software developers that I have missed.

    I look forward to collaborating with you in the future.

    Best regards,

    Henry Stern
    Committer, SpamAssassin
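
The proof-of-work idea in the message above can be sketched in a few lines. This is a Hashcash-style toy, not the real Hashcash stamp format and not SpamAssassin's verifier: the function names, the `challenge:nonce` string layout, and the choice of difficulty are assumptions for illustration. The last helper just redoes the message's arithmetic (1,000,000 comments a day at 2 seconds each).

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count the zero bits at the start of a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge, difficulty_bits):
    """Client side: try nonces until the hash has enough leading zero bits.

    Expected work doubles with every extra bit of difficulty.
    """
    for nonce in count():
        digest = hashlib.sha1(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce


def verify(challenge, nonce, difficulty_bits):
    """Server side: a single hash is enough to check the claimed work."""
    digest = hashlib.sha1(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def machines_needed(comments_per_day, seconds_per_proof):
    """Whole machines needed to sustain a sending rate around the clock."""
    total_seconds = comments_per_day * seconds_per_proof
    return -(-total_seconds // 86_400)  # ceiling division by seconds per day
```

The asymmetry is the point: `solve` costs the sender thousands of hashes per comment, while `verify` costs the weblog one. `machines_needed(1_000_000, 2)` comes out to 24, matching the figure in the message.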
