April 20, 2014

attack of the context-sensitive blog spam?

I love spammers, really I do. Some of you may recall my earlier post here about freezing your credit report. In the past week, I’ve deleted two comments that were clearly spam and that made it through Freedom to Tinker’s Akismet filter. Both had generic, modestly complementary language and a link to some kind of credit card application processing site. What’s interesting about this? One of two things.

  1. Akismet is letting those spams through because their content is “related” to the post.
  2. Or more ominously, the spammer in question is trolling the blogosphere for “relevant” threads and is then inserting “relevant” comment spam.

If it’s the former, then one can certainly imagine that Akismet and other such filters will eventually improve to the point where the problem goes away (i.e., even if it’s “relevant” to a thread here, if it’s posted widely then it must be spam). If it’s the latter, then we’re in trouble. How is an automated spam catcher going to detect “relevant” spam that’s (statistically) on-topic with the discussion where it’s posted and is never posted anywhere else?
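The "posted widely" heuristic in the first case can be sketched roughly as follows. This is a minimal illustration, not a description of how Akismet works: the `SharedCommentIndex` class, the `max_sites` threshold, and the site names are all hypothetical. It normalizes each comment, hashes it, and flags a comment once the same fingerprint has been seen on several sites.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Normalize case, punctuation, and whitespace, then hash the result."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()


class SharedCommentIndex:
    """Hypothetical shared index mapping a fingerprint to the sites where it was seen."""

    def __init__(self, max_sites: int = 3):
        self.max_sites = max_sites
        self.sites_by_fingerprint = defaultdict(set)

    def record(self, site: str, comment: str) -> bool:
        """Record a sighting; return True once the comment looks widely posted."""
        seen = self.sites_by_fingerprint[fingerprint(comment)]
        seen.add(site)
        return len(seen) >= self.max_sites
```

Note that this only helps in the first case. A comment that is written for one thread and never posted anywhere else produces a fingerprint that is seen exactly once, which is why the second case is the worrying one.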

Comments

  1. Constance Reader says:

    I also received a lot of that spam. I’ve also received spam that consisted of long, rambling passages that a spam filter would likely interpret as straight prose text, like any comment, but in fact makes no sense whatsoever.

    I’ve also noticed that spam filters don’t deal with other languages well. I’d say that 25% of my spam is in Italian, 5% in Japanese, and for a short time I was getting spam in what may have been Hindi.

  2. Barry says:

    There’ll never be a fully automatic way to filter spam. Probably the best way I’ve seen of handling spam is the way some large community websites do it (such as Slashdot) – the well-established masses gain the ability to moderate comments, and can mod down spam. Of course, that doesn’t help for a smaller blog where not very many people post on a regular basis.

    Two other possibly helpful suggestions: One, ensure that links in blog posts always get the “nofollow” attribute attached to them, which diminishes their worth in search engine “optimization”.

    The other, require username registration, and then moderate said users to ensure that they at least try to make one on-topic spam-free post before letting their comments go straight to the blog post. It’s draconian, yes, so individual blog owners have to make the decision as to whether blocking blog spam is worth the effort on the part of their legitimate readers.
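As a rough illustration of the "nofollow" suggestion above, the sketch below rewrites a comment's HTML so that every link carries `rel="nofollow"`. It uses only Python's standard `html.parser` module; the `NofollowRewriter` class and `add_nofollow` function are names invented for this example, and a real blog engine would normally do this inside its own comment sanitizer. HTML comments and declarations in the input are simply dropped.

```python
from html import escape
from html.parser import HTMLParser


class NofollowRewriter(HTMLParser):
    """Re-emit comment HTML, forcing rel="nofollow" on every <a> tag."""

    def __init__(self):
        # Keep entities such as &amp; in text as-is instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _render(self, tag, attrs, closing=">"):
        if tag == "a":
            # Drop any rel value the commenter supplied, then add our own.
            attrs = [(k, v) for k, v in attrs if k != "rel"]
            attrs.append(("rel", "nofollow"))
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
        )
        return f"<{tag}{rendered}{closing}"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, closing=" />"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def add_nofollow(comment_html: str) -> str:
    rewriter = NofollowRewriter()
    rewriter.feed(comment_html)
    rewriter.close()
    return "".join(rewriter.out)
```

This only removes the search-ranking incentive. It does nothing about a spammer who wants human readers to click through.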

  3. Michael R. Bernstein says:

    Dan, that’s a hard problem because technically it isn’t spam anymore, it is astroturfing. There are some stopgap solutions that eventually force the bad actors to employ humans for this purpose, so ultimately what works are various forms of reputation currency.

  4. Arvind Narayanan says:

    Dan,

    When running from a bear, you don’t need to outrun the bear. You just need to outrun the slowest runner.

    Currently there are far easier methods to successfully spam blogs than content awareness (I wouldn’t call it context sensitivity!) For instance, see this post I made a while ago. I don’t see this changing very soon. So as long as you have some sort of spam filter set up, you’re probably fine.

  5. Walt Crawford says:

    Interesting. I’ve gotten quite a bit of that sort of spam–but Spam Karma 2 has trapped nearly all of it. (Yes, I do check: Spam Karma 2 seems to trap legit comments about once in a hundred times.) Maybe my settings are tougher, or maybe Spam Karma 2’s algorithms just work better than Akismet in this situation.

  6. David Robarts says:

    The “nofollow” suggestion is a good one, as it decreases the value of successful spamming to the spammer. I would expect that alone to make it unprofitable to set a real human to the task. If it is spam bots getting smarter, then the filters can be made smarter too; the cat-and-mouse game continues. Adding “nofollow” may not deter a bot, because the cost of spamming by bot is too low to care. It would be nice to be able to validate comments to remove the “nofollow” and spread legitimate “google juice.”

  7. Dan Wallach says:

    Freedom to Tinker, and most other blogs, already have the nofollow attribute set on outbound links. This spammer could well have been looking for keywords like “credit” and then inserting the “relevant” credit-card message. The spammer isn’t doing this for better search-engine cred, but rather for more site visitors.

    Arvind: I agree that “content awareness” is probably a better way to describe this particular attack than “context sensitivity.”

  8. Daniel Sandler says:

    Analogy to this situation: What if you receive a hand-written, personal letter—that happens to be trying to sell you something?

    I’m not given to sports metaphors, but in a sense I think we’ve “moved the chains” with respect to blog spam. The stuff that gets through our mechanical filters is of sufficient relevance that it becomes an editorial decision whether or not to republish it for everyone else to read.

    Arvind makes a good point (re: running from bears), but the joke doesn’t work if there’s more than one bear.

    In fact, there are so many people trying to make money in Internet advertising that bloggers are actually outrunning an enormous plague of bears that chokes the landscape. Some of them are willing to put in more effort than others, as we see here. Our goal is simply to outrun 99% of them, and have a polite discussion with the remaining 1%.

  9. Matt says:

    Obviously, the spammers have solved natural-language processing ;)

  10. Dan Wallach says:

    Case in point: please diagnose the above “topcreditcardsadvice” post. Spam?

  11. ignorance says:

    Consider that it’s both 1) and 2), sans humans.

    A spammer has sold services to promote some sort of service related to credit reports. They set up a Technorati or other blog search to find blog posts that contain the term ‘credit report’ and spam them.

    In the above case, the trackback spam links to a spam blog, each post populated with a few introduction phrases and part of the spammed post, which is easy to copy because it’s conveniently encoded in easy-to-parse XML. The spam-blog generates income from ads and direct services.

    If there is a text that a spam filter can trust, a spammer will use it, any and every way she can.

  12. Daniel Sandler says:

    Wallach: Uncanny.

    If you click through (sigh) and read the source URL, you can tell that it’s been pasted together from a template:

    $POST_AUTHOR wrote a fantastic post today on “$POST_TITLE”
    Here’s ONLY a quick extract
    $POST_BODY_SENTENCES[1][0:233] …
    To view the rest of this excellent post, you MUST go here

    All these fields are conveniently available in FTT’s RSS feed, saving the spammer the trouble of manually searching for the author’s name and so on.

    (Aside: That appears to be either TrackBack or Pingback spam, which means the entire process was automated: (1) Find some blog’s RSS feed, (2) extract the author, etc., (3) extract and post to your spam blog, causing a Pingback to be sent, (4) goto 1.)
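One defensive consequence of the template described above: because the excerpt is lifted verbatim from the feed, an incoming Pingback or TrackBack whose excerpt consists mostly of the blog's own words is suspect. The sketch below is a hedged illustration of that check using the standard `difflib` module. The function names and the 0.5 threshold are assumptions made for this example, not part of any blog platform.

```python
from difflib import SequenceMatcher


def copied_fraction(excerpt: str, post_body: str) -> float:
    """Fraction of the excerpt covered by its longest verbatim run from the post."""
    if not excerpt:
        return 0.0
    matcher = SequenceMatcher(None, excerpt, post_body, autojunk=False)
    match = matcher.find_longest_match(0, len(excerpt), 0, len(post_body))
    return match.size / len(excerpt)


def looks_like_scraped_excerpt(excerpt: str, post_body: str,
                               threshold: float = 0.5) -> bool:
    """Flag excerpts that are mostly one long copy of our own text."""
    return copied_fraction(excerpt, post_body) >= threshold
```

A legitimate blogger who quotes a long passage would also trip this check, so it is better treated as a signal for the moderation queue than as a verdict.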

  13. dsn says:

    At a certain point, if the spam becomes targeted enough and useful enough, doesn’t it cease to be spam?

  14. Miss Grundy says:

    I think you meant “complimentary,” not “complementary.” I, too, will compliment this blog, which is among my favorites.

  15. enigma_foundry says:

    Yes, I have had the same spam on my blog, which appears in Russian and German from time to time–but won’t those tests with the nearly illegible letters work?

  16. John says:

    Did you really mean “complementary” ?

    I can’t make any sense of your post.

  17. Tel says:

    Be suspicious of any link with “blogfeeds” in the URL

  18. Tel says:

    Try a Google search on “blogfeedsworld”: spam all over the place… they sure have some cheek with a comment like this one:

    We scour the musical blogosphere to locate interesting, quality blog posts worthy of your attention. Provided excerpts are taken 100% legally from syndicated feeds (e.g. google blogsearch) and a link to the original post is provided.

    If you are a blog owner and would not like to syndicate your content to sites like this one, then please delete your blog feed from your server.

    As far as I can see, the “semantic parsing” and “context sensitivity” are nothing more than running Google Blog Search on particular keywords and then slurping RSS out of the blogs that show up in the results. The RSS goes into both content creation for the bait site and Pingback links toward the bait site. Possibly a suitable copyright license might be able to stop them; possibly they might slip in under fair use.

    A sneaky way to track them would be putting magic codes into your RSS feed that lets you find the IP number of the slurper.
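The "magic codes" idea can be sketched roughly like this: each time the feed is served, a short token derived from the requester's IP address and the date is embedded in the item text. If that text later turns up on a spam blog, the token identifies which request it came from. Everything here is hypothetical (the `SECRET_KEY`, the `[ref:...]` marker format, and the function names), and it assumes the blog keeps a log of who fetched the feed and when.

```python
import hashlib
import hmac

# Assumption: a private random key known only to the blog's server.
SECRET_KEY = b"replace-with-a-private-random-key"


def feed_token(client_ip: str, day: str) -> str:
    """Derive a short, hard-to-forge token for one (IP, day) feed request."""
    msg = f"{client_ip}|{day}".encode("utf-8")
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()[:16]


def tag_item_text(text: str, client_ip: str, day: str) -> str:
    """Append the token to the feed item text served to this requester."""
    return f"{text} [ref:{feed_token(client_ip, day)}]"


def find_fetcher(scraped_text: str, access_log):
    """Return the (ip, day) log entry whose token appears in the scraped text.

    access_log is an iterable of (client_ip, day) pairs taken from the
    server's record of feed requests.
    """
    for ip, day in access_log:
        if feed_token(ip, day) in scraped_text:
            return ip, day
    return None
```

A scraper that strips the marker, or that fetches the feed through a third-party aggregator, defeats this. It would also put visible noise in front of legitimate feed readers unless the token is hidden in markup.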

  19. Xcott Craver says:

    One of my tongue-in-cheek 2007 predictions was that blog spam would cross the line into useful automated content.

    Obviously the next step for spammers is to make their posts not only natural-looking, but give them some genuine utility. To make them either entertaining, or in some way helpful. After all, spam-blocking happens because (1) we can detect it and (2) we despise it. If you can’t beat (1), there’s always (2).

    If’n I was a spammer, and I was scanning blog entries for topics, maybe my next step would be to identify separate blog entries on the same topic, and post a comment in your blog pointing to other entries I also found when scouring the IntarBlogs. Or if I could extract any other useful results or statistics from my parsing, I could post them. Think something like a search engine, but with home delivery.

    Then one day, you’ll post a knock-knock joke on your blog, and get 30 more knock-knock jokes in response, each sponsored by some shady pagerank placement company.

  20. Dan Wallach says:

    For the record, since my last post I’ve deleted seven other “relevant” spams from this thread and none from any of my other Freedom to Tinker threads. I’ll take that as strong evidence for the existence of content-aware spam.

  21. David Harmon says:

    Xcott: Indeed — if you consider the Internet as an ecology, this would represent the earliest steps of a transition from parasitism to symbiosis. Of course, they’re already trying to look useful (as above, and StarWare comes to mind as well.) They’ve been trying to look entertaining, since the first Madonna trojan (for the Apple II) — but then, there’s a lower standard for that attack! (How many of you have had to tell someone some version of “THERE IS NO NAKED BRITNEY VIDEO, so stop bypassing your antivirus software!”)

  22. Anders says:

    Rather than just looking at content, what about a CAPTCHA system that doesn’t require users to type anything, such as http://www.JustHumans.com/