December 5, 2024

Too Much Spam, Not Enough Identification

Lots of good stuff yesterday at the Meltdown conference. Rather than summarize it all, let me give you two random observations about the discussion.

The security session descended into a series of rants about the evil of spam. Lately this seems to happen often in conference panels about security. This strikes me as odd, since spam is far from the worst security problem we face online. Don’t get me wrong; spam annoys me, just like everybody else. But I don’t think we’ll make much progress on the spam problem until we get a handle on more fundamental problems, such as how to protect ordinary machines from hijacking, and how to produce higher-quality commercial software.

Another interesting feature, noted by Michael Froomkin, was the central role of identification technologies in the day’s discussions, both in diagnoses of Internet policy problems, and in proposed solutions. When the topic was spam, people liked technologies that identify message senders; but on other topics, identification was considered harmful. I hope to see more discussion about identification at the conference. (I’ll have another posting on online identification later this week.)

[Susan Crawford has an interesting summary of yesterday’s discussion. She says I was “wise in the hallways”, whatever that means.]

Victims of Spam Filtering

Eric Rescorla wrote recently about three people who must have lots of trouble getting their email through spam filters: Jose Viagra, Julia Cialis, and Josh Ambien. I feel especially sorry for poor Jose, who through no fault of his own must get nothing but smirks whenever he says his name.

Anyway, this reminded me of an interesting problem with Bayesian spam filters: they’re trained by the bad guys.

[Background: A Bayesian spam filter uses human advice to learn how to recognize spam. A human classifies messages into spam and non-spam. The Bayesian filter assigns a score to each word, depending on how often that word appears in spam vs. non-spam messages. Newly arrived messages are then classified based on the scores of the words they contain. Words used mostly in spam, such as “Viagra”, get negative scores, so messages containing them tend to get classified as spam. Which is good, unless your name is Jose Viagra.]
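Here is a minimal sketch, in Python, of the word-scoring idea described above. It is not any real filter's code: the training messages, the add-one smoothing, and the log-odds scoring are all illustrative, and the sign convention follows the text (negative scores lean toward spam).

```python
# A toy illustration of Bayesian word scoring, not any real filter's code.
# Training messages and the smoothing constant k are made up; negative
# scores lean toward spam, matching the sign convention in the text above.
from collections import Counter
import math

def tokenize(message):
    return message.lower().split()

def train(labeled_messages):
    """Count word occurrences separately for spam and non-spam ("ham")."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in labeled_messages:
        (spam_counts if is_spam else ham_counts).update(tokenize(text))
    return spam_counts, ham_counts

def word_score(word, spam_counts, ham_counts, k=1.0):
    """Positive means the word leans legitimate; negative means it leans spam."""
    p_spam = (spam_counts[word] + k) / (sum(spam_counts.values()) + 2 * k)
    p_ham  = (ham_counts[word] + k) / (sum(ham_counts.values()) + 2 * k)
    return math.log(p_ham / p_spam)

def looks_like_spam(message, spam_counts, ham_counts):
    """Sum the word scores; a net-negative message is classified as spam."""
    total = sum(word_score(w, spam_counts, ham_counts) for w in tokenize(message))
    return total < 0

# Hand-labeled training data, as a human user would supply it.
training = [
    ("cheap viagra order now", True),
    ("viagra without prescription", True),
    ("lunch meeting tomorrow at noon", False),
    ("draft of the paper attached", False),
]
spam_counts, ham_counts = train(training)
# Poor Jose: on this toy training set, his innocent note gets flagged.
print(looks_like_spam("hi this is jose viagra about the meeting",
                      spam_counts, ham_counts))
```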

Many spammers have taken to lacing their messages with sections of “word salad” containing meaningless strings of innocuous-looking words, in the hopes that the word salad will trigger positive associations in the recipient’s Bayesian filter.

Now suppose a big spammer wanted to poison a particular word, so that messages containing that word would be (mis)classified as spam. The spammer could sprinkle the target word throughout the word salad in his outgoing spam messages. When users classified those messages as spam, the targeted word would develop a negative score in the users’ Bayesian spam filters. Later, messages with the targeted word would likely be mistaken for spam.

This attack could even be carried out against a particular targeted user. By feeding that user a steady diet of spam (or pseudo-spam) containing the target word, a malicious person could build up a highly negative score for that word in the targeted user’s filter.
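Here is the same kind of toy scorer being poisoned, as a sketch of the attack. Everything is hypothetical: the messages, the volume of poisoned spam, and the target word (borrowing "Fahrenheit" from the example below). The point is only that a word the victim's filter had never associated with spam can be driven to a strongly negative score.

```python
# A toy sketch of word-score poisoning, using the same conventions as above
# (negative score = spam-like). All messages and counts are hypothetical.
from collections import Counter
import math

spam_counts, ham_counts = Counter(), Counter()

def learn(message, is_spam):
    (spam_counts if is_spam else ham_counts).update(message.lower().split())

def word_score(word, k=1.0):
    """Positive leans legitimate, negative leans spam."""
    p_spam = (spam_counts[word] + k) / (sum(spam_counts.values()) + 2 * k)
    p_ham  = (ham_counts[word] + k) / (sum(ham_counts.values()) + 2 * k)
    return math.log(p_ham / p_spam)

# The victim's filter first learns from ordinary mail, none of it mentioning
# the target word, so the word starts out neutral.
for _ in range(50):
    learn("please review the attached project schedule and budget", False)
    learn("cheap meds click here for the lowest prices", True)
print("before attack:", round(word_score("fahrenheit"), 2))   # about 0.0

# The spammer then laces his word salad with the target word. The user
# (correctly) classifies every one of these messages as spam, and the
# filter learns that the target word is spam-like.
for _ in range(50):
    learn("cheap meds click here fahrenheit special offer fahrenheit", True)
print("after attack:", round(word_score("fahrenheit"), 2))    # strongly negative
```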

Of course, this won’t work, or will be less effective, for words that have appeared frequently in a user’s legitimate messages in the past. But it might work for a word that is about to become more frequent, such as the name of a person in the news, or a political party. For example, somebody could have tried to poison “Fahrenheit” just before Michael Moore’s movie was released, or “Whitewater” in the early days of the Clinton administration.

There is a general lesson here about the use of learning methods in security. Learning is attractive, because it can adapt to the bad guys’ behavior. But the fact that the bad guys are teaching the system how to behave can also be a serious drawback.

FTC: Do-Not-Email List Won't Help

Yesterday the Federal Trade Commission released its recommendation to Congress regarding the proposed national Do Not Email list. The Commission recommended against creating such a list at this time, because it would provide little or no reduction in spam, but would increase costs for legitimate emailers and might raise security risks.

Congress, in the CAN-SPAM Act, asked the FTC to study the feasibility of instituting a national Do Not Email list, akin to the popular Do Not Call list. Yesterday’s FTC recommendation is the result of the FTC’s study.

The FTC relied on interviews with many people, and it retained three security experts – Matt Bishop, Avi Rubin, and me – to provide separate reports on the technical issues regarding the Do Not Email list. My report supported the action that the FTC ultimately took, and I assume that the other two reports did too.

I understand that the three expert reports will be released by the FTC, but I haven’t found them on the FTC website yet. I’ll post a link to my report when I find one.

New Survey of Spam Trends

The Pew Internet & American Life Project has released results of a new survey of experiences with email spam.

The report’s headline is “The CAN-SPAM Act Has Not Helped Most Email Users So Far”, and the press articles I have seen so far follow that interpretation. But the headline isn’t actually supported by the data. Taken at face value, the data show that the amount of spam has not changed since January 1, when the CAN-SPAM Act took effect.

If true, this is actually good news, since the amount of spam had been increasing previously; for example, according to Brightmail, spam had grown from 7% of all email in April 2001, to 50% in September 2003. If the CAN-SPAM Act put the brakes on that increase, it has been very effective indeed.

Of course, the survey demonstrates only correlation, not causality. The level of spam may be steady, but there is nothing in the survey to suggest that CAN-SPAM is the reason.

An alternative explanation is hiding in the survey results: fewer people may be buying spammers’ products. Five percent of users reported having bought a product or service advertised in spam. That’s down from seven percent in June 2003. Nine percent reported having responded to a spam and later discovered it was phony or fraudulent; that’s down from twelve percent in June 2003.

And note that the survey asked whether the respondent had ever done these things, so the decrease in the recent rates is even more dramatic than those numbers suggest. To see why, imagine a group of 200 people who responded to the latest survey. Suppose that 100 of them are Recent Adopters, who started using the Internet after June 2003, and that the other 100 are Longtime Users who went online before June 2003. According to the previous survey, seven of the Longtime Users (i.e., 7%) had bought from a spammer by June 2003; and according to the latest survey, only ten of our overall group of 200 users (i.e., 5%) have ever bought from a spammer. It follows that only three of the remaining 193 hypothetical users, the ones who hadn’t bought before June 2003, have bought from a spammer since then, so spammers are finding far fewer new buyers than before.
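Here is the same back-of-the-envelope arithmetic spelled out as a quick sketch; the 200-person group, the even 100/100 split, and the rounding are hypothetical, as in the paragraph above.

```python
# Back-of-the-envelope check of the hypothetical 200-person group above.
longtime_users = 100      # online before June 2003
recent_adopters = 100     # started using the Internet after June 2003
total = longtime_users + recent_adopters

bought_by_june_2003 = round(0.07 * longtime_users)   # 7% per the earlier survey
ever_bought = round(0.05 * total)                    # 5% per the latest survey

bought_since = ever_bought - bought_by_june_2003     # 10 - 7 = 3
not_yet_buyers = total - bought_by_june_2003         # the 193 still "available"

print(f"{bought_since} of {not_yet_buyers} users first bought from a spammer "
      f"after June 2003 (about {100 * bought_since / not_yet_buyers:.1f}%)")
```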

A caveat is in order here. The survey’s margin of error is three percent, so we can’t be certain there’s a real trend here. But still, it’s much more likely than not that the number of responders really has decreased.

Spammers Concerned by CAN-SPAM?

Alan Ralsky, one of the biggest spammers, thinks the new CAN-SPAM Act will hinder his spamming business, according to Saul Hansell’s story in today’s New York Times. Naturally, everything this guy says should be viewed skeptically, but the article is interesting nonetheless.

Mr. Ralsky talks a lot about himself in the article, and a revealing picture emerges. He has constructed a (rationalized) view of himself as a legitimate businessman who has been forced by those nasty antispam technologies to resort to practices like operating underground, forging mail headers, using open relays, and so on. Now the CAN-SPAM Act will ban some of those practices – and he wants us to feel sorry for him!

Mr. Ralsky also claims that he has been inactive (i.e., not spamming) for the past few weeks. I’ve been remarking to people for the last couple of weeks that there seems to be less spam than there was before. I almost wrote a blog entry asking all of you whether you had seen the same thing. Is it just the holiday season? Or was this one guy responsible for that much of my incoming spam?

Mr. Ralsky says he will soldier on, continuing to spam while complying with the new law. But he worries that his compliance will make it easier for people to filter out his messages. Let’s hope so.