November 22, 2024

Can Google Flu Trends Be Manipulated?

Last week researchers from Google and the Centers for Disease Control unveiled a cool new research result, showing that they could gauge the level of influenza infections in a region of the U.S. by seeing how often people in those regions did Google searches for certain terms related to the flu and flu symptoms. The search-based predictions correlate remarkably well with the medical data on flu rates — not everyone who searches for “cough medicine” has the flu, but enough do that an increase in flu cases correlates with an increase in searches for “cough medicine” and similar terms. The system is called Google Flu Trends.

Privacy groups have complained, but this use of search data seems benign — indeed, this level of flu detection requires only that search data be recorded per region, not per individual user. The legitimate privacy worry here is not about the flu project as it stands today but about other uses that Google or the government might find for search data later.

My concern today is whether Flu Trends can be manipulated. The system makes inferences from how people search, but people can change their search behavior. What if a person or a small group set out to convince Flu Trends that there was a flu outbreak this week?

An obvious approach would be for the conspirators to do lots of searches for likely flu-related terms, to inflate the count of flu-related searches. If all the searches came from a few computers, Flu Trends could presumably detect the anomalous pattern and the algorithm could reduce the influence of these few computers. Perhaps this is already being done; but I don’t think the research paper mentions it.

A more effective approach to spoofing Flu Trends would be to use a botnet — a large collection of hijacked computers — to send flu-related searches to Google from a larger number of computers. If the added searches were diffuse and well-randomized, they would be very hard to distinguish from legitimate searches, and the Flu Trends would probably be fooled.

This possibility is not discussed in the Flu Trends research paper. The paper conspicuously fails to identify any of the search terms that the system is looking for. Normally a paper would list the terms, or at least give examples, but none of the terms appear in the paper, and the Flu Trends web site gives only “flu” as an example search term. They might be withholding the search terms to make manipulation harder, but more likely they’re withholding the search terms for business reasons, perhaps because the terms have value in placing or selling ads.

Why would anyone want to manipulate Flu Trends? If flu rates affect the financial markets by moving the stock prices of certain drug or healthcare companies, then a manipulator can profit by sending false signals about flu rates.

The most interesting question about Flu Trends, though, is what other trends might be identifiable via search terms. Government might use similar methods to look for outbreaks of more virulent diseases, and businesses might look for cultural trends. In all of these cases, manipulation will be a risk.

There’s an interesting analogy to web linking behavior. When the web was young, people put links in their sites to point readers to other interesting sites. But when Google started inferring sites’ importance from their incoming links, manipulators started creating links for their Google-effect. The result was an ongoing cat-and-mouse game between search engines and manipulators. The more search behavior takes on commercial value, the more manipulators will want to change search behavior for commercial or cultural advantage.

Anything that is valuable to measure is probably, to someone, valuable to manipulate.

Election 2008: What Might Go Wrong

Tomorrow, as everyone knows, is Election Day in the U.S. With all the controversy over electronic voting, and the anticipated high turnout, what can we expect to see? What problems might be looming? Here are my predictions.

Long lines to vote: Polling places will be strained by the number of voters. In some places the wait will be long – especially where voting requires the use of machines. Many voters will be willing and able to wait, but some will have to leave without casting votes. Polls will be kept open late, and results will be reported later than expected, because of long lines.

Registration problems: Quite a few voters will arrive at the polling place to find that they are not on the voter rolls, because of official error, or problems with voter registration databases, or simply because the voter went to the wrong polling place. New voters will be especially likely to have such problems. Voters who think they should be on the rolls in a polling place can file provisional ballots there. Afterward, officials must judge whether each provisional voter was in fact eligible, a time-consuming process which, given the relative flood of provisional ballots, will strain official resources.

Voting machine problems: Electronic voting machines will fail somewhere. This is virtually inevitable, given the sheer number of machines and polling places, the variety of voting machines, and the often poor reliability and security engineering of the machines. If we’re lucky, the problems can be addressed using a paper trail or other records. If not, we’ll have a mess on our hands.

How serious the mess might be depends on how close the election is. If the margin of victory is large, as some polls suggest it may be, then it will be easy to write off problems as “minor” and move on to the next stage in our collective political life. If the election is close, we could see a big fight. The worse case is an ultra-close election like in 2000, with long lines, provisional ballots, or voting machine failures putting the outcome in doubt.

Regardless of what happens on Election Day, the next day — Wednesday, November 5 — will be a good time to get started on improving the next election. We have made some progress since 2004 and 2006. If we keep working, our future elections can be better and safer than this one.

Why is printing so hard?

Recently I bought a mildly used laser printer and wanted to set it up on my home network. In a better world, this would be a trivial exercise — just connect the printer to the network and let the computers discover it. In the actual world, it was a forty-five minute project that only a reasonably handy network jockey could have hoped to complete. (If you care about what exactly I had to do, see below.)

John Hartman says, “Printing is the hardest problem in computer science.” It often seems that way. But why?

Plug-and-play printing seems pretty simple, compared to many of the things that computers do routinely without trouble. Granted, it’s not trivial to get the full variety of printers to work with the full variety of computers, but our collective failure to do so is — or should be — surprising.

There must be some lesson here about engineering, or human nature, or something. Lately I’ve gone around asking people why printing is so hard. I’ve gotten some interesting answers, but I don’t think I really understand the issue yet.

What do you think? Why is printing so hard?

[For the record, here’s what I had to do to get our newly acquired HP LaserJet 2200DN printer working on our home network: I plugged the printer in to our network, but the Windows PCs couldn’t auto-discover the printer. I Googled the printer’s user manual, which said the printer had a built-in webserver. But I didn’t know the printer’s IP address, so I had to log in to our router and look at its DHCP tables. Knowing the IP address, I could connect to the printer’s webserver, which had a page telling me what URL to use for IPP printing. (I had to know what IPP was.) After that, I assigned the printer a static IP address, so the IPP URL (containing an IP address) would keep working across reboots. Now that I had a stable IPP URL, I could set up the PCs for printing. Finally, I had to guess which of driver to use on Windows — two drivers were offered, with no advice about which one to use, but only one of the offered drivers supports duplex printing. Total elapsed time: about 45 minutes.]

How Yahoo could have protected Palin's email

Last week I criticized Yahoo for their insecure password recovery mechanism that allowed an intruder to take control of Sarah Palin’s email account. Several readers asked me the obvious follow-up question: What should Yahoo have done instead?

Before we discuss alternatives, let’s take a minute to appreciate the delicate balance involved in designing a password recovery mechanism for a free, mass-market web service. On the one hand, users lose their passwords all the time; they generally refuse to take precautions in advance against a lost password; and they won’t accept being locked out of their own accounts because of a lost password. On the other hand, password recovery is an obvious vector for attack — and one exploited at large scale, every day, by spammers and other troublemakers.

Password recovery is especially challenging for email accounts. A common approach to password recovery is to email a new password (or a unique recovery URL) to the user, which works nicely if the user has a stable email address outside the service — but there’s no point in sending email to a user who has lost the password to his only email account.

Still, Yahoo could be doing more to protect their users’ passwords. They could allow users to make up their own security questions, rather than offering only a fixed set of questions. They could warn users that security questions are a security risk and that users with stable external email addresses might be better off disabling the security-question functionality and relying instead on email for password recovery.

Yahoo could also have followed Gmail’s lead, and disabled the security-question mechanism unless no logged-in user had accessed the account for five days. This clever trick prevents password “recovery” when there is evidence that somebody who knows the password is actively using the account. If the legitimate user loses the password and doesn’t have an alternative email account, he has to wait five days before recovering the password, but this seems like a small price to pay for the extra security.

Finally, Yahoo would have been wise, at least from a public-relations standpoint, to give extra protection to high-profile accounts like Palin’s. The existence of these accounts, and even the email addresses, had already been published online. And the account signup at Yahoo asks for a name and postal code so Yahoo could have recognized that this suddenly-important public figure had an account on their system. (It seems unlikely that Palin gave a false name or postal code in signing up for the account.) Given the public allegations that Palin had used her Yahoo email accounts for state business, these accounts would have been obvious targets for freelance “investigators”.

Some commenters on my previous post argued that all of this is Palin’s fault for using a Yahoo mail account for Alaska state business. As I understand it, the breached account included some state business emails along with some private email. I’ll agree that it was unwise for Palin to put official state email into a Yahoo account, for security reasons alone, not to mention the state rules or laws against doing so. But this doesn’t justify the break-in, and I think anyone would agree that it doesn’t justify publishing non-incriminating private emails taken from the account.

Indeed, the feeding frenzy to grab and publish private material from the account, after the intruder had published the password, is perhaps the ugliest aspect of the whole incident. I don’t know how many people participated — and I’m glad that at least one Good Samaritan tried to re-lock the account — but I hope the republishers get at least a scary visit from the FBI. It looks like the FBI is closing in on the initial intruder. I assume he is facing a bigger punishment.

Palin's email breached through weak Yahoo password recovery mechanism

This week’s breach of Sarah Palin’s Yahoo Mail account has been much discussed. One aspect that has gotten less attention is how the breach occurred, and what it tells us about security and online behavior.

(My understanding of the facts is based on press stories, and on reading a forum post written by somebody claiming to be the perpetrator. I’m assuming the accuracy of the forum post, so take this with an appropriate grain of salt.)

The attacker apparently got access to the account by using Yahoo’s password reset mechanism, that is, by following the same steps Palin would have followed had she forgotten her own password.

Yahoo’s password reset mechanism is surprisingly weak and easily attacked. To simulate the attack on Palin, I performed the same “attack” on a friend’s account (with the friend’s permission, of course). As far as I know, I followed the same steps that the Palin attacker did.

First, I went to Yahoo’s web site and said I had forgotten my password. It asked me to enter my email address. I entered my friend’s address. It then gave me the option of emailing a new password to my friend’s alternate email address, or doing an immediate password reset on the site. I chose the latter. Yahoo then prompted me with my friend’s security question, which my friend had previously chosen from a list of questions provided by Yahoo. It took me six guesses to get the right answer. Next, Yahoo asked me to confirm my friend’s country of residence and zip code — it displayed the correct values, and I just had to confirm that they were correct. That’s all! The next step had me enter a new password for my friend’s account, which would have allowed me to access the account at will.

The only real security mechanism here is the security question, and it’s often easy to guess the right answer, especially given several tries. Reportedly, Palin’s question was “Where did you meet your spouse?” and the correct answer was “Wasilla high”. Wikipedia says that Palin attended Wasilla High School and met her husband-to-be in high school, so “Wasilla high” is an easy guess.

This attack was not exactly rocket science. Contrary to some news reports, the attacker did not display any particular technical prowess, though he did display stupidity, ethical blindness, and disrespect for the law — for which he will presumably be punished.

Password recovery is often the weakest link in password-based security, but it’s still surprising that Yahoo’s recovery scheme was so weak. In Yahoo’s defense, it’s hard to verify that somebody is really the original account holder when you don’t have much information about who the original account holder is. It’s not like Sarah Palin registered for the email account by showing up at a Yahoo office with three forms of ID. All Yahoo knows is that the original account holder claimed to have the name Sarah Palin, claimed to have been born on a particular date and to live in a particular zip code, and claimed to have met his/her spouse at “Wasilla high”. Since this information was all in the public record, Yahoo really had no way to be sure who the account holder was — so it might have seemed reasonable to give access to somebody who showed up later claiming to have the same name, email address, and spouse-meeting place.

Still, we shouldn’t let Yahoo off the hook completely. Millions of Yahoo customers who are not security experts (or are security experts but want to delegate security decisions to someone else) entrusted the security of their email accounts to Yahoo on the assumption that Yahoo would provide reasonable security. Palin probably made this assumption, and Yahoo let her down.

If there’s a silver lining in this ugly incident, it is the possibility that Yahoo and other sites will rethink their password recovery mechanisms, and that users will think more carefully about the risk of email breaches.