Last week researchers from Google and the Centers for Disease Control unveiled a cool new research result, showing that they could gauge the level of influenza infections in a region of the U.S. by seeing how often people in those regions did Google searches for certain terms related to the flu and flu symptoms. The search-based predictions correlate remarkably well with the medical data on flu rates — not everyone who searches for “cough medicine” has the flu, but enough do that an increase in flu cases correlates with an increase in searches for “cough medicine” and similar terms. The system is called Google Flu Trends.
Privacy groups have complained, but this use of search data seems benign — indeed, this level of flu detection requires only that search data be recorded per region, not per individual user. The legitimate privacy worry here is not about the flu project as it stands today but about other uses that Google or the government might find for search data later.
My concern today is whether Flu Trends can be manipulated. The system makes inferences from how people search, but people can change their search behavior. What if a person or a small group set out to convince Flu Trends that there was a flu outbreak this week?
An obvious approach would be for the conspirators to do lots of searches for likely flu-related terms, to inflate the count of flu-related searches. If all the searches came from a few computers, Flu Trends could presumably detect the anomalous pattern and the algorithm could reduce the influence of these few computers. Perhaps this is already being done; but I don’t think the research paper mentions it.
A more effective approach to spoofing Flu Trends would be to use a botnet — a large collection of hijacked computers — to send flu-related searches to Google from a larger number of computers. If the added searches were diffuse and well-randomized, they would be very hard to distinguish from legitimate searches, and the Flu Trends would probably be fooled.
This possibility is not discussed in the Flu Trends research paper. The paper conspicuously fails to identify any of the search terms that the system is looking for. Normally a paper would list the terms, or at least give examples, but none of the terms appear in the paper, and the Flu Trends web site gives only “flu” as an example search term. They might be withholding the search terms to make manipulation harder, but more likely they’re withholding the search terms for business reasons, perhaps because the terms have value in placing or selling ads.
Why would anyone want to manipulate Flu Trends? If flu rates affect the financial markets by moving the stock prices of certain drug or healthcare companies, then a manipulator can profit by sending false signals about flu rates.
The most interesting question about Flu Trends, though, is what other trends might be identifiable via search terms. Government might use similar methods to look for outbreaks of more virulent diseases, and businesses might look for cultural trends. In all of these cases, manipulation will be a risk.
There’s an interesting analogy to web linking behavior. When the web was young, people put links in their sites to point readers to other interesting sites. But when Google started inferring sites’ importance from their incoming links, manipulators started creating links for their Google-effect. The result was an ongoing cat-and-mouse game between search engines and manipulators. The more search behavior takes on commercial value, the more manipulators will want to change search behavior for commercial or cultural advantage.
Anything that is valuable to measure is probably, to someone, valuable to manipulate.