November 21, 2024

Archives for November 2008

Economic Growth, Censorship, and Search Engines

Economic growth depends on an ability to access relevant information. Although censorship prevents access to certain information, the direct consequences of censorship are well-known and somewhat predictable. For example, blocking access to Falun Gong literature is unlikely to harm a country’s consumer electronics industry. On the web, however, information of all types is interconnected. Blocking a web page might have an indirect impact reaching well beyond that page’s contents. To understand this impact, let’s consider how search results are affected by censorship.

Search engines keep track of what’s available on the web and suggest useful pages to users. No comprehensive list of web pages exists, so search providers check known pages for links to unknown neighbors. If a government blocks a page, all links from the page to its neighbors are lost. Unless detours exist to the page’s unknown neighbors, those neighbors become unreachable and remain unknown. These unknown pages can’t appear in search results — even if their contents are uncontroversial.
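To make the discovery problem concrete, here is a minimal sketch of link-following discovery as a breadth-first crawl. The page names, the `LINKS` map, and the `discover` function are all invented for illustration; real crawlers are far more elaborate.

```python
from collections import deque

# Hypothetical toy web: each page maps to the pages it links to.
LINKS = {
    "seed": ["news", "blog"],
    "news": ["forum"],
    "blog": ["forum", "gadgets"],
    "forum": ["recipes"],
    "gadgets": [],
    "recipes": [],
}

def discover(links, seeds, blocked=frozenset()):
    """Breadth-first crawl: return the pages reachable without fetching a blocked page."""
    queue = deque(s for s in seeds if s not in blocked)
    known = set(queue)
    while queue:
        page = queue.popleft()
        # A blocked page is never fetched, so its outgoing links are never seen.
        for neighbor in links.get(page, []):
            if neighbor not in known and neighbor not in blocked:
                known.add(neighbor)
                queue.append(neighbor)
    return known

print(sorted(discover(LINKS, ["seed"])))
print(sorted(discover(LINKS, ["seed"], blocked={"forum"})))
print(sorted(discover(LINKS, ["seed"], blocked={"blog"})))
```

In this toy web, blocking "forum" also hides "recipes", which is reachable only through it. Blocking "blog" hides "gadgets" but not "forum", because "news" provides a detour.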

When presented with a query, search engines respond with relevant known pages sorted by expected usefulness. Censorship also affects this sorting process. In predicting usefulness, search engines consider both the contents of pages and the links between pages. Links here are like friendships in a stereotypical high school popularity contest: the more popular friends you have, the more popular you become. If your friend moves away, you become less popular, which makes your friends less popular by association, and so on. Even people you’ve never met might be affected.

“Popular” web pages tend to appear higher in search results. Censoring a page distorts this popularity contest and can change the order of even unrelated results. As more pages are blocked, the censored view of the web becomes increasingly distorted. As an aside, Ed notes that blocking a page removes more than just the offending material. If censors block Ed’s site due to an off-hand comment on Falun Gong, he also loses any influence he has on information security.
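The popularity contest can be sketched with a simplified PageRank-style score computed by repeated averaging over links. This is an illustration under my own assumptions (the toy `WEB` graph, the `rank` and `censor` helpers, and the 0.85 damping value are all hypothetical choices), not a description of how any real search engine ranks pages.

```python
def rank(links, damping=0.85, iterations=100):
    """Simplified PageRank-style popularity score via power iteration."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Pages with no outgoing links spread their score evenly over all pages.
        dangling = sum(score[p] for p in pages if not links.get(p))
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes an equal share of its score to each page it links to.
        for page, targets in links.items():
            for target in targets:
                new[target] += damping * score[page] / len(targets)
        score = new
    return score

def censor(links, blocked):
    """Drop a blocked page together with every link to or from it."""
    return {p: [t for t in ts if t != blocked] for p, ts in links.items() if p != blocked}

# Hypothetical toy web: "x" is the page that will be censored.
WEB = {"s1": ["x"], "s2": ["x"], "s3": ["b"], "x": ["a"], "a": ["s1"], "b": ["s1"]}

before = rank(WEB)
after = rank(censor(WEB, "x"))
print("a outranks b before censorship:", before["a"] > before["b"])
print("a outranks b after censorship: ", after["a"] > after["b"])
```

Neither "a" nor "b" is blocked here, yet their relative order flips once "x" disappears, because "a" owed its standing to a link from the censored page.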

These effects would typically be rare and have a disproportionately small impact on popular pages. Google’s emphasis on the long tail, however, suggests that considerable value lies in providing high-quality results covering even less-popular pages. To avoid these issues, a government could allow limited individuals full web access to develop tools like search engines. This approach seems likely to stifle competition and innovation.

Countries with greater censorship might produce lower-quality search engines, but Google, Yahoo, Microsoft, and others can provide high-quality search results in those countries. These companies can access uncensored data, mitigating the indirect effects of censorship. This emphasizes the significance of measures like the Global Network Initiative, whose participants include Google, Yahoo, and Microsoft. Among other things, the initiative provides guidelines for participants regarding when and how information access may be restricted. The effectiveness of this specific initiative remains to be seen, but such measures may provide leading search engines with greater leverage to resist arbitrary censorship.

Search engines are unlikely to be the only tools adversely impacted by the indirect effects of censorship. Any tool that relies on links between information (think social networks) might be affected, and repressive states place themselves at a competitive disadvantage in developing these tools. Future developments might make these points moot: in a recent talk at the Center, Ethan Zuckerman mentioned tricks and trends that might make censorship more difficult. In the meantime, however, governments that censor information may increasingly find that they do so at their own expense.

Does Your House Need a Tail?

Thus far, the debate over broadband deployment has generally been between those who believe that private telecom incumbents should be in charge of planning, financing and building next-generation broadband infrastructure, and those who advocate a larger role for government in the deployment of broadband infrastructure. Proposals of the latter kind include municipally owned networks and a variety of subsidies and mandates at the federal level for incumbents to deploy faster broadband.

Tim Wu and Derek Slater have a great new paper out that approaches the problem from a different perspective: that broadband deployments could be planned and financed not by government or private industry, but by consumers themselves. That might sound like a crazy idea at first blush, but Wu and Slater do a great job of explaining how it might work. The key idea is “condominium fiber,” an arrangement in which a number of neighboring households pool their resources to install fiber to all the homes in their neighborhoods. Once constructed, each home would own its own fiber strand, while the shared costs of maintaining the “trunk” cable from the individual homes to a central switching location would be managed in the same way that condominium and homeowners’ associations currently manage the shared areas of condos and gated communities. Indeed, in many cases the developer of a new condominium tower or planned community could lay fiber along with water and power lines, and the fiber would be just one of the shared resources that would be managed collectively by the homeowners.

If that sounds strange, it’s important to remember that there are plenty of examples where things that were formerly rented became owned. For example, fifty years ago in the United States no one owned a telephone. The phone was owned by Ma Bell and if yours broke they’d come and install a new one. But that changed, and now people own their phones and the wiring inside their homes, with the phone company owning the cable outside the home. One way to think about Slater and Wu’s “homes with tails” concept is that it’s just shifting that line of demarcation again. Under their proposal, you’d own the wiring inside your home and the line from you to your broadband provider.

Why would someone want to do such a thing? The biggest advantage, from my perspective, is that it could solve the thorny problem of limited competition in the “last mile” of broadband deployment. Right now, most customers have two options for high-speed Internet access. Getting more options using the traditional, centralized investment model is going to be extremely difficult because it costs a lot to deploy new infrastructure all the way to customers’ homes. But if customers “brought their own” fiber, then the barrier to entry would be much lower. New providers would simply need to bring a single strand of fiber to a neighborhood’s centralized point of presence in order to offer service to all customers in that neighborhood. So it would be much easier to imagine a world in which customers had numerous options to choose from.

The challenge is solving the chicken-and-egg problem: customer-owned fiber won’t be attractive until there are several providers to choose from, but it doesn’t make sense for new firms to enter this market until there are a significant number of neighborhoods with customer-owned fiber. Wu and Slater suggest several ways this chicken-and-egg problem might be overcome, but I think it will remain a formidable challenge. My guess is that at least at the outset, the customer-owned model will work best in new residential construction projects, where the costs of deploying fiber will be very low (because they’ll already be digging trenches for power and water).

But the beauty of their model is that unlike a lot of other plans to encourage broadband deployment, this isn’t an all-or-nothing choice. We don’t have to convince an entire nation, state, or even city to sign onto a concept like this. All you need is a neighborhood with a few dozen early-adopting consumers and an ISP willing to serve them. Virtually every cutting-edge technology is taken up by a small number of early adopters (who pay high prices for the privilege of being the first with a new technology) before it spreads to the general public, and the same model is likely to apply to customer-owned fiber. If the concept is viable, someone will figure out how to make it work, and their example will be duplicated elsewhere. So I don’t know if customer-owned fiber is the wave of the future, but I do hope that people start experimenting with it.

You can check out their paper here. You can also check out an article I wrote for Ars Technica this summer that is based on conversations with Slater, Wu, and other pioneers in this area.

Discerning Voter Intent in the Minnesota Recount

Minnesota election officials are hand-counting millions of ballots, as they perform a full recount in the ultra-close Senate race between Norm Coleman and Al Franken. Minnesota Public Radio offers a fascinating gallery of ballots that generated disputes about voter intent.

A good example is this one:

A scanning machine would see the Coleman and Franken bubbles both filled, and call this ballot an overvote. But this might be a Franken vote, if the voter filled in both slots by mistake, then wrote “No” next to Coleman’s name.

Other cases are more difficult, like this one:

Do we call this an overvote, because two bubbles are filled? Or do we give the vote to Coleman, because his bubble was filled in more completely?

Then there’s this ballot, which is destined to be famous if the recount descends into litigation:

[Insert your own joke here.]

This one raises yet another issue:

Here the problem is the fingerprint on the ballot. Election laws prohibit voters from putting distinguishing marks on their ballots, and marked ballots are declared invalid, for good reason: uniquely marked ballots can be identified later, allowing a criminal to pay the voter for voting “correctly” or punish him for voting “incorrectly”. Is the fingerprint here an identifying mark? And if so, how can you reject this ballot and accept the distinctive “Lizard People” ballot?

Many e-voting experts advocate optical-scan voting. The ballots above illustrate one argument against opscan: filling in the ballot is a free-form activity that can create ambiguous or identifiable ballots. This creates a problem in super-close elections, because ambiguous ballots may make it impossible to agree on who should have won the election.

Wearing my pure-scientist hat (which I still own, though it sometimes gets dusty), this is unsurprising: an election is a measurement process, and all measurement processes have built-in errors that can make the result uncertain. This is easily dealt with, by saying something like this: Candidate A won by 73 votes, plus or minus 281 votes at 95% confidence. Or perhaps this: Candidate A won with 57% probability. Problem solved!
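For the curious, the two styles of statement can be connected with a small sketch. It assumes, purely for illustration, that the counting error is normally distributed and that the 281-vote figure is the half-width of a 95% interval; `win_probability` is a hypothetical helper, and the post's two example figures are separate illustrations rather than a matched pair.

```python
from statistics import NormalDist

def win_probability(margin, half_width_95):
    """Chance that the true margin is positive, assuming normally distributed error."""
    z95 = NormalDist().inv_cdf(0.975)   # about 1.96 standard deviations
    sigma = half_width_95 / z95         # convert the interval half-width to a standard deviation
    return 1.0 - NormalDist(mu=margin, sigma=sigma).cdf(0.0)

p = win_probability(73, 281)
print(f"P(Candidate A truly won) = {p:.2f}")
```

Under these assumptions a 73-vote lead with a 281-vote error bar works out to a win probability of roughly 0.69, which is better than a coin flip but far from settled.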

In the real world, of course, we need to declare exactly one candidate to be the winner, and a lot can be at stake in the decision. If the evidence is truly ambiguous, somebody is going to end up feeling cheated, and the most we can hope for is a sense that the rules were properly followed in determining the outcome.

Still, we need to keep this in perspective. By all reports, the number of ambiguous ballots in Minnesota is minuscule compared to the total number cast. Let’s hope that, even if some individual ballots don’t speak clearly, the ballots taken collectively leave no doubt as to the winner.

Low Hit Rate Isn't the Problem with TSA Screening

The TSA, which oversees U.S. airport security, comes in for a lot of criticism — much of it deserved. But sometimes commentators let their dislike for the TSA get the better of them, and they offer critiques that don’t stand up logically.

A good example is yesterday’s USA Today article on TSA’s behavioral screening program, and the commentary that followed it. The TSA program trained screeners to look for nervous and suspicious behavior, and to subject travellers exhibiting such behavior to more stringent security measures such as pat-down searches or short interviews.

Commentators condemned the TSA program because fewer than 1% of the selected travellers were ultimately arrested. Is this a sensible objection? I think not, for reasons I’ll explain below.

Before I explain why, let’s take a minute to set aside our general opinions about the TSA. Forget the mandatory shoe removal and toiletry-container nitpicking. Forget that time the screener was rude to you. Forget the slippery answers to inconvenient Constitutional questions. Forget the hours you have spent waiting in line. Put on your blinders please, just for now. We’ll take them off later.

Now suppose that TSA head Kip Hawley came to you and asked you to submit voluntarily to a pat-down search the next time you travel. And suppose you knew, with complete certainty, that if you agreed to the search, this would magically give the TSA a 0.1% chance of stopping a deadly crime. You’d agree to the search, wouldn’t you? Any reasonable person would accept the search to save (by assumption) at least 0.001 lives. This hypothetical TSA program is reasonable, even though it only has a 0.1% arrest rate. (I’m assuming here that an attack would cost only one life. Attacks that killed more people would justify searches with an even smaller arrest rate.)
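The arithmetic behind that hypothetical is just an expected value, sketched below. The function names are mine, and the one-life-per-attack default mirrors the assumption stated above.

```python
def expected_lives_saved(p_stop, lives_per_attack=1):
    """Expected lives saved by one search that stops an attack with probability p_stop."""
    return p_stop * lives_per_attack

def break_even_rate(lives_worth_one_search, lives_per_attack=1):
    """Smallest success rate at which a search still saves the given number of expected lives."""
    return lives_worth_one_search / lives_per_attack

print(expected_lives_saved(0.001))   # the 0.1% hypothetical, one life at stake
print(break_even_rate(0.001, 100))   # a deadlier attack justifies a lower success rate
```

If an attack would cost 100 lives rather than one, a search that succeeds only one time in 100,000 saves the same 0.001 expected lives.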

So the commentators’ critique is weak — but of course this doesn’t mean the TSA program should be seen as a success. The article says that the arrests the system generates are mostly for drug charges or carrying a false ID. Should a false-ID arrest be considered a success for the system? Certainly we don’t want to condone the use of false ID, but I’d bet most of these people are just trying to save money by flying on a ticket in another person’s name — which hardly makes them Public Enemy Number One. Is it really worth doing hundreds of searches to catch one such person? Are those searches really the best use of TSA screeners’ time? Probably not.

On the whole, I’m not sure I can say whether the behavioral screening program is a good idea. It apparently hasn’t caught any big fish yet, but it might have positive effects by deterring some serious crimes. We haven’t seen the data to support it, and we’ve learned to be skeptical of TSA claims that some security measure is necessary.

Now it’s time for the professor to call on one of the diehard civil libertarians in the class, who by this point are bouncing in their seats with both hands waving in the air. They’re dying to point out that our system, for good reason, doesn’t automatically accept claims by the authorities that searches or seizures are justified, and that our institutions are properly skeptical about expanding the scope of searches. They’re unhappy that the debate about this TSA program is happening after it was in place, rather than before it started. These are all good points.

The TSA’s behavioral screening is a rich topic for debate — but not because of its arrest rate.

Can Google Flu Trends Be Manipulated?

Last week researchers from Google and the Centers for Disease Control unveiled a cool new research result, showing that they could gauge the level of influenza infections in a region of the U.S. by seeing how often people in that region did Google searches for certain terms related to the flu and flu symptoms. The search-based predictions correlate remarkably well with the medical data on flu rates: not everyone who searches for “cough medicine” has the flu, but enough do that an increase in flu cases correlates with an increase in searches for “cough medicine” and similar terms. The system is called Google Flu Trends.
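To show what "correlates" means here, the sketch below computes a Pearson correlation between a weekly share of flu-related searches and a weekly flu rate. The numbers are invented for illustration only, and this is not the model from the research paper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented weekly figures for one region (illustration only, not real data).
flu_query_share = [0.010, 0.012, 0.018, 0.025, 0.021, 0.014]  # fraction of searches that are flu-related
flu_visit_rate = [0.015, 0.017, 0.026, 0.034, 0.030, 0.019]   # fraction of doctor visits for flu-like illness

r = pearson(flu_query_share, flu_visit_rate)
print(f"correlation = {r:.3f}")
```

A value near 1 means the two series rise and fall together, which is what would let search volume stand in for the slower medical data.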

Privacy groups have complained, but this use of search data seems benign — indeed, this level of flu detection requires only that search data be recorded per region, not per individual user. The legitimate privacy worry here is not about the flu project as it stands today but about other uses that Google or the government might find for search data later.

My concern today is whether Flu Trends can be manipulated. The system makes inferences from how people search, but people can change their search behavior. What if a person or a small group set out to convince Flu Trends that there was a flu outbreak this week?

An obvious approach would be for the conspirators to do lots of searches for likely flu-related terms, to inflate the count of flu-related searches. If all the searches came from a few computers, Flu Trends could presumably detect the anomalous pattern and the algorithm could reduce the influence of these few computers. Perhaps this is already being done, but I don’t think the research paper mentions it.

A more effective approach to spoofing Flu Trends would be to use a botnet (a large collection of hijacked computers) to send flu-related searches to Google from a larger number of computers. If the added searches were diffuse and well-randomized, they would be very hard to distinguish from legitimate searches, and Flu Trends would probably be fooled.
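The difference between the two attacks can be sketched with one hypothetical defense: capping how many flu-related searches any single source may contribute. The log format, the cap of 3, and the `capped_flu_count` helper are all my assumptions; nothing here is taken from the Flu Trends paper.

```python
from collections import Counter

def capped_flu_count(query_log, cap=3):
    """Count flu-related searches, letting each source contribute at most `cap` of them."""
    per_source = Counter(source for source, is_flu in query_log if is_flu)
    return sum(min(count, cap) for count in per_source.values())

# Hypothetical logs of (source id, is the search flu-related?) pairs.
honest = [(f"user{i}", True) for i in range(20)]              # 20 people, one search each
few_machines = honest + [("attacker", True)] * 500            # one machine sends 500 searches
botnet = honest + [(f"bot{i}", True) for i in range(500)]     # 500 machines send one search each

for name, log in [("honest", honest), ("few machines", few_machines), ("botnet", botnet)]:
    print(name, "raw:", sum(is_flu for _, is_flu in log), "capped:", capped_flu_count(log))
```

The cap shrinks the single-machine attack from 500 extra searches to 3, but the botnet's 500 extra searches pass through untouched because each bot looks like an ordinary user.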

This possibility is not discussed in the Flu Trends research paper. The paper conspicuously fails to identify any of the search terms that the system is looking for. Normally a paper would list the terms, or at least give examples, but none of the terms appear in the paper, and the Flu Trends web site gives only “flu” as an example search term. They might be withholding the search terms to make manipulation harder, but more likely they’re withholding the search terms for business reasons, perhaps because the terms have value in placing or selling ads.

Why would anyone want to manipulate Flu Trends? If flu rates affect the financial markets by moving the stock prices of certain drug or healthcare companies, then a manipulator can profit by sending false signals about flu rates.

The most interesting question about Flu Trends, though, is what other trends might be identifiable via search terms. Government might use similar methods to look for outbreaks of more virulent diseases, and businesses might look for cultural trends. In all of these cases, manipulation will be a risk.

There’s an interesting analogy to web linking behavior. When the web was young, people put links in their sites to point readers to other interesting sites. But when Google started inferring sites’ importance from their incoming links, manipulators started creating links purely for their effect on Google rankings. The result was an ongoing cat-and-mouse game between search engines and manipulators. The more search behavior takes on commercial value, the more manipulators will want to change search behavior for commercial or cultural advantage.

Anything that is valuable to measure is probably, to someone, valuable to manipulate.