November 23, 2024

Finding and Fixing Errors in Google's Book Catalog

There was a fascinating exchange about errors in Google’s book catalog over at the Language Log recently. We rarely see such an open and constructive discussion of errors in large data sets, so this is an unusual opportunity to learn about how errors arise and what can be done about them.

The exchange started with Geoffrey Nunberg pointing to many errors in the metadata associated with Google’s book search project. (Here “metadata” refers to the kind of information that would have been on a card in the card catalog of an traditional library: a book’s date of publication, subject classification, an so on.) Some of the errors are pretty amusing, including Dickens writing books before he was born, a Bob Dylan biography published in the nineteenth century, Moby Dick classified under “computers”. Nunberg called this a “train wreck” and blamed Google’s overaggressive use of computer analysis to extract bibliographic information from scanned images.

Things really got interesting when Google’s Jon Orwant replied (note that the red text starting “GN” is Nunberg’s response to Orwant), with an extraordinarily open and constructive discussion of how the errors described by Nunberg arose, and the problems Google faces in trying to ensure accuracy of a huge dataset drawn from diverse sources.

Orwant starts, for example, by acknowledging that Google’s metadata probably contains millions of errors. But he asserts that that is to be expected, at least at first: “we’ve learned the hard way that when you’re dealing with a trillion metadata fields, one-in-a-million errors happen a million times over.” If you take catalogs from many sources and aggregate them into a single meta-catalog — more or less what Google is doing — you’ll inherit all the errors of your sources, unless you’re extraordinarily clever and diligent in comparing different sources to sniff out likely errors.

To make things worse, the very power and flexibility of a digital index can raise the visibility of the errors that do exist, by making them easy to find. Want to find all of the books, anywhere in the world, written by Charles Dickens and (wrongly thought to be) published before 1850? Just type a simple query. Google’s search technology did a lot to help Nunberg find errors. But it’s not just error-hunters who will find more errors — if a more powerful metadata search facility is more useful, researchers will rely on it more, and will therefore be tripped up by more errors.

What’s most interesting to me is a seeming difference in mindset between critics like Nunberg on the one hand, and Google on the other. Nunberg thinks of Google’s metadata catalog as a fixed product that has some (unfortunately large) number of errors, whereas Google sees the catalog as a work in progress, subject to continual improvement. Even calling Google’s metadata a “catalog” seems to connote a level of completion and immutability that Google might not assert. An electronic “card catalog” can change every day — a good thing if the changes are strict improvements such as error fixes — in a way that a traditional card catalog wouldn’t.

Over time, the errors Nunberg reported will be fixed, and as a side effect some errors with similar causes will be fixed too. Whether that is good enough remains to be seen.

Subpoenas and Search Warrants as Security Threats

When I teach computer security, one of the first lessons is on the need to have a clear threat model, that is, a clearly defined statement of which harms you are trying to prevent, and what assumptions you are making about the capabilities and motivation of the adversaries who are trying to cause those harms. Many security failures stem from threat model confusion. Conversely, a good threat model often shapes the solution.

The same is true for security research: the solutions you develop will depend strongly on what threat you are trying to address.

Lately I’ve noticed more and more papers in the computer security research literature that include subpoenas and/or search warrants as part of their threat model. For example, the Vanish paper, which won Best Student Paper (the de facto best paper award) at the recent Usenix Security symposium, uses the word “subpoena” 13 times, in passages like this:

Attackers. Our motivation is to protect against retroactive data disclosures, e.g., in response to a subpoena, court order, malicious compromise of archived data, or accidental data leakage. For some of these cases, such as the subpoena, the party initiating the subpoena is the obvious “attacker.” The final attacker could be a user’s ex-husband’s lawyer, an insurance company, or a prosecutor. But executing a subpoena is a complex process involving many other actors …. For our purposes we define all the involved actors as the “adversary.”

(I don’t mean to single out this particular paper. This is just the paper I had at hand — others make the same move.)

Certainly, subpoenas are no fun for any of the parties involved. They’re costly to deal with, not to mention the ick factor inherent in compelled disclosure to a stranger, even if you’re totally blameless. And certainly, subpoenas are sometimes used to harass, rather than to gather legitimately relevant evidence. But are subpoenas really the biggest threat to email confidentiality? Are they anywhere close to the biggest threat? Almost certainly not.

Usually when the threat model mentions subpoenas, the bigger threats in reality come from malicious intruders or insiders. The biggest risk in storing my documents on CloudCorp’s servers is probably that somebody working at CloudCorp, or a contractor hired by them, will mess up or misbehave.

So why talk about subpoenas rather than intruders or insiders? Perhaps this kind of talk is more diplomatic than the alternative. If I’m talking about the risks of Gmail, I might prefer not to point out that my friends at Google could hire someone who is less than diligent, or less than honest. If I talk about subpoenas as the threat, nobody in the room is offended, and the security measures I recommend might still be useful against intruders and insiders. It’s more polite to talk about data losses that are compelled by a mysterious, powerful Other — in this case an Anonymous Lawyer.

Politeness aside, overemphasizing subpoena threats can be harmful in at least two ways. First, we can easily forget that enforcement of subpoenas is often, though not always, in society’s interest. Our legal system works better when fact-finders have access to a broader range of truthful evidence. That’s why we have subpoenas in the first place. Not all subpoenas are good — and in some places with corrupt or evil legal systems, subpoenas deserve no legitimacy at all — but we mustn’t lose sight of society’s desire to balance the very real cost imposed on the subpoena’s target and affected third parties, against the usefulness of the resulting evidence in administering justice.

The second harm is to security. To the extent that we focus on the subpoena threat, rather than the larger threats of intruders and insiders, we risk finding “solutions” that fail to solve our biggest problems. We might get lucky and end up with a solution that happens to address the bigger threats too. We might even design a solution for the bigger threats, and simply use subpoenas as a rhetorical device in explaining our solution — though it seems risky to mislead our audience about our motivations. If our solution flows from our threat model, as it should, then we need to be very careful to get our threat model right.

Steve Schultze to Join CITP as Associate Director

I’m thrilled to announce that Steve Schultze will be joining the Center for Information Technology Policy at Princeton, as our new Associate Director, starting September 15. We know Steve well, having followed his work as a fellow at the Berkman Center at Harvard, not to mention his collaboration with us on RECAP.

Steve embodies the cross-disciplinary, theory-meets-practice vibe of CITP. He has degrees in computer science, philosophy, and new media studies; he helped build a non-profit tech startup; and he has worked as a policy analyst in media, open access, and telecommunications. Steve is a strong organizer, communicator, and team-builder. When he arrives, he should hit the ground running.

Steve replaces David Robinson, who put in two exemplary years as our first Associate Director. We wish David continued success as he starts law school at Yale.

The next chapter in CITP’s growth starts in September, with a busy events calendar, a full slate of visiting fellows, and Steve Schultze helping to steer the ship.

AP's DRM Announcement: Much Ado About Nothing

Last week the Associated Press announced it would be developing some kind of online news registry to control use of news content. From AP’s press release:

The registry will employ a microformat for news developed by AP and which was endorsed two weeks ago by the Media Standards Trust, a London-based nonprofit research and development organization that has called on news organizations to adopt consistent news formats for online content. The microformat will essentially encapsulate AP and member content in an informational “wrapper” that includes a digital permissions framework that lets publishers specify how their content is to be used online and which also supplies the critical information needed to track and monitor its usage.

The registry also will enable content owners and publishers to more effectively manage and control digital use of their content, by providing detailed metrics on content consumption, payment services and enforcement support. It will support a variety of payment models, including pay walls.

It was hard to make sense of this, so I went looking for more information. AP posted a diagram of the system, which only adds to the confusion — your satisfaction with the diagram will be inversely proportional to your knowledge of the technology.

As far as I can tell, the underlying technology is based on hNews, a microformat for news, shown in the AP diagram, that was announced by AP and the Media Standards Trust two weeks before the recent AP announcement.

Unfortunately for AP, the hNews spec bears little resemblance to AP’s claims about it. hNews is a handy way of annotating news stories with information about the author, dateline, and so on. But it doesn’t “encapsulate” anything in a “wrapper”, nor does it do much of anything to facilitate metering, monitoring, or paywalls.

AP also says that hNews ” includes a digital permissions framework that lets publishers specify how their content is to be used online”. This may sound like a restrictive DRM scheme, aimed at clawing back the rights copyright grants to users. But read the fine print. hNews does include a “rights” field that can be attached to an article, but the rights field uses ccREL, the Creative Commons Rights Expression Language, whose definition states unequivocally that it does not limit users’ rights already granted by copyright and can only convey further rights to the user. Here’s the ccREL definition, page 9:

Here are the License properties defined as part of ccREL:

  • cc:permits — permits a particular use of the Work above and beyond what default copyright law allows.
  • cc:prohibits — prohibits a particular use of the Work, specifically affecting the scope of the permissions provided by cc:permits (but not reducing rights granted under copyright).

It seems that there is much less to the AP’s announcement than meets the eye. If there’s a story here, it’s in the mismatch between the modest and reasonable underlying technology, and AP’s grandiose claims for it.

What Economic Forces Drive Cloud Computing?

You know a technology trend is all-pervasive when you see New York Times op-eds about it — and this week saw the first Times op-ed about cloud computing, by Jonathan Zittrain. I hope to address JZ’s argument another day. Today I want to talk about a more basic issue: why we’re moving toward the cloud.

(Background: “Cloud computing” refers to the trend away from services provided by software running on standalone personal computers (“clients”), toward services provided across the Net with data stored in centralized data centers (“servers”). GMail and HotMail provide email in the cloud, Flickr provides photo albums in the cloud, and so on.)

The conventional wisdom is that functions are moving from the client to the server because server-side computing resources (storage, computation, and data transfer) are falling in cost, relative to the cost of client-side resources. Basic economics says that if a product uses two inputs, and the relative costs of the inputs change, production will shift to use more of the newly-cheap input and less of the newly-expensive one — so as server-side resources get relatively cheaper, designs will start to use more server-side resources and fewer client-side resources. (In fact, both server- and client-side resources are getting cheaper, but the argument still works as long as the cost of server-side resources is falling faster, which it probably is.)

This argument seems reasonable — and smart people have repeated it — but I think it misses the most important factors driving us into the cloud.

For starters, the standard argument assume that a move into the cloud simply relocates functions from client to server — we’re consuming the same resources, just consuming them in a place where they’re cheaper. But if you dig into the details, it looks like the cloud approach may use a lot more resources.

Rather than storing data on the client, the cloud approach often replicates data, storing the data on both server and client. If I use GMail on my laptop, my messages are stored on Google’s servers and on my laptop. Beyond that, some computation is replicated on both client and server, and we mustn’t forget that it’s less resource-efficient to provide computing inside a web browser than on the raw hardware. Add all of this up, and we might easily find that a cloud approach, uses more client-side resources than a client-only approach.

Why, then, are we moving into the cloud? The key issue is the cost of management. Thus far we focused only on computing resources such as storage, computation, and data transfer; but the cost of managing all of this — making sure the right software version is installed, that data is backed up, that spam filters are updated, and so on — is a significant part of the picture. Indeed, as the cost of computing resources, on both client and server sides, continues to fall rapidly, management becomes a bigger and bigger fraction of the total cost. And so we move toward an approach that minimizes management cost, even if that approach is relatively wasteful of computing resources. The key is not that we’re moving computation from client to server, but that we’re moving management to the server, where a team of experts can manage matters for many users.

This is still a story about shifts in the relative costs of inputs. The cost of computing is getting cheaper (wherever it happens), so we’re happy to use more computing resources in order to use our relatively expensive management inputs more efficiently.

What does tell us about the future of cloud services? That question will have to wait for another day.