Archives for 2009

Finding and Fixing Errors in Google's Book Catalog

There was a fascinating exchange about errors in Google’s book catalog over at the Language Log recently. We rarely see such an open and constructive discussion of errors in large data sets, so this is an unusual opportunity to learn about how errors arise and what can be done about them.

The exchange started with Geoffrey Nunberg pointing to many errors in the metadata associated with Google’s book search project. (Here “metadata” refers to the kind of information that would have been on a card in the card catalog of a traditional library: a book’s date of publication, subject classification, and so on.) Some of the errors are pretty amusing, including Dickens writing books before he was born, a Bob Dylan biography published in the nineteenth century, and Moby Dick classified under “computers”. Nunberg called this a “train wreck” and blamed Google’s overaggressive use of computer analysis to extract bibliographic information from scanned images.

Things really got interesting when Google’s Jon Orwant replied (note that the red text starting “GN” is Nunberg’s response to Orwant), with an extraordinarily open and constructive discussion of how the errors described by Nunberg arose, and of the problems Google faces in trying to ensure the accuracy of a huge dataset drawn from diverse sources.

Orwant starts, for example, by acknowledging that Google’s metadata probably contains millions of errors. But he asserts that that is to be expected, at least at first: “we’ve learned the hard way that when you’re dealing with a trillion metadata fields, one-in-a-million errors happen a million times over.” If you take catalogs from many sources and aggregate them into a single meta-catalog — more or less what Google is doing — you’ll inherit all the errors of your sources, unless you’re extraordinarily clever and diligent in comparing different sources to sniff out likely errors.
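
To make the aggregation problem concrete, here is a minimal sketch of the kind of cross-source check a meta-catalog builder might run. This is not Google’s actual pipeline; the record format, field names, and data are invented for illustration.

```python
from collections import defaultdict

# Toy records from three hypothetical source catalogs (invented data).
records = [
    {"source": "catalog_a", "title": "A Tale of Two Cities",
     "author": "Charles Dickens", "year": 1859},
    {"source": "catalog_b", "title": "A Tale of Two Cities",
     "author": "Charles Dickens", "year": 1859},
    {"source": "catalog_c", "title": "A Tale of Two Cities",
     "author": "Charles Dickens", "year": 1589},  # a typo inherited from one source
]

# Group each (title, author) pair across sources and flag disagreements;
# a lone outlier is more likely a bad record than a real edition.
by_book = defaultdict(list)
for r in records:
    by_book[(r["title"], r["author"])].append(r)

for (title, author), rows in by_book.items():
    years = {r["year"] for r in rows}
    if len(years) > 1:
        print(f"Suspect dates for {title!r} by {author}: {sorted(years)}")
```

A real system would of course need much fuzzier matching of titles, authors, and editions, which is part of why this is hard at Google’s scale.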

To make things worse, the very power and flexibility of a digital index can raise the visibility of the errors that do exist, by making them easy to find. Want to find all of the books, anywhere in the world, written by Charles Dickens and (wrongly thought to be) published before 1850? Just type a simple query. Google’s search technology did a lot to help Nunberg find errors. But it’s not just error-hunters who will find more errors — if a more powerful metadata search facility is more useful, researchers will rely on it more, and will therefore be tripped up by more errors.
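
Continuing the toy example above, the “simple query” really is a one-line filter. The rows and dates below are invented, but they show how cheaply a digital index surfaces every suspect entry at once.

```python
# Invented meta-catalog rows; the first two dates are deliberately wrong.
books = [
    {"title": "A Tale of Two Cities", "author": "Charles Dickens", "year": 1759},
    {"title": "Great Expectations", "author": "Charles Dickens", "year": 1681},
    {"title": "Our Mutual Friend", "author": "Charles Dickens", "year": 1865},
]

# The query from the text: every Dickens title recorded as published before 1850.
suspects = [b for b in books if b["author"] == "Charles Dickens" and b["year"] < 1850]
for b in suspects:
    print(b["title"], b["year"])
```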

What’s most interesting to me is a seeming difference in mindset between critics like Nunberg on the one hand, and Google on the other. Nunberg thinks of Google’s metadata catalog as a fixed product that has some (unfortunately large) number of errors, whereas Google sees the catalog as a work in progress, subject to continual improvement. Even calling Google’s metadata a “catalog” seems to connote a level of completion and immutability that Google might not assert. An electronic “card catalog” can change every day — a good thing if the changes are strict improvements such as error fixes — in a way that a traditional card catalog wouldn’t.

Over time, the errors Nunberg reported will be fixed, and as a side effect some errors with similar causes will be fixed too. Whether that is good enough remains to be seen.

When spammers try to go legitimate

I hate to sound like a broken record, complaining about professional mail distribution / spam-houses that are entirely unwilling to require their customers to follow a strict opt-in discipline. But I’m going to complain again and I’m going to name names.

Today, I got a spam touting a Citrix product (“Free virtualization training for you and your students!”). This message arrived in my mailbox with an unsubscribe link hosted by xmr3.com, which bounced me back to a page at Citrix. The Citrix page then asked me for assorted personal information (name, email, country, employer). There was also a mailto link from xmr3 allowing me to opt out.

At no time did I ever opt into any communication from Citrix. I’ve never done business with them. I don’t know anybody who works there. I couldn’t care less about their product.

What’s wrong here? A seemingly legitimate company is sending out spam to people who have never requested anything from them. They’re not employing any of the tactics that are normally employed by spammers to hide themselves. They’re not advertising drugs for sexual dysfunction or replicas of expensive watches. Maybe they got my email by surfing through faculty web pages. Maybe they got my email from some conference registration list. They’ve used a dubious third party to distribute the spam, one that provides no way to report that its client is violating its terms of service (nor can those terms of service be found anywhere on its home page).

Based on this, it’s easy to advocate technical countermeasures (e.g., black-hole treatment for xmr3.com and citrix.com) or improvements to laws (the message appears to be superficially compliant with the CAN-SPAM Act, but a detailed analysis would take more time than it’s worth). My hope is that we can maybe also apply some measure of shame. Citrix, as a company, should be embarrassed and ashamed to advertise itself this way. If it ever becomes culturally acceptable for companies to do this sort of thing, the deluge of “legitimate” spam will be intolerable.
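
For readers wondering what “black-hole treatment” amounts to in practice, here is a minimal sketch of sender-domain filtering. It is a hypothetical illustration of the idea, not a recommendation for any particular mail setup, and the matching logic is deliberately simplistic.

```python
# Hypothetical "black-hole" filter: silently drop mail from listed sender domains.
BLACKHOLED_DOMAINS = {"xmr3.com", "citrix.com"}

def should_blackhole(sender_address: str) -> bool:
    """Return True if mail from this sender should be dropped without a bounce."""
    domain = sender_address.rsplit("@", 1)[-1].lower()
    # Match the listed domain itself or any of its subdomains.
    return any(domain == d or domain.endswith("." + d) for d in BLACKHOLED_DOMAINS)

print(should_blackhole("promo@mail.xmr3.com"))    # True
print(should_blackhole("colleague@example.edu"))  # False
```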

Subpoenas and Search Warrants as Security Threats

When I teach computer security, one of the first lessons is on the need to have a clear threat model, that is, a clearly defined statement of which harms you are trying to prevent, and what assumptions you are making about the capabilities and motivation of the adversaries who are trying to cause those harms. Many security failures stem from threat model confusion. Conversely, a good threat model often shapes the solution.

The same is true for security research: the solutions you develop will depend strongly on what threat you are trying to address.

Lately I’ve noticed more and more papers in the computer security research literature that include subpoenas and/or search warrants as part of their threat model. For example, the Vanish paper, which won Best Student Paper (the de facto best paper award) at the recent Usenix Security symposium, uses the word “subpoena” 13 times, in passages like this:

Attackers. Our motivation is to protect against retroactive data disclosures, e.g., in response to a subpoena, court order, malicious compromise of archived data, or accidental data leakage. For some of these cases, such as the subpoena, the party initiating the subpoena is the obvious “attacker.” The final attacker could be a user’s ex-husband’s lawyer, an insurance company, or a prosecutor. But executing a subpoena is a complex process involving many other actors …. For our purposes we define all the involved actors as the “adversary.”

(I don’t mean to single out this particular paper. This is just the paper I had at hand — others make the same move.)

Certainly, subpoenas are no fun for any of the parties involved. They’re costly to deal with, not to mention the ick factor inherent in compelled disclosure to a stranger, even if you’re totally blameless. And certainly, subpoenas are sometimes used to harass, rather than to gather legitimately relevant evidence. But are subpoenas really the biggest threat to email confidentiality? Are they anywhere close to the biggest threat? Almost certainly not.

Usually when the threat model mentions subpoenas, the bigger threats in reality come from malicious intruders or insiders. The biggest risk in storing my documents on CloudCorp’s servers is probably that somebody working at CloudCorp, or a contractor hired by them, will mess up or misbehave.

So why talk about subpoenas rather than intruders or insiders? Perhaps this kind of talk is more diplomatic than the alternative. If I’m talking about the risks of Gmail, I might prefer not to point out that my friends at Google could hire someone who is less than diligent, or less than honest. If I talk about subpoenas as the threat, nobody in the room is offended, and the security measures I recommend might still be useful against intruders and insiders. It’s more polite to talk about data losses that are compelled by a mysterious, powerful Other — in this case an Anonymous Lawyer.

Politeness aside, overemphasizing subpoena threats can be harmful in at least two ways. First, we can easily forget that enforcement of subpoenas is often, though not always, in society’s interest. Our legal system works better when fact-finders have access to a broader range of truthful evidence. That’s why we have subpoenas in the first place. Not all subpoenas are good — and in some places with corrupt or evil legal systems, subpoenas deserve no legitimacy at all — but we mustn’t lose sight of society’s desire to balance the very real cost imposed on the subpoena’s target and affected third parties, against the usefulness of the resulting evidence in administering justice.

The second harm is to security. To the extent that we focus on the subpoena threat, rather than the larger threats of intruders and insiders, we risk finding “solutions” that fail to solve our biggest problems. We might get lucky and end up with a solution that happens to address the bigger threats too. We might even design a solution for the bigger threats, and simply use subpoenas as a rhetorical device in explaining our solution — though it seems risky to mislead our audience about our motivations. If our solution flows from our threat model, as it should, then we need to be very careful to get our threat model right.

Steve Schultze to Join CITP as Associate Director

I’m thrilled to announce that Steve Schultze will be joining the Center for Information Technology Policy at Princeton, as our new Associate Director, starting September 15. We know Steve well, having followed his work as a fellow at the Berkman Center at Harvard, not to mention his collaboration with us on RECAP.

Steve embodies the cross-disciplinary, theory-meets-practice vibe of CITP. He has degrees in computer science, philosophy, and new media studies; he helped build a non-profit tech startup; and he has worked as a policy analyst in media, open access, and telecommunications. Steve is a strong organizer, communicator, and team-builder. When he arrives, he should hit the ground running.

Steve replaces David Robinson, who put in two exemplary years as our first Associate Director. We wish David continued success as he starts law school at Yale.

The next chapter in CITP’s growth starts in September, with a busy events calendar, a full slate of visiting fellows, and Steve Schultze helping to steer the ship.

The Trouble with PACER Fees

One sentiment I’ve seen a number of people express about our release of RECAP is illustrated by this comment here at Freedom to Tinker:

Technically impressive, but also shortsighted. There appears a socialistic cultural trend that seeks to disconnect individual accountability to ones choices. $.08 a page is hardly burdensome or profitable, and clearly goes to offset costs. If additional taxes are required to make up the shortfall RECAP seems likely to create, we all will pay more in general taxes even though only a small few ever access PACER.

Now, I don’t think anyone who’s familiar with my work would accuse me of harboring socialistic sympathies. RECAP has earned the endorsement of my colleague Jim Harper of the libertarian Cato Institute and Christopher Farrell of the conservative watchdog group Judicial Watch. Those guys are not socialists.

Still, there’s a fair question here: under the model we advocate, taxpayers might wind up picking up some of the costs currently being borne by PACER users. Why should taxpayers in general pay for a service that only a tiny fraction of the population will ever use?

I think there are two answers. The narrow answer is that this misunderstands where the costs of PACER come from. There are four distinct steps in the process of publishing a judicial record. First, someone has to create the document. This is done by a judge in some cases and by private litigants in others. Second, someone has to convert the document to electronic format. This is a small and falling cost, because both judges and litigants increasingly produce documents using word processors, so they’re digital from their inception. Third, someone has to redact the documents to ensure private information doesn’t leak out. This is supposed to be done by private parties when they submit documents, but they don’t always do what they’re supposed to, necessitating extra work by court personnel. Finally, the documents need to be uploaded to a website where they can be downloaded by the public.

The key thing to understand here is that the first three steps are things that the courts would be doing anyway if PACER didn’t exist. Court documents were already public records before PACER came onto the scene. Anyone could (and still can) drive down to the courthouse, request any unsealed document they want, and make as many photocopies as they wish. Moreover, even if documents weren’t public, the courts would likely still be transitioning to an electronic system for their internal use.

This means that the only additional cost of PACER, beyond the activities the courts would be doing anyway, is the web servers, bandwidth, and personnel required to run the PACER websites themselves. But RECAP users impose no additional load on PACER’s servers. Users download RECAP documents directly from the Internet Archive. So RECAP is entirely consistent with the principle that PACER users should pay for the resources they use.
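
To make the “no additional load” point concrete, here is a rough sketch of the retrieval order as I understand it. The URL layout and function are illustrative only, not RECAP’s actual code.

```python
import urllib.error
import urllib.request

# Illustrative only: the real archive item naming is RECAP's concern, not shown here.
ARCHIVE_BASE = "https://archive.org/download"

def fetch_document(item: str, filename: str) -> bytes:
    """Try the free Internet Archive copy first; only on a miss would a user
    fall back to buying the document from PACER at per-page rates."""
    url = f"{ARCHIVE_BASE}/{item}/{filename}"  # hypothetical URL layout
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()  # served by the Archive, so PACER sees no traffic
    except urllib.error.HTTPError as e:
        raise RuntimeError(f"Not in the archive yet ({e.code}); PACER purchase needed")
```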

I think there’s also a deeper answer to this question, which is that it misunderstands the role of the judiciary in a free society. The service the judiciary provides to the public is not the production of individual documents or even the resolution of individual cases. The service it provides is the maintenance of a comprehensive and predictable system of law that is the foundation for our civilization. You benefit from this system whether or not you ever appear in court because it gives you confidence that your rights will be protected in a fair and predictable manner. And in particular, you benefit from judicial transparency because transparency improves accountability. Even if you’re not personally interested in monitoring the judiciary for abuses, you benefit when other people do so.

This is something I take personally because I’ve done a bit of legal reporting myself. I obviously get some direct benefits from doing this—I sometimes get paid and it’s always fun to have people read my work. But I like to think that my writing about the law also benefits society at large by increasing public understanding and scrutiny of the judicial system. And charging for access to the law will be most discouraging to people like me who are trying to do more than just win a particular court case. Journalists, public interest advocates, academics, and the like generally don’t have clients they can bill for the expense of legal research, so PACER fees are a significant barrier.

There’s no conflict between a belief in free markets and a belief that everyone is entitled to information about the legal system that governs their lives. To the contrary, free markets depend on the “rules of the game” being fair and predictable. The kind of judicial transparency that RECAP hopes to foster only furthers that goal.