April 23, 2014

Finding and Fixing Errors in Google's Book Catalog

There was a fascinating exchange about errors in Google’s book catalog over at the Language Log recently. We rarely see such an open and constructive discussion of errors in large data sets, so this is an unusual opportunity to learn about how errors arise and what can be done about them.

The exchange started with Geoffrey Nunberg pointing to many errors in the metadata associated with Google’s book search project. (Here “metadata” refers to the kind of information that would have been on a card in the card catalog of a traditional library: a book’s date of publication, subject classification, and so on.) Some of the errors are pretty amusing, including Dickens writing books before he was born, a Bob Dylan biography published in the nineteenth century, and Moby Dick classified under “computers”. Nunberg called this a “train wreck” and blamed Google’s overaggressive use of computer analysis to extract bibliographic information from scanned images.

Things really got interesting when Google’s Jon Orwant replied (note that the red text starting “GN” is Nunberg’s response to Orwant), with an extraordinarily open and constructive discussion of how the errors described by Nunberg arose, and the problems Google faces in trying to ensure accuracy of a huge dataset drawn from diverse sources.

Orwant starts, for example, by acknowledging that Google’s metadata probably contains millions of errors. But he asserts that that is to be expected, at least at first: “we’ve learned the hard way that when you’re dealing with a trillion metadata fields, one-in-a-million errors happen a million times over.” If you take catalogs from many sources and aggregate them into a single meta-catalog — more or less what Google is doing — you’ll inherit all the errors of your sources, unless you’re extraordinarily clever and diligent in comparing different sources to sniff out likely errors.
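To make Orwant's arithmetic and the idea of comparing sources concrete, here is a minimal Python sketch. It is not a description of how Google actually does this. The record layout, the source names, and the helper functions are all hypothetical, and the sample records are made up for illustration.

```python
from collections import defaultdict

# Hypothetical records: (source, book_id, field, value).
# The data is invented for illustration, including the two planted errors.
records = [
    ("library_a", "bleak-house", "year", 1853),
    ("library_b", "bleak-house", "year", 1853),
    ("vendor_c", "bleak-house", "year", 1800),          # planted error
    ("library_a", "moby-dick", "subject", "Fiction"),
    ("library_b", "moby-dick", "subject", "Fiction"),
    ("vendor_c", "moby-dick", "subject", "Computers"),  # planted error
    ("library_a", "hard-times", "year", 1854),
    ("library_b", "hard-times", "year", 1854),
]


def expected_errors(num_fields, one_in):
    """Expected error count when one field in every `one_in` is wrong."""
    return num_fields // one_in


def find_disagreements(records):
    """Return fields for which the sources report more than one value.

    The result maps (book_id, field) to {value: [sources reporting it]}.
    A disagreement does not say which value is right, only that at
    least one source must be wrong, so the field deserves a closer look.
    """
    values = defaultdict(lambda: defaultdict(list))
    for source, book_id, field, value in records:
        values[(book_id, field)][value].append(source)

    flagged = {}
    for key, by_value in values.items():
        if len(by_value) > 1:
            flagged[key] = {v: sorted(s) for v, s in by_value.items()}
    return flagged


# A trillion fields at a one-in-a-million error rate.
print(expected_errors(10**12, 10**6))

for key, by_value in find_disagreements(records).items():
    print(key, by_value)
```

Note the limitation: if every source repeats the same mistake, a comparison like this sees nothing wrong, which is one reason an aggregated catalog inherits errors from its sources.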

To make things worse, the very power and flexibility of a digital index can raise the visibility of the errors that do exist, by making them easy to find. Want to find all of the books, anywhere in the world, written by Charles Dickens and (wrongly thought to be) published before 1850? Just type a simple query. Google’s search technology did a lot to help Nunberg find errors. But it’s not just error-hunters who will find more errors — if a more powerful metadata search facility is more useful, researchers will rely on it more, and will therefore be tripped up by more errors.
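As a rough illustration of how little effort such a query takes once the metadata is structured, here is a small Python sketch. The `Book` type, the sample catalog, and the `published_before` helper are hypothetical and do not reflect Google's actual search interface. The one outside fact used is that Dickens was born in 1812.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    title: str
    author: str
    year: int


# Hypothetical catalog; one entry carries a deliberately wrong year.
CATALOG = [
    Book("Oliver Twist", "Charles Dickens", 1838),
    Book("Great Expectations", "Charles Dickens", 1861),
    Book("A Tale of Two Cities", "Charles Dickens", 1800),  # wrong on purpose
    Book("Moby-Dick", "Herman Melville", 1851),
]

DICKENS_BIRTH_YEAR = 1812


def published_before(catalog, author, year):
    """Return books by `author` whose recorded year is earlier than `year`."""
    return [b for b in catalog if b.author == author and b.year < year]


# The "simple query" from the text.
hits = published_before(CATALOG, "Charles Dickens", 1850)

# A recorded year earlier than the author's birth cannot be right.
impossible = [b for b in hits if b.year < DICKENS_BIRTH_YEAR]

for book in impossible:
    print(f"Suspicious record: {book.title} ({book.year})")
```

The same filter serves the error-hunter and the researcher alike. The difference is that the researcher may not think to run the second check.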

What’s most interesting to me is a seeming difference in mindset between critics like Nunberg on the one hand, and Google on the other. Nunberg thinks of Google’s metadata catalog as a fixed product that has some (unfortunately large) number of errors, whereas Google sees the catalog as a work in progress, subject to continual improvement. Even calling Google’s metadata a “catalog” seems to connote a level of completion and immutability that Google might not assert. An electronic “card catalog” can change every day — a good thing if the changes are strict improvements such as error fixes — in a way that a traditional card catalog wouldn’t.

Over time, the errors Nunberg reported will be fixed, and as a side effect some errors with similar causes will be fixed too. Whether that is good enough remains to be seen.

Comments

  1. brent s says:

    I love the way google does all these huge public projects completely for free with an open attitude of build it, release it, fix it – and people who are used to the government model of build it, fix it, fix it, fix it, rebuild it, never release it demand better services from google.

    why, exactly, is it even google’s job to do this stuff? and who, exactly, is offering to do this stuff if google doesn’t?

    I think the standard response when you discover someone else’s error is ‘sorry’ – where they should say ‘aha! thankyou!’

  2. Eric Hellman says:

    Library catalogs as they exist today are themselves becoming less and less fixed representations of static collections. They are never “complete” or “immutable”. They have had to evolve to deal with electronic resources which change from week to week, even from day to day. It’s been a long time since any serious library has used cards.

    Nunberg’s potshots at Google are amusing, but ultimately unfair. My post on the subject (with a Princeton angle!) is here: http://go-to-hellman.blogspot.com/2009/09/white-dielectric-substance-in-library.html

  3. Rick says:

    This just shows that even the mighty Google isn’t perfect. I guess nothing is. My understanding is that Google is trying to involve the public in determining the actual performance of their systems. It’s like a systems testing procedure. With all those bugs and errors being brought to their attention, I’m pretty sure that Google will have solid data to work on.

    • Michelle says:

      I agree with Rick, nothing is perfect.

      Google is actually trying to create the best search-engine experience, and they did a very good job. When Google first debuted, there were more popular search-engines. But Google became the major player over time, proving that their systems and algorithms are very powerful– and relevant.

      -Michelle

  4. Anonymous says:

    Nunberg points out that you can find certain kinds of information with more accuracy using WorldCat than using Google Books. And that’s true, so far as it goes. But what he doesn’t say is that WorldCat has its own problems. WorldCat is great as an inventory database, collocating different editions, etc. It usually gets its dates right too. But depending on the type of search, WorldCat can on many occasions be a rather horrible discovery tool. Its metadata is largely based on the Library of Congress Subject Headings, which are outdated and often useless. Now if OCLC and Google could just figure out how to work better together, that would be swell. I think they could complement each other quite nicely.