Google, “The Last Library,” and Millions of Metadata Mistakes
Search giant says it's improving, and that a massive project inevitably means a percentage of errors
Norman Oder -- Library Journal, 9/3/2009
One of the most buzzed-about criticisms of the Google Book Search project, and thus the pending settlement, emerged at a conference last Friday, when Geoff Nunberg, a linguistics expert and adjunct full professor at UC Berkeley's School of Information, called Google’s project likely “the Last Library.” He proceeded to slam the quality of Google’s metadata, pointing to numerous egregious misclassifications and erroneous dates and calling the situation “a train wreck: a mish-mash wrapped in a muddle wrapped in a mess” (for LJAN columnist Barbara Fister's take, see "The Last Library is Greater than Google").
Google Book Search point man Dan Clancy, according to The Register, said “I don't view Google Book Search as the one and only library,” though he acknowledged that no other competitor likely would have entered the market or taken the legal risk to scan first and negotiate copyright later.
Following Nunberg's post, a Google engineer responded directly, acknowledging that a project this large inevitably produces many errors and saying that the company was already refining its approach.
The Last Library?
Nunberg posted his presentation at LanguageLog, titled A Metadata Train Wreck, stating, “This is almost certainly the Last Library, after all. There's no Moore's Law for capture, and nobody is ever going to scan most of these books again. So whoever is in charge of the collection a hundred years from now—Google? UNESCO? Wal-Mart?—these are the files that scholars are going to be using then. All of which lends a particular urgency to the concerns about whether Google is doing this right.”
Nunberg noted that keyword search works fine for casual research, “[b]ut for scholars looking for a particular edition of Leaves of Grass, say, it doesn't do a lot of good just to enter ‘I contain multitudes’ in the search box and hope for the best.”
He noted that, for some reason, GBS reports the 1899 publication of books ranging from Raymond Chandler's Killer in the Rain to Fodor's Guide to Nova Scotia. He cited endemic rather than sporadic errors: “Of the first ten hits for Tristram Shandy, four are classified as fiction, four as ‘Family & Relationships,’ one as ‘Biography & Autobiography,’ and one is not classified.” Nunberg criticized Clancy for placing the blame on libraries and publishers.
Nunberg noted that the categories are drawn from the BISAC codes used by the book industry, not libraries, and questioned why Google chose the BISAC categories, which he described as “well suited to organizing the shelves of a modern 35,000 foot chain bookstore or a small public library where ordinary consumers or patrons are browsing,” rather than a scheme suited to a major library. (Indeed, some libraries have begun to drop Dewey for BISAC.)
He noted that Google apparently did acquire the library records for scanned books along with the scans, but hasn’t licensed them for display or use.
“Beyond clearing up the obvious errors, the larger question is whether Google's engineers should be trusted to make all the decisions about metadata design and implementation for what will probably wind up being the universal library for a long time to come, with no contractural obligation, and only limited commercial incentives, to get it right,” Nunberg wrote. “That's probably one of the questions the Antitrust Division of the Justice Department should be asking as it ponders the Google Book Settlement over the coming month.”
Google’s response
Google’s Jon Orwant, who manages the Google Books metadata team, responded at length:
“First, we know we have problems. Oh lordy we have problems. Geoff refers to us having hundreds of thousands of errors. I wish it were so. We have millions. We have collected over a trillion individual metadata fields; when we use our computing grid to shake them around and decide which books exist in the world, we make billions of decisions and commit millions of mistakes. Some of them are eminently avoidable; others persist because we are at the mercy of the data available to us. The quality of our metadata now is a lot better than it was six months ago, and it'll be better still six months from now. We will never stop improving it.”
“We have a cacophony of metadata sources—over a hundred—and they often conflict,” he added, contrasting that with library cataloging practices. “Without good metadata, effective search is impossible, and Google really wants to get search right.”
Why 1899?
Orwant explained the prevalence of 1899: “we recently began incorporating metadata from a Brazilian metadata provider that, unbeknownst to us, used 1899 as the default date when they had no other…. We've special cased this provider so that their 1899 dates—and theirs alone—are ignored. You should see the improvements live on Google Books by the end of September.”
Another metadata provider used 1905 as a default date—“this, by the way, is a large part of the reason why there are so many books purportedly mentioning ‘Internet’ prior to 1950.” Orwant suggested Nunberg misheard Clancy as blaming libraries: “I talk to Dan all the time about metadata and can assure you that he has a thorough understanding of the problems.”
While Nunberg suggested that errors came from trying to automate the extraction of pub dates from the OCR'd text, Orwant said they were human errors—which alarmed Nunberg. He responded: “Which only goes to show that the Turing test can work both ways: do something dumb enough, and it's hard to tell you from a machine.”
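The provider-specific fix Orwant describes, in which one provider's 1899 dates "and theirs alone" are ignored, can be illustrated with a short sketch. This is not Google's code: the provider names, the `usable_year` and `pick_year` helpers, and the majority vote across sources are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional

# Hypothetical table: provider id -> year that provider uses as a
# placeholder when it has no real publication date.
PLACEHOLDER_YEARS = {"brazilian_provider": 1899}


def usable_year(provider: str, year: Optional[int]) -> Optional[int]:
    """Return the year, or None if it is this provider's placeholder."""
    if year is None:
        return None
    if PLACEHOLDER_YEARS.get(provider) == year:
        # Ignore the placeholder only for the provider known to use it;
        # an 1899 date from any other source is still trusted.
        return None
    return year


def pick_year(claims: list[tuple[str, Optional[int]]]) -> Optional[int]:
    """Pick the most frequently reported usable year (assumed policy)."""
    counts = Counter(
        y for y in (usable_year(p, yr) for p, yr in claims) if y is not None
    )
    return counts.most_common(1)[0][0] if counts else None


claims = [
    ("brazilian_provider", 1899),
    ("library_x", 1950),
    ("library_y", 1950),
]
print(pick_year(claims))
```

Keying the rule on the provider rather than on the year itself matters here, since a blanket rule against 1899 would also discard books genuinely published that year.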
Classification errors & BISAC
Orwant noted, “When we lack a BISAC category for a book, we try to guess one. We guess correctly about 90% of the time and Geoff's comments prompted the engineer responsible to suggest some improvements that we will roll out over the coming months.”
“While it's true that BISAC didn't exist when many of these books were published, it's not the case that Google necessarily invented the BISAC classifications for them," he wrote. "Sometimes we did, but often commercial metadata providers (not publishers or libraries) provided them, for the benefit of retailers.”
“Geoff asks why we decided to infer BISAC subjects in the first place. There is only one reason: we thought our end users would find it useful,” Orwant wrote. “If the accuracy needed is in excess of what we can provide, we'll simply stop inferring BISAC subjects and chalk it up to a failed experiment.”
Nunberg questioned whether Google had spoken to scholars, who would prefer a library classification scheme instead.
Improvements coming
Orwant noted that “Geoff's efforts will have singlehandedly improved nearly one million metadata records in our repository once the code changes that his blog post inspired wend their way through our systems. While I winced at times reading his message and the conclusions he drew about our intentions and abilities, I can't deny that he's done Google a great service via his research.”
OCR questions
In another criticism of Google quality control, Computer Shopper reported that Google’s free eBook downloads in ePub format for titles out of copyright suffer from errors in the OCR process, making some completely unreadable.