Unlocking HathiTrust: Inside the Librarians' Digital Library 

What every librarian should know about this huge “digital library by libraries for libraries”

By Char Booth with Heather Christenson & Paul Fogel
Jul 15, 2011

For the extended version of this article, see the original that ran in LJ's Academic Newswire.

I had a great lesson in misperception correction not long ago as I was writing an LJ post called “A Rising Tide: The Academic User and the Ebook Experience” (see ow.ly/5hDSw), which examined academic ebooks from a (critical) user perspective. While working with a faculty member at the University of California (UC)–Berkeley on trademark research, we discovered a series of promising volumes in HathiTrust—an emerging digital library stemming from research library collaboration with Google Books and other initiatives—only to find that poor scanning and metadata made the volumes virtually unusable.

At the time, I took this as evidence of widespread and often discussed problems within large digitized collections. In reality, I had stumbled upon a rather unusual error that, once brought to their attention, was rapidly addressed by HathiTrust staff.

Even after this positive resolution, lingering assumptions still led me unintentionally to mischaracterize the digital library landscape as I wrote. Luckily, when I worked with HathiTrust on fact-checking, I received a rude awakening (albeit politely communicated). After being graciously schooled on a point or two and rewriting half of the piece, I started to wonder why, as a public services librarian and end user, I had harbored so many misplaced beliefs about digital libraries in the first place.

From the inside out
The more I communicated with HathiTrust personnel, the more I realized that they are a surprisingly small group of library-minded folks doing a herculean job: not only participating in mass digitization projects with Google and the Internet Archive but also building a new, large-scale digital library with its own features and services. Enter the inspiration for this follow-up interview: to correct my (and others’) misperceptions about this important and emerging librarians’ digital library. If you don’t know what HathiTrust is yet, you should. I set out to ask Heather Christenson and Paul Fogel, two individuals on HathiTrust’s front lines, to offer a rare inside view of this vast (and growing) collection.

[For the extended interview from which this is excerpted, see ow.ly/5hxbM.]

Booth What’s the HathiTrust elevator pitch?
Christenson In a world where commercial forces have staked out wide swaths of digital territory, HathiTrust is a digital library by libraries for libraries—and on a huge scale. We continue and extend libraries’ traditional role—building, curating, and preserving useful collections and providing the best access possible to patrons—in the digital realm. HathiTrust is a collaborative enterprise, supported by funding and in-kind contributions from participating libraries.

And what’s with the name?
Christenson Yes, that is a popular question! The name was chosen to express the values of our organization. Hathi (pronounced hah-tee) is the Hindi word for elephant, an animal noted for its memory, wisdom, and strength—that’s the origin of the elephant on our logo. Trust is a core value of libraries.

How did HathiTrust come about?
Christenson In 2008, major university libraries across the United States came together to provide a common point to preserve and provide access to the millions of books they have digitized with Google and other partners. Several busy and collaborative years later, HathiTrust is a robust shared repository of books (and book-like volumes) that continues to grow at a rapid pace. The University of California is one of its founding partners, and the California Digital Library [CDL] coordinates and contributes to activities across the UC campuses related to mass digitization (including HathiTrust).

How is HathiTrust used?
Fogel You can read books. You can download public domain volumes, either a page at a time or the full book in some cases. You can build your own collections of books, share them with others, view others’ collections and search across the full text of any (public) collection. Libraries have tons of access points into Hathi: APIs, metadata feeds, datasets, widgets. These can be used to build functionality that is limited only by the developer’s creativity. Mash it up!

Who coordinates and contributes?
Christenson It’s an evolving process—HathiTrust’s infrastructure is hosted at the University of Michigan (with a backup at Indiana University), so our UM team has provided primary support and development. Other partners have contributed to development work; for example, California Digital Library collaborated closely with Michigan to build the path to get Internet Archive–digitized volumes into HathiTrust, paving the way for other partners.

What does HathiTrust do that Google Books or the Internet Archive doesn’t?
Christenson One of the most important distinctions is that HathiTrust has a stated intention to preserve digital volumes over the long term. Our goal is for the researcher to be able to use these items in 20 years, 50 years, and onward. Although much of the content in HathiTrust is the product of scanning partnerships with Google, HathiTrust has an increasing amount of content not found in Google Books. We have included content from other major digitization projects such as the Internet Archive and Microsoft-funded work with the Internet Archive and other digitization agents. Some HathiTrust partners such as UM have contributed locally digitized collections as well.

HathiTrust also has a number of services and initiatives to make our data open and available. For example, we provide a regularly updated file of the aggregate HathiTrust bibliographic data so that libraries can tap into it to provide links in local catalogs and discovery tools. The HathiTrust Research Center has just been launched in an agreement with Indiana University and the University of Illinois; through it, scholars will soon be able to use the entire full text of our public domain volumes for computational research.

How is HathiTrust important to libraries, writ large?
Fogel There is a lot of talk right now about a national digital library, and to me HathiTrust is a key effort in that direction that is already a reality. There are clear advantages to aggregation at this scale: breadth of coverage; heightened visibility; savings for individual institutions for content storage and resource sharing; the development of community standards and technical expertise. Ultimately, it means that more users’ needs are served. Libraries have long focused on preserving print collections, and it only follows that they should also preserve digital materials. But preservation should be in the service of access and use.

Christenson [HathiTrust] benefits from an economy of scale, driving storage costs down for a digital repository and services that we all share. With pooled records, we can take a more holistic look at our metadata and solve problems specific to how metadata designed for print associates with digital content and how that plays out in discovery services.

What is HathiTrust doing to make works more open?
Christenson Since 2008, UM has been working on an IMLS-funded distributed system called the Copyright Review Management System (CRMS), including a database, that facilitates copyright research and enables us to share the work. A number of HathiTrust partners are now contributing to this effort, and many more will follow as the work expands. Through this work, over 125,000 volumes have already been released into the public domain. [See ow.ly/5hDGd for LJ’s coverage of “HathiTrust’s Copyright Detectives.”—Ed.]

What are some of the technical challenges?
Christenson In some ways, technical matters are far easier than policy and organizational challenges. I would say that the HathiTrust approach to technical challenges so far has been “bring ’em on!” The HathiTrust full-text search is a case in point. UM developers built the full-text search, and as the HathiTrust partnership grew and content ballooned, they kept up, eventually hitting the boundaries of the indexing formatting technology we were using and requiring an enhancement from the developers of this particular open source “search engine library.” We’re now indexing literally billions of pages.

What implications does HathiTrust have for print?
Fogel I hear that question a lot. Storing books is expensive. Given the state of library funding these days, everyone is trying to find ways to save money. But to my thinking, Hathi provides additional access points and new services such as full-text search, data mining, etc., rather than an alternative to print collections. It is in the service of making all of the stuff in those books that much easier to discover.

Christenson I think we have a lot more to learn about what user behavior surrounding the digital vs. print dynamic really is. And, of course, books that are in copyright are not viewable online, so that certainly must factor into decisions about print.


Author Information
Char Booth (charbooth@gmail.com) is Instruction Services Manager & E-Learning Librarian at Claremont Colleges Library, CA. She blogs at infomational.com, tweets @charbooth, and is the author of Reflective Teaching, Effective Learning and Informing Innovation. Heather Christenson is Mass Digitization Project Manager and HathiTrust Project Manager at the University of California’s California Digital Library (CDL). A restaurant industry and dot-bomb survivor, Paul Fogel is currently Technical Lead for Mass Digitization and Cotechnical Lead for the HathiTrust at CDL.



©2011 Media Source, Inc., All rights reserved.