The LJ Academic Newswire Newsmaker Interview: Brad Wheeler on the HathiTrust

Josh Hadro -- Library Journal, 1/9/2009

(This interview first appeared in the January 8 issue of the LJ Academic Newswire.)

The launch of the HathiTrust repository made quite a splash in 2008, landing at position number four in the LJAN list of the ten most important academic library stories of the past year. In the LJAN interview with HathiTrust executive director John Wilkin, we heard all about the philosophy behind the project and its unprecedented scale, but what about the nuts and bolts of preserving so many terabytes of research materials for posterity? To get a handle on the technology driving this mammoth effort, we recently caught up with Brad Wheeler, Indiana University chief information officer and executive board member of the HathiTrust.

LJAN: We all know the story of the Library of Babel, in which Borges describes the stacks of the infinite library, but never their construction nor their maintenance — so what's the technology behind an access and preservation endeavor like HathiTrust?
BW: HathiTrust uses design principles that focus on supporting large-scale implementation, and that is essential for Hathi's ambitious mission. For example, the Isilon [clustered storage system] uses a notion of virtualization to help ingest very large numbers of items. Moreover, redundancy is managed in a way that factors in the large numbers of items and disks. In practice, it’s very important to be able to ingest a volume and not have to worry about where to put it.

Of course file storage does have an element of “where-ness” to it (you’ll note, for example, that we use what's called the pairtree directory naming convention, which makes ingesting or locating a particular volume self-evident), but you would want to avoid what happens in the conventional stacks, with “shifts” that need to take place periodically and where large areas are unused just in case. The virtualization gives us that.
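
To illustrate the pairtree idea mentioned above, here is a minimal Python sketch of how an identifier can be turned into a directory path. It follows the character cleaning and two-character splitting described in the published Pairtree specification; the function names are hypothetical, and this is not HathiTrust's actual ingest code or directory layout.

```python
# Sketch of the pairtree naming convention (function names are illustrative).
HEX_ENCODE = set('"*+,<=>?\\^|')  # characters the spec hex-encodes as ^hh


def clean_id(identifier: str) -> str:
    """Make an identifier safe to use in file system paths."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        # Non-visible ASCII, non-ASCII bytes and a few special characters
        # are replaced by a caret followed by two hex digits.
        if byte < 0x21 or byte > 0x7E or ch in HEX_ENCODE:
            out.append(f"^{byte:02x}")
        else:
            out.append(ch)
    # Second pass: single-character substitutions for '/', ':' and '.'.
    return "".join(out).translate(str.maketrans("/:.", "=+,"))


def pairtree_path(identifier: str) -> str:
    """Split the cleaned identifier into two-character directory names."""
    cleaned = clean_id(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(pairs)


if __name__ == "__main__":
    print(pairtree_path("ark:/13030/xt12t3"))
```

Because the path is computed from the identifier alone, no lookup table is needed to decide where a new volume goes or to find an existing one.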

In the same way, replication is something designed to work at scale (e.g., we’ll typically push tens of terabytes over the net every month, all automatically, as part of that effort to make sure that we have the same thing in more than one geographic location). Servers also use a degree of virtualization to spread processes over many boxes, and this, along with load balancing over each of the sites — currently at the University of Michigan and Indiana University (IU) — ensures that users can be served more reliably through better performance and, just in case, fail over [to redundant backup systems].

And how is all this managed?
The management of a system like this requires planning and coordination between system builders and managers at each of the sites. IU and Michigan involved our network engineers to ensure a clear 10 Gbps connection between the sites from machine to machine, and they had to do substantial tuning of the buffers on both ends to use the bandwidth effectively. Just the same, though this sort of dialog is important, the processes themselves tend to be automated.

So how do the digital preservation needs of a project on this scale differ from those of, say, a single smaller institutional repository?
One of the most important elements in the digital preservation strategy is the formality involved. In an effort like this, the trust of the participants depends on little being taken for granted and much being communicated broadly. You may have noticed the preliminary attention to the Trustworthy Repository Audit and Certification checklist. We are working with the Center for Research Libraries to give a fuller, formal assessment of the repository. We are currently working on a mechanism that provides meaningful monthly reports to participating institutions, giving specific attention to the contents deposited by that institution.

It's obviously difficult to search the full text of millions of books all in one shot — can you describe how the HathiTrust makes use of "shards," and how they help address some of the problems of searching across such a large corpus?
There are two primary elements that come to the fore with large-scale search: first, simply being able to search this amount of content expeditiously, and second, providing meaningful results. Shards come into play in the first instance. With so much content, a widely accepted approach is to break up the content into smaller chunks for indexing, and each of the resulting index “shards” is addressed by the search engine before presenting the results to the user. This allows the system to optimize searching around the resources available to it. For example, hypothetically, one might build indexes on 500,000 volumes at a time rather than one or two million volumes because the system configuration being used can fit the bulk of the 500,000 document index into memory. Of course when using shards the results are aggregated before being presented to the user, but by chunking things in this way, the system can work more quickly.
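
The sharding approach described above can be sketched in a few lines of Python. This is a toy illustration only: the tiny in-memory inverted indexes, the term-count scoring, and the names `build_shard`, `search_shard`, and `search_all` are assumptions made for the example, not a description of the search engine HathiTrust uses.

```python
import heapq
from collections import Counter


def build_shard(volumes):
    """Build a small inverted index (term -> {volume_id: count}) for one shard."""
    index = {}
    for volume_id, text in volumes.items():
        for term, count in Counter(text.lower().split()).items():
            index.setdefault(term, {})[volume_id] = count
    return index


def search_shard(index, term, k):
    """Return the top-k (score, volume_id) pairs from a single shard."""
    postings = index.get(term.lower(), {})
    return heapq.nlargest(k, ((score, vid) for vid, score in postings.items()))


def search_all(shards, term, k=10):
    """Query every shard, then aggregate the partial results into one ranking."""
    partial = []
    for shard in shards:
        partial.extend(search_shard(shard, term, k))
    return [vid for _score, vid in heapq.nlargest(k, partial)]


if __name__ == "__main__":
    shard_a = build_shard({"v1": "whale whale whale ship", "v2": "ship sea"})
    shard_b = build_shard({"v3": "whale sea", "v4": "whale whale"})
    print(search_all([shard_a, shard_b], "whale", k=2))
```

Each shard only has to return its own top results, so each index can stay small enough to be held largely in memory, and the final merge works on a short list of candidates.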

The second issue, sense-making, is particularly challenging, simply because the results presented to the user may also be so large. There are many reasonable approaches to providing results from a body of content this large. One approach involves using presentation strategies like facets, incorporating elements from the bibliographic data. Others might involve complex ranking strategies, scaffolding or other techniques. This is clearly a challenging problem that we’ll learn more about through experience.
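
As a rough illustration of the faceting strategy mentioned above, the following Python sketch counts how many results fall under each value of a bibliographic field, so that a reader could narrow a very large result set. The record fields (`language`, `decade`) and the function name are hypothetical and chosen only for the example.

```python
from collections import Counter


def facet_counts(results, field):
    """Count results per value of one bibliographic field, skipping records without it."""
    return Counter(record[field] for record in results if field in record)


if __name__ == "__main__":
    results = [
        {"id": "v1", "language": "eng", "decade": "1890s"},
        {"id": "v2", "language": "ger", "decade": "1890s"},
        {"id": "v3", "language": "eng"},
    ]
    print(facet_counts(results, "language").most_common())
    print(facet_counts(results, "decade").most_common())
```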

There are a couple of beta and test interfaces to the materials, but there's no official search function for the entire collection as of yet — how will the eventual public search interface factor into the overall project?
We believe that having a front door, so to speak, is important for HathiTrust, so readers should expect to see a public search interface out there in the next few months. You’ll probably have noticed that we’ve planned this as a multi-stage effort, with a more ambitious interface emerging over time and replacing the initial beta that we put in place in the February or March timeframe. That said, we do everything we can to make it possible for an individual institution to reflect the relationship between content stored in HathiTrust and held in print by that institution. We think this will be a key tool for our institutions as they help their users understand a broad definition of “holdings” and as we coordinate efforts around, for example, shared print storage strategies.

What sort of an endeavor was the rights management database, and what role does it play in giving other institutions access to materials in the HathiTrust repository?
The rights management database involved much planning in its design to make sure that it was flexible enough to accommodate an increasingly rich set of relationships between materials and their users. It’s important to mention that, currently, access to the rights management database is limited to trusted applications within the system, and before we open broader access, we’ll need to work through issues related to security. Because “authorization” takes place by consulting the rights management database and making determinations based on the user, it makes it possible to provide access under certain circumstances to one institution but not another (e.g., under Section 108 of US copyright law). Little in the way of this sort of institutional discrimination currently takes place, but we’ll probably see more of that in the future, including access to licensed resources for members of HathiTrust.
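
The authorization step described above, in which a decision is made by consulting a rights record and considering who the user is, might look something like the following Python sketch. The record fields, status values, and institution names are assumptions invented for illustration; the interview does not describe the actual schema or rules of the rights management database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RightsRecord:
    """Hypothetical rights entry for one volume."""
    volume_id: str
    status: str  # assumed values: "public_domain" or "in_copyright"
    permitted_institutions: frozenset = frozenset()


def can_view(record, user_institution):
    """Decide access from the rights record and the user's institution."""
    if record.status == "public_domain":
        return True  # open to everyone
    # Otherwise allow only institutions explicitly listed for this volume.
    return user_institution in record.permitted_institutions


if __name__ == "__main__":
    open_volume = RightsRecord("vol-1", "public_domain")
    restricted = RightsRecord("vol-2", "in_copyright", frozenset({"inst-a"}))
    print(can_view(open_volume, "inst-b"))
    print(can_view(restricted, "inst-a"))
    print(can_view(restricted, "inst-b"))
```

Keeping the decision in one function that reads from a central record is what makes it possible to grant access to one institution but not another without touching the stored content.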

How does HathiTrust serve as a model for future academic library endeavors?
I think HathiTrust is illustrative of one strategy that colleges and universities will use for cloud services that span campuses. In some cases, we’ll use commercial services like Google/Microsoft for sourcing common things like student email. In some cases, we’ll organize ownership of a cloud service and then contract its operation to a hosting provider. HathiTrust represents part of our core business in research and preservation. Thus, we have developed a university-owned and operated service that is developing its own competencies and content archive. Don’t miss the remarkable accomplishment of bringing the Committee on Institutional Cooperation institutions, the University of California campuses, and others into one collective project. This is an aggregation of essential scale that demonstrates new paths for shared infrastructure/resources among colleges and universities. Michigan and Indiana, as operators of the HathiTrust infrastructure, have benefitted tremendously from our past cooperation in the Sakai Project for open source courseware management software. Learning to work together across institutions to deliver production services is an important capability for the foreseeable future.

©2009 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.