Text-Mining Ahead: HathiTrust Research Center to Open Corpus to Researchers

By David Rapp, Apr 28, 2011

The digital repository HathiTrust, in partnership with technology centers at Indiana University and the University of Illinois at Urbana-Champaign, recently launched the HathiTrust Research Center, which will apply the two institutions' considerable computing power to HathiTrust's millions of digitized volumes for text-mining projects serving the humanities.

The HathiTrust repository contains nearly 8.6 million digitized volumes, most of them the fruit of Google's scanning efforts. Initial text-mining efforts, however, will explore only public-domain materials, though the infrastructure could easily be scaled up to cover all materials. About 2.2 million of HathiTrust's works (roughly 26 percent) are in the public domain, and University of Michigan Library researchers are unearthing even more via the Copyright Review Project.

Fostering "non-consumptive" research
Under the proposed Google settlement, rejected by Circuit Court Judge Denny Chin last month, copyrighted works Google had digitized would have been made open to in-depth textual analysis by researchers.

It's called "non-consumptive" research, as researchers wouldn't be reading, or consuming, the material; they would simply be analyzing text—from simple repetitions of words to far more complex linguistic structures. It would allow analysis that would be logistically impossible with physical books. (Google's Ngram Viewer provides an example of how this kind of analysis can be used to hunt for specific words in such a massive data set.)
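To make the idea concrete, here is a minimal sketch of the simplest kind of analysis described above: counting repetitions of words across a set of texts without anyone reading them. The two sample strings and the `word_counts` helper are hypothetical stand-ins invented for illustration, not part of HathiTrust's or Google's actual tooling, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter

def word_counts(volumes):
    """Return aggregate word counts across an iterable of text strings."""
    counts = Counter()
    for text in volumes:
        # Lowercase, then treat each run of letters/apostrophes as a word.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

# Hypothetical stand-ins for digitized public-domain volumes.
sample_volumes = [
    "The whale, the whale! cried the sailor.",
    "A whale is no fish, said the captain.",
]

counts = word_counts(sample_volumes)
top_two = counts.most_common(2)
print(top_two)
```

The researcher sees only the tallies, never the running text, which is the sense in which such work is "non-consumptive."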

Copyright limits the extent of what can be analyzed, but public-domain materials stored by HathiTrust are unquestionably fair game. And one thing is certain: the computing infrastructure necessary to run such massive analyses on millions of works is considerable—the kind available only at a limited number of academic institutions.

Beth Plale, a computer science professor and director of the Data to Insight center at Indiana University, told LJAN that the Center envisions a system in which researchers submit complex algorithms to the Center, which then uses the significant computing resources already available at both universities to execute those algorithms on the researchers' behalf.
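One way to picture the arrangement Plale describes is as a function handed to a service that holds the texts: the researcher supplies the algorithm, the service runs it, and only derived results come back. The sketch below is purely illustrative. The names `run_job`, `_CORPUS`, and `count_long_words`, and the two sample volumes, are assumptions made up for this example and say nothing about how the Center's system is actually built.

```python
from typing import Callable, Dict, Iterable

# Hypothetical corpus held by the service; the researcher never reads it directly.
_CORPUS: Dict[str, str] = {
    "vol-1": "Call me Ishmael",
    "vol-2": "It was the best of times",
}

def run_job(
    algorithm: Callable[[str], Dict[str, int]],
    volume_ids: Iterable[str],
) -> Dict[str, Dict[str, int]]:
    """Run a researcher-supplied algorithm on each volume and return only its output."""
    results = {}
    for volume_id in volume_ids:
        results[volume_id] = algorithm(_CORPUS[volume_id])
    return results

def count_long_words(text: str) -> Dict[str, int]:
    """Example algorithm: total words and words longer than six characters."""
    words = text.split()
    return {
        "words": len(words),
        "long_words": sum(1 for word in words if len(word) > 6),
    }

results = run_job(count_long_words, ["vol-1", "vol-2"])
print(results)
```

A real service would also have to vet submitted code and check that its output does not simply reproduce the text, neither of which this sketch attempts.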

First steps
The Center has also reached out to researchers at other universities, such as the University of Chicago, who are working on text-mining and data retrieval projects, said Plale. The Center could, for example, initially create specialized collections in the HathiTrust catalog concentrating on specific subjects.

The Center has also been in discussion with Project Bamboo (an Andrew W. Mellon Foundation-funded partnership of ten universities, including Indiana and Illinois, that aims to develop humanities research technologies) about making the two projects interoperable at the service and software level, "so our systems can talk to one another," Plale said.

At this early stage the Research Center is working on necessary computing infrastructure. Internal funding is supporting this initial phase, as a proof-of-concept prototype gets up and running, said Plale. But as it grows, it will undoubtedly need to expand, and to that end it has applied for a Sloan Foundation grant. The rejected Google settlement also provided for millions of dollars for research projects—and the Center would be a likely candidate for such funding if a similar settlement eventually goes through.

Plale said that the Center plans to host an open meeting to discuss the project at a digital humanities seminar at Stanford University in June.




Reader Comments (2)


I am curious if HathiTrust plans to allow 'non-consumptive' research access to even in-copyright materials based on the assumption/argument that this is fair use, even in the absence of the settlement.

Posted by Jonathan Rochkind on May 2, 2011 06:51:07PM

"And one thing is certain: the computing infrastructure necessary to run such massive analyses on millions of works is considerable—the kind available only at a limited number of academic institutions." Well, right now maybe, but give Moore's Law a little time and your little smart phone will be able to do everything you'll want.

Posted by John on May 2, 2011 09:42:29PM

Welcome to the LJ Archives.

This archive site is home to all LJ articles published prior to January 2012.
©2011 Media Source, Inc., All rights reserved.