$600K Grant to Fund Indiana University-Led HathiTrust Text-Mining Project
By David Rapp, Aug 12, 2011
The Alfred P. Sloan Foundation has awarded a $600,000 grant to a project to be led by the Data to Insight Center (D2I) at Indiana University (IU), exploring ways of conducting secure "non-consumptive" research using copyrighted digital works in the HathiTrust repository. D2I will partner on the project with the University of Michigan's Department of Electrical Engineering and the multi-institutional HathiTrust Research Center.
In non-consumptive research, massive data sets of text are analyzed using computerized algorithms. Such research can be freely done on public-domain works, which make up about 27 percent of the 9.5 million volumes in the HathiTrust corpus.
The newly funded research, however, will focus on ways in which non-consumptive research may be done securely on copyrighted materials in the repository, without fear of data leaks. Such security measures are necessary because the plan is for outside researchers to submit their own algorithms to the HathiTrust Research Center, which will execute them on the researchers' behalf.
To protect the copyrighted data, the project will focus on creating a "data capsule framework" prototype, based on research by Atul Prakash of the University of Michigan, one of the project's principal investigators.
Another principal investigator, Beth Plale, the director of the Data to Insight Center and a computer science professor at IU, explained the "data capsule" as a sort of "virtual machine" of computer code that permits only specifically approved functions of a given algorithm to run, thus preventing possible leaks of copyrighted information.
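As a rough illustration of the idea Plale describes, the toy Python sketch below runs only functions that appear on an approved list and refuses everything else. It is not the project's data capsule framework or Prakash's design; the class name `DataCapsule`, the approved functions, and the sample text are all hypothetical, chosen only to show how an allow-list can let aggregate results out while keeping the full text in.

```python
from collections import Counter


def word_counts(text):
    """Return word frequencies, an aggregate result rather than the text itself."""
    return Counter(text.lower().split())


def volume_length(text):
    """Return the number of whitespace-separated words in a volume."""
    return len(text.split())


# Hypothetical allow-list: only these functions may touch the protected texts.
APPROVED_FUNCTIONS = {
    "word_counts": word_counts,
    "volume_length": volume_length,
}


class DataCapsule:
    """Toy stand-in for a capsule holding protected texts (illustrative only)."""

    def __init__(self, protected_texts):
        # Maps a volume ID to its full text; never returned directly.
        self._texts = dict(protected_texts)

    def run(self, function_name):
        """Run an approved function over every volume; reject anything else."""
        if function_name not in APPROVED_FUNCTIONS:
            raise PermissionError(f"function not approved: {function_name}")
        func = APPROVED_FUNCTIONS[function_name]
        return {vol_id: func(text) for vol_id, text in self._texts.items()}


if __name__ == "__main__":
    capsule = DataCapsule({"vol1": "the cat and the hat"})
    print(capsule.run("word_counts")["vol1"]["the"])
    try:
        capsule.run("read_full_text")
    except PermissionError as err:
        print(err)
```

A real system would also need to isolate the submitted code from the network and file system, which a simple allow-list like this does not attempt.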
HathiTrust launched the HathiTrust Research Center in partnership with technology centers at IU and the University of Illinois at Urbana-Champaign in April 2011, as LJ reported. The aim of the partnership is to explore potential text-mining projects using HathiTrust's digitized texts and the considerable computing power of the universities' research labs.







