Hathi Trust Fun
August 25, 2008
The University of Michigan Library led the way in putting online the books that Google was digitizing from its collection, and to this day it remains one of the few (the only?) to have made the effort. It was subsequently joined by the other CIC (Committee on Institutional Cooperation) institutions in a joint effort to mount the texts being digitized from all CIC institutions. The resulting effort is called, oddly enough, the Hathi Trust.
The Hathi Trust is building a shared digital repository, and although there isn't any information on the web site yet, there is information about the project at Indiana University. In reading that, I noticed that they were offering brief records describing all of the contents of the repository. Since I find it difficult to resist such temptation (as my earlier effort to make the records of the University of Michigan's public domain books searchable will attest), I contacted them and got the records.
They come in one compressed tab-delimited file, with a very abbreviated record for each title. Nonetheless, I thought it would still be fun to search it and see what I could find. So in an incredibly misguided implementation, I chopped it into 1.4 million tiny little XML files (yes, I am an idiot), and indexed it with my favorite indexer, Swish-e (which is XML-aware and can thus provide fielded searching as in a database). The result can be seen and used on my prototype server as the Hathi Trust Search. Since this is my prototype server I don't worry about making it pretty or even about making it fully functional, which you will realize if you try to browse past about the first 5,000 hits of any search. So sue me.
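For anyone curious what the chopping step looks like, here is a minimal sketch in Python. It is not the script I actually ran, and the column names in FIELDS are made up for illustration, since the real layout of the file isn't described here. It turns each tab-delimited row into a small XML record with one element per field, which is the kind of structure an XML-aware indexer can use for fielded searching.

```python
import csv
import io
from xml.sax.saxutils import escape

# Hypothetical column names; the real file's layout is an assumption here.
FIELDS = ["id", "title", "author", "rights"]


def row_to_xml(row, fields=FIELDS):
    """Wrap one tab-delimited row in a small XML record, one element per field."""
    parts = ["<record>"]
    for name, value in zip(fields, row):
        # escape() handles &, < and > so the record stays well-formed.
        parts.append(f"  <{name}>{escape(value)}</{name}>")
    parts.append("</record>")
    return "\n".join(parts)


def records_to_xml(lines, fields=FIELDS):
    """Yield one XML string per non-empty row of a tab-delimited stream."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row_to_xml(row, fields)


# A made-up sample row, not taken from the actual data.
sample = "example.001\tA Title & Subtitle\tDoe, Jane\tpd\n"

for xml in records_to_xml(io.StringIO(sample)):
    print(xml)
```

Because records_to_xml is a generator, the records can be written out one file each (as I did) or streamed somewhere more sensible without holding the whole file in memory.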
Meanwhile, have fun. Take a look and see what's there or isn't on your favorite topics. Find out what's considered in copyright or out of it. And let me know, either as a comment here or by personal email, anything interesting you find.
Posted by Roy Tennant on August 25, 2008