Tennant: Digital Libraries   




Hathi Trust Fun
August 25, 2008

The University of Michigan Library led the way in putting online the books that Google was digitizing from its collection, and to this day it remains one of the few (the only?) to have made the effort. It was subsequently joined by the other CIC (Committee on Institutional Cooperation) institutions in a joint effort to mount the texts being digitized from all CIC institutions. The resulting effort is called, oddly enough, the Hathi Trust.

The Hathi Trust is building a shared digital repository, and although there isn't any information on the web site yet, there is information at Indiana University about the project. In reading that, I noticed that they were offering brief records describing all of the contents of the repository. Since I find it difficult to resist such temptation (as my earlier effort to make the records of the University of Michigan's public domain books searchable will attest), I contacted them and got the records.

They come in one compressed tab-delimited file, with a very abbreviated record for each title. Nonetheless, I thought it would still be fun to search it and see what I could find. So in an incredibly misguided implementation, I chopped it into 1.4 million tiny little XML files (yes, I am an idiot), and indexed it with my favorite indexer, Swish-e (which is XML-aware and can thus provide fielded searching as in a database). The result can be seen and used on my prototype server as the Hathi Trust Search. Since this is my prototype server I don't worry about making it pretty or even about making it fully functional, which you will realize if you try to browse past about the first 5,000 hits of any search. So sue me.
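For the curious, the chopping step can be sketched roughly as follows. This is a minimal illustration in Python, not the script I actually ran: the column names in FIELDS and the sample row are made up, since the layout of the real file isn't described here, and the Swish-e configuration is left out entirely.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical column names; the real file's layout is not given in this post.
FIELDS = ["id", "title", "author", "date", "rights"]


def row_to_xml(row, fields=FIELDS):
    """Build one small XML record from one tab-delimited row."""
    record = ET.Element("record")
    for name, value in zip(fields, row):
        value = value.strip()
        # Skip empty columns so the record carries no blank fields.
        if value:
            ET.SubElement(record, name).text = value
    # ElementTree escapes characters such as & and < in the field text.
    return ET.tostring(record, encoding="unicode")


def records_from_tsv(handle, fields=FIELDS):
    """Yield (identifier, xml_string) for each non-empty row of a TSV stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row[0], row_to_xml(row, fields)


# Made-up sample row, for illustration only.
sample = "mdp.001\tWalden\tThoreau, Henry David\t1854\tpd\n"
for identifier, xml in records_from_tsv(io.StringIO(sample)):
    print(identifier, xml)
```

Writing each yielded string to its own file named after the identifier would reproduce the one-file-per-record approach described above; each element name then becomes a field that an XML-aware indexer can search on.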

Meanwhile, have fun. Take a look and see what's there or isn't on your favorite topics. Find out what's considered in copyright or out of it. And let me know, either as a comment here or by personal email, about anything interesting you find.

Posted by Roy Tennant on August 25, 2008 | Comments (6)


August 26, 2008
In response to: Hathi Trust Fun
Jeffrey Beall commented:

The reason the metadata records are brief is because your employer, OCLC, does not allow open access distribution of the full MARC records. So institutions like the University of Michigan have to strip down the records before openly distributing them. In this way OCLC is hindering access to information and hurting digital libraries.




August 27, 2008
In response to: Hathi Trust Fun
John Wilkin, AUL for LIT/TAS at U of Mic commented:

Jeffrey, OCLC's guidelines are not a factor in this. When we began planning this function with our partners, we struggled with a few questions about record distribution. Two examples may help to illustrate this. (1) In many cases, these records will come from the catalogs of our partner institutions, and the UM version of the record will immediately be out of synch with the source records from which these are derived. (2) Different institutions take different approaches to cataloging electronic resources, and the version at Michigan (a combined print and electronic record) may not be the preference for other institutions. We'll soon release documentation for this service, and those docs will make clear that what we intended here was a mechanism by which a 'consumer' institution could go to any of a number of different sources (including to Michigan's catalog or OCLC) to get full records. The body of content grows by hundreds of thousands of volumes per month, not a small flow for an institution seriously interested in keeping up with record changes. All of that said, we have only now released the mechanism and will continue to assess it with our partners.




September 30, 2008
In response to: Hathi Trust Fun
Rockwell commented:

Your Hathi search is fantastic! Much easier than the convoluted UM site! Thanks for the effort.




September 30, 2008
In response to: Hathi Trust Fun
Roy Tennant commented:

Rockwell, thank you for your kind comment. It really didn't take long to put it together at all, and UM/Hathi Trust really should get the credit for making the information available in the first place (they are still the only Google libraries who have).




October 1, 2008
In response to: Hathi Trust Fun
Rockwell commented:

Yes, I've noticed that U.C. hasn't done anything yet with their Google scans. What I like about Michigan is that it seems they have released all of their PD books and journals in full-view. Google holds back on some, especially post-1923 reprints of older PD journals. Sometimes, Google won't even release full views of pre-1923 works if there's a newer version they've scanned. With Michigan, it's all out there.




October 14, 2008
In response to: Hathi Trust Fun
bowerbird commented:

> The University of Michigan Library
> led the way in putting the books online

they make _pages_ available, but not _books_. you can't download a public-domain _book_ as a single-file, but rather have to grab each and every _page_ individually, which is a drag, and makes _remixing_ unreasonably difficult.

moreover, when viewing a-page-at-a-time, you have to choose between the text _or_ scan, when what you sometimes really want is to see the text _and_ the scan at the same time, so as to compare them to make sure the text is right.

-bowerbird





©2008 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.