Historical Newspapers Project Grows

Since the initiative's 2001 launch, ProQuest has scanned more than 125 million documents

By Michael Rogers -- Library Journal, 5/1/2005

ProQuest Information and Learning's ambitious Historical Newspapers project launched with much anticipation in 2001. Four years later, the project has made admirable progress, with 125 million documents expected to be digitized by June. The company's purpose, however, is not just to push content out onto the web in hopes that libraries can use it. ProQuest's philosophy is to bring the end user into the creation process.

Publications in the program include the New York Times (1851–2001), Wall Street Journal (1889–1997), Washington Post (1877–1988), Los Angeles Times (1881–1984), Chicago Tribune (1849–1984), Boston Globe (1872–1922), Atlanta Constitution (1868–1925), and Christian Science Monitor (1908–1991). All of the digital files begin with the first issue, and most run until the publication itself began offering electronic editions. Together, they represent 16 terabytes of data, which doubles the size of ProQuest's digital stockpile and makes the project the single largest investment in the company's history. Each publication is presented in full-image scans and ASCII text, allowing for full-text searching.

The bulk wasn't the project's toughest hurdle. Rod Gauvin, ProQuest's senior VP, marketing and publishing, told LJ that the quality of the source material sometimes proved challenging: not all of the microfilm was in pristine condition, often forcing ProQuest to use multiple sources to produce a publication's complete run. The diversity of newspaper sizes also proved demanding.

Zones

Librarians had repeatedly told ProQuest they wanted a digitized newspaper project, recalled Joe Mills, ProQuest VP of manufacturing operations, but before tackling it the company had to look at the big picture and think through all implications. The firm wanted the database to be searchable through the standard keyword box for specific items but also hoped to replicate the feel of a real newspaper, thereby doing double duty for both researchers looking for exact information and general browsers.

Along with keyword, the database is searchable by date, issue, relevancy, and article type (obits, editorials). Abstracts consist of the headline and the first 40 lines of text. Images appear in either JPEG or TIFF format at 300 DPI. Results can be displayed at the article and page level; pages were scanned whole, so ads, comics, etc., can be accessed.

Although the articles appear in their entirety, each is divided into zones by an editor, with the headline being one zone, the first block of text being another, and so on. Zoning produces more detailed results, creates citation/abstract functionality, improves OCR (optical character recognition) rate for searchable ASCII text, allows article-level blocking with no text loss, and presents smaller, sharper image files.
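As a rough illustration of the zoning idea, here is a minimal Python sketch. It is not ProQuest's implementation: the class names, the "headline" and "text" zone labels, and the ABSTRACT_LINES constant are all hypothetical, chosen only to show how an article split into zones could yield an abstract made of the headline plus the first 40 lines of text.

```python
from dataclasses import dataclass, field
from typing import List

# The article states that abstracts use the first 40 lines of text.
ABSTRACT_LINES = 40


@dataclass
class Zone:
    """One editor-drawn region of an article (hypothetical model)."""
    kind: str          # assumed labels: "headline" or "text"
    lines: List[str]   # OCR text lines recovered from this zone


@dataclass
class Article:
    """An article represented as an ordered list of zones."""
    zones: List[Zone] = field(default_factory=list)

    def headline(self) -> str:
        # Use the first headline zone, if the article has one.
        for zone in self.zones:
            if zone.kind == "headline":
                return " ".join(zone.lines)
        return ""

    def body_lines(self) -> List[str]:
        # Concatenate the text zones in reading order.
        return [line for zone in self.zones
                if zone.kind == "text"
                for line in zone.lines]

    def abstract(self) -> str:
        # Headline followed by at most ABSTRACT_LINES lines of body text.
        return "\n".join([self.headline()] + self.body_lines()[:ABSTRACT_LINES])
```

Because each zone carries its own text, an article that continues in a second block simply contributes another "text" zone, and the abstract is cut off at 40 lines regardless of where the zone boundaries fall.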

Given the age of many of the sources, both paper and microfilm, not all images are 100 percent clear. Gauvin said that the company is continually enhancing and cleaning content even after it's been digitized. Electron beam recorders are used to scan the original paper sources where possible, which reproduces every pixel. ProQuest then creates 35mm microfilm, still a valid preservation medium. "Digital" is not archival, Gauvin said.

©2009 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.