Historical Newspapers Project Grows
Since 2001 initiative launch, ProQuest has scanned more than 125 million documents
By Michael Rogers -- Library Journal, 5/1/2005
ProQuest Information and Learning's ambitious Historical Newspapers project launched with much anticipation in 2001. Four years later, the project has made admirable progress, with 125 million documents expected to be digitized by June. The company's purpose, however, is not just to push content out onto the web in hopes that libraries can use it. ProQuest's philosophy is to bring the end user into the creation process.
Publications in the program include the New York Times (1851–2001), Wall Street Journal (1889–1997), Washington Post (1877–1988), Los Angeles Times (1881–1984), Chicago Tribune (1849–1984), Boston Globe (1872–1922), Atlanta Constitution (1868–1925), and Christian Science Monitor (1908–91). All of the digital files begin with the first issue, and most go up until the publication itself began offering electronic editions. Together, they represent 16 terabytes of data, which doubles the size of ProQuest's digital stockpile, making the project the single largest investment in the company's history. Each publication is presented in full-image scans and ASCII text, allowing for full-text searching.
The bulk wasn't the project's toughest hurdle. Rod Gauvin, ProQuest's senior VP, marketing and publishing, told LJ that the quality of the source material sometimes proved challenging, as not all of the microfilm was in pristine condition, often forcing ProQuest to use multiple sources to produce a publication's complete run. The diversity of newspaper sizes also proved demanding.
Zones
Librarians had repeatedly told ProQuest they wanted a digitized newspaper project, recalled Joe Mills, ProQuest VP of manufacturing operations, but before tackling it the company had to look at the big picture and think through all implications. The firm wanted the database to be searchable through the standard keyword box for specific items but also hoped to replicate the feel of a real newspaper, thereby doing double duty for both researchers looking for exact information and general browsers.
In addition to keyword, the database is searchable by date, issue, relevancy, and article type (obits, editorials). Abstracts consist of the headline and the first 40 lines of text. Images appear in either JPEG or TIFF format at 300 DPI. Results can be displayed at the article and page level; pages were scanned whole, so ads, comics, etc., can be accessed.
Although the articles appear in their entirety, each is divided into zones by an editor, with the headline being one zone, the first block of text being another, and so on. Zoning produces more detailed results, creates citation/abstract functionality, improves the OCR (optical character recognition) rate for searchable ASCII text, allows article-level blocking with no text loss, and presents smaller, sharper image files.
Given the age of many of the sources, both paper and microfilm, not all images are 100 percent clear. Gauvin said that the company is continually enhancing and cleaning content even after it's been digitized. Electron beam recorders are used to scan the original paper sources where possible, which reproduces every pixel. ProQuest then creates 35mm microfilm, still a valid preservation medium. "Digital" is not archival, Gauvin said.