Digital Preservation: Paradox & Promise

Richard Wiggins (netConnect) -- netConnect, 4/15/2001

Suppose one morning the world discovered that, with no advance notice, tens of thousands of pages of popular content had mysteriously disappeared from the web site of a major federal agency. The outcry would be immediate and loud. Senators would demand hearings. Pundits would proclaim depravity. Editorialists would decry the decline of public values.

Yet that is exactly what happened on January 20, 2001, Inauguration Day in the United States. When George W. Bush took over the presidency, he also took possession of the White House web site, www.whitehouse.gov. All of the previous content of that site and of its companion searchable document archive, www.pub.whitehouse.gov, was wiped clean and replaced with a skeleton site for the new administration. The result was a massive example of 'link rot' in one of the most popular sites on the web. AltaVista reported 170,000 links to the site--many of them 'deep links' (i.e., links to pages deep within the hierarchy of a web site)--that were suddenly broken. It is impossible to know how many thousands or millions of personal bookmarks were similarly trashed.

Of course, no one would expect that the new President would want to preserve a complete collection of the speeches and official communications of the Clinton administration. Historians, however, might take a different view; such documents are vital to analysis of a presidency. Citizens, too, might want to be able to look up a presidential document, whether it is a disaster decree from 1993 or the text of the pardon of Marc Rich.

 

A Digital Archive Disappears

In 1991, the network organization of the Committee on Institutional Cooperation (Big Ten schools plus the University of Chicago) saw a need to preserve electronic serials and launched the CICNet Journal Archive. This ambitious digital preservation project sought to capture and archive the content of numerous nascent electronic journals. The project attracted considerable attention among authors and editors of e-journals. Writing in PACS Review in 1996, Bonnie MacEwan and Mira Geffner said, 'Our long-term goal is to create a significant collection of electronic journals on the Internet, which scholars, libraries, and individuals around the world can access via the Web' (info.lib.uh.edu/pr/v7/n4/mace7n4.html).
Alas, essential funding never appeared, and CICNet itself ceased operations in 1997. The CICNet Journal Archive vanished with it. Ironic, indeed, to lose not a mere collection but an archive whose purpose was to prevent loss of electronic content. How many pioneering e-journals, many of them hosted on now-defunct Gopher servers, were lost for eternity?

NARA to the rescue
In the case of the White House site, not all is lost to history. The National Archives and Records Administration (NARA) has undertaken an effort to preserve several renditions of whitehouse.gov across the Clinton years--which happened to correspond to the years of the web revolution. While the NARA archive at clinton.nara.gov lacks much of the functionality of the former whitehouse.gov sites, at least the content itself is preserved--or so we are assured. (Over a month after the transition, NARA's site said, 'Please check back with us. We are busy processing records and will post our project schedule as soon as possible.')

In fact, not only did NARA take steps to preserve the content of the Clinton White House, it also asked all agencies in the Executive Branch to take 'snapshots' of their web sites as they existed at the end of the Clinton administration. On January 12, 2001--just eight days before the inauguration--Lewis J. Bellardo of NARA sent a fax to 'agency record officers and information resource administrators' with this request; news reports indicate many agencies had trouble complying. While one appreciates NARA's efforts, one wonders how much more could have been accomplished with more planning and notice.

This case exemplifies several of the challenges of preservation in the digital age. Not only is NARA motivated by its very mission to preserve the digital content under its auspices, but the Presidential Records Act of 1978 makes it the law. Alas, not every custodian of digital content operates under such motivation. In many cases, content may be born in digital form, live online for months or years, and then vanish--without a trace, never to be recovered.

Modes of digital death
The changing of the political guard is but a single cause of such disappearances. We can identify different modes of digital death:

  • The New Replaces the Old: Change needn't be as dramatic as the transition of power of the U.S. presidency. Every time an organization has new information to put on a web site or CD-ROM or other digital format, there is a tendency to publish the new content and simply to overwrite or toss the old.
  • Content Reorganization: Particularly in the web sphere, it is popular to reorganize content periodically: when a new master is assigned to care for the content, when significant change in the content structure has occurred, or when people deem the content organization 'stale.' Many an Error 404 stems from simple reorganization.
  • Death of a Sponsor: When the sponsoring organization of a collection dies, so too may its digital content. The day the Al Gore campaign conceded the 2000 election, its web site went dark. Anyone hoping to analyze documents on that site was instantly deprived of access.
  • Sponsor Loses Interest: Most traditional print publishers consider the 'back file' to be an asset worth protecting. Web-based publishers seem to lack this long-term view. For instance, Internet World, a print publication with a companion web site, has published since 1994. Until recently, a complete archive of back issues appeared on the web site. Now, the archive extends back only to July 1999.
  • Sponsor Fears History: In many cases, corporations may consciously avoid maintaining historical documents for fear of litigation. Ford, IBM, and other major companies have been sued in recent years for infractions alleged to have taken place over 50 years ago. Corporate document retention policies therefore tend to encourage disposal of documents. As corporate knowledge moves to the company intranet, corporations will face an increasing dilemma: to what extent are we protecting ourselves from potential litigation, and to what extent are we destroying corporate knowledge?
  • Lost Functionality: The disappearance of www.pub.whitehouse.gov, an MIT-developed site that offered rich search functionality, exemplifies this particular form of digital death. While NARA has preserved the raw content, the NARA search interface pales in comparison to the MIT product. Similarly, in February 2001, Deja.com (formerly DejaNews.com), which provided a searchable archive of postings to Usenet news groups, was taken over by Google. While Google acquired all of the 'intellectual assets' of Deja.com, it failed to preserve the search interface, breaking thousands of hyperlinks and making searches impossible for savvy Deja users. Google promises to restore the lost functionality in time; other takeovers may not be so sensitive to user demands.
  • Media Format Obsolescence: Anyone who owns data stranded on a 5¼-inch floppy disk knows the impact of media format changes. A newer storage technology supplants your format of choice; unless you take steps to copy old content to newer media, your data become stranded.
  • Content Format Obsolescence: Data stored in a proprietary format, such as an older version of WordPerfect or an obsolete tool such as PC Write, may be completely unreadable to current versions of software. Even web content could conceivably become obsolete, as the transition to newer generations of HTML or XML (or popular proprietary formats such as Flash) renders old content unusable with newer software.
  • Disaster: Whether a small-scale disaster (e.g., server meltdown) or a disaster that affects a campus (e.g., the 1994 earthquake that destroyed much of Cal State-Northridge), disaster can wipe out digital data whose sponsors have failed to provide for adequate offsite backup.

The cost paradox
We live in an era when digital storage costs are at an all-time low--costs so low that old-time computer folks marvel. Consumers can buy 50-gigabyte disk drives for $150--or only $3 per gigabyte. A decade ago, disk space cost almost 400 times more than it does currently. This apparent improvement carries costs as well. Writing in Library Journal in 1999, Stewart Brand cited the overarching paradox of digital preservation: people think of digital content as inexpensive--and inexpensive things are not worth preserving. And with greater storage comes the desire for ever larger files--digital feature films were unimaginably large just a few years ago. Still, the new storage capacities seem impressive.
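The arithmetic behind these figures is easy to check. The short sketch below uses only the numbers quoted above; the variable names are illustrative, and the decade-old price is simply what the 'almost 400 times' claim implies as an upper bound, not a figure taken from any price list.

```python
# Figures quoted in the text (2001 consumer prices).
drive_price_usd = 150.0
drive_capacity_gb = 50.0

cost_per_gb = drive_price_usd / drive_capacity_gb

# "Almost 400 times more" a decade earlier implies, at most, roughly this
# much per gigabyte (an inference from the text, not a quoted price).
implied_decade_ago_cost_per_gb = cost_per_gb * 400

print(f"Now: ${cost_per_gb:.2f} per GB")
print(f"A decade ago (implied): up to about ${implied_decade_ago_cost_per_gb:,.0f} per GB")
```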

The astonishing decline in the cost of disk space makes it possible to consider archiving extremely comprehensive corpuses. Brewster Kahle, founder of the Internet Archive (see sidebar), observes, 'People can only create textual information at the rate at which they can type. The amount of data that all the people on the planet could conceivably type in their lifetimes is no longer a scary number. Looking forward to the day when everyone on Earth can create content in digital form, we can conceive now of being able to archive every word ever typed.'

The raw cost of disk space is merely one component in the costs of offering digital content online (whether on the web or via some other online publishing medium). Those disks must be connected to servers; server hardware and software also cost money. For a large or popular collection, electricity begins to become a factor. Systems administration requires talents that are in short supply these days, adding significant costs.

Offline storage
In some cases, custodians of content may choose to archive it offline, rather than paying all of these costs of online storage. In this case, the custodians must choose an offline digital storage medium that they think will endure for a period of time. Unfortunately, we have not settled on such a single digital storage medium. End users of Windows machines tend to use varieties of Zip drives. UNIX and NT system administrators use any of a variety of tape formats, from 8mm tapes offered by Exabyte and other vendors (up to seven gigabytes per tape) to DAT (Digital Audio Tape, from eight to 24 GB) to DLT (from ten to 75 GB). These various forms of digital tape are not cheap; data-grade cassettes can cost from $10 to $50 apiece. Interestingly, it appears that per gigabyte disk costs are falling much faster than per gigabyte tape costs; increasingly, some data custodians choose disk-to-disk backup as cost-effective.

Contending with tape are various forms of optical storage. Optical disks have competed alongside magnetic disk and tape as a sort of hybrid of advantages, offering the random access of conventional disk and the portability of tape. Yet optical disk, and its magneto-optical cousin, has never managed to compete well in terms of price. In the last two years, the cost of CD-R (write-once, read-many compact disc) has plummeted, with CD-R discs readily available in bulk for as low as 25¢. CD-RW discs (rewritable) are beginning to fall in price as well, with availability at about $1.50 per disc. CD-RW drives are now standard equipment on new PCs and available for about $150 for old ones.

The CD format, however, is limited to 650 MB per disc--a huge number when CDs were introduced, but a small fraction of the capacity of the disks included on garden-variety computers these days. The new DVD-RAM format can support up to 5.2 GB of data on discs costing up to $30 apiece. DVD-RAM drives now cost as little as $600. DVD-R drives, which 'burn' DVDs that are indistinguishable from those pressed by mass market producers, cost several thousand dollars. Archivists may consider DVD-R a more enduring format than DVD-RAM.
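To compare these media on equal terms, it helps to reduce each to a cost per gigabyte. The sketch below does so using only prices and capacities quoted in this article. It assumes decimal units (650 MB = 0.65 GB), takes the bulk CD-R price and the top-end DVD-RAM disc price as given, and ignores the cost of drives; tape is left out because the text does not pair specific cassette prices with specific capacities.

```python
# (price per unit in USD, capacity per unit in GB), as quoted in the text.
media = {
    "CD-R": (0.25, 0.65),
    "CD-RW": (1.50, 0.65),
    "DVD-RAM": (30.00, 5.2),
    "50 GB hard disk": (150.00, 50.0),
}

def cost_per_gb(price_usd, capacity_gb):
    """Media cost only; drive hardware is not included."""
    return price_usd / capacity_gb

# Print the media from cheapest to most expensive per gigabyte.
for name, (price, capacity) in sorted(media.items(), key=lambda kv: cost_per_gb(*kv[1])):
    print(f"{name:16s} ${cost_per_gb(price, capacity):5.2f} per GB")
```

On these numbers, bulk CD-R is by far the cheapest per gigabyte, while DVD-RAM discs at the top of their price range cost more per gigabyte than a consumer hard disk.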

No matter what the medium, when digital content is preserved offline, custodians must be prepared to migrate the content to new formats. This may be necessary because the old format becomes obsolete, or it might be because the physical storage medium is in danger of wearing out. Digital cognoscenti argue about the shelf life of DAT vs. CD-R vs. conventional CD. Archivists agree the only answer is periodic digital inspection of a sample of the corpus and copying to new media as required. Some distinguish refreshing--copying to a fresh copy in the same medium and format--from migration--copying and translating to a new format to avoid obsolescence. Paper (at least acid-free paper) and microfilm have huge advantages here.

Of course, storage costs are not the only costs to consider. Dan Greenstein of King's College London notes that digital preservation resides in a context of digital content management, including Data Creation, Data Selection & Evaluation, Data Management, Resource Disclosure, Data Use, Data Preservation, and Rights Management. Each of these areas carries costs. To be sure, not all of these costs can be fairly laid at the door of digital preservation; rather, they represent a sort of life-cycle view of the costs of digital content.
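The periodic inspection of a sample that archivists recommend can be sketched in a few lines. The example below is one possible approach, not a description of any particular archive's practice: it records a SHA-256 checksum for every file when the collection is stored, then re-hashes a random sample later and reports files that are missing or altered. The function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import random
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Record a checksum for every file under root at archiving time."""
    root = Path(root)
    return {p.relative_to(root).as_posix(): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

def inspect_sample(root, manifest, sample_size, seed=None):
    """Re-hash a random sample; return the names that are missing or changed."""
    rng = random.Random(seed)
    names = rng.sample(sorted(manifest), min(sample_size, len(manifest)))
    damaged = []
    for name in names:
        path = Path(root) / name
        if not path.is_file() or sha256_of(path) != manifest[name]:
            damaged.append(name)
    return damaged
```

A non-empty result from `inspect_sample` is the signal to refresh the collection onto new media from a good copy; the manifest itself needs to be stored somewhere other than the medium it describes.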

 

The Internet Archive

A great deal of the digital content we care about exists on the web. Brewster Kahle's ambitious Internet Archive project attempts to capture all of the content on the publicly accessible web.

Kahle launched the Internet Archive in 1996 as a nonprofit cousin to the commercial service Alexa, which provides the 'Related Sites' functionality in Internet Explorer and Netscape Navigator. As Alexa watches the web surfing behavior of thousands of volunteers, it also gathers web content to contribute to the archive. The other primary contributor is Compaq Corporation.

The archive now consists of some 40 to 70 terabytes of content online, with data for 1999 through the present on disk. Content dating back to the start of the archive is on tape. Every month ten or so more terabytes of web content are contributed. By contrast, according to estimates by Kahle and by other experts, all the text of all the holdings of the Library of Congress would require about 17 terabytes of storage. The archive uses the same commodity IDE disk drives you can buy at the local Best Buy. Currently it employs 80 gigabyte Maxtor drives. Kahle says, 'Commodity disk costs are as low as $7000 per terabyte. We can't afford expensive storage.'
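A rough sense of scale follows from the figures in this sidebar. The sketch below is back-of-the-envelope arithmetic only: it assumes decimal units (1 TB = 1000 GB) and no redundancy or spare capacity, neither of which the text specifies, and the function name is illustrative.

```python
import math

archive_tb_low, archive_tb_high = 40, 70   # "some 40 to 70 terabytes"
monthly_growth_tb = 10                      # "ten or so more terabytes"
drive_gb = 80                               # 80-gigabyte Maxtor drives
cost_per_tb_usd = 7000                      # Kahle's quoted figure
loc_text_tb = 17                            # Library of Congress text estimate

def drives_needed(terabytes, drive_gb=drive_gb):
    # Assumes 1 TB = 1000 GB and no redundancy or spare capacity.
    return math.ceil(terabytes * 1000 / drive_gb)

print(drives_needed(archive_tb_low), "to", drives_needed(archive_tb_high), "drives")
print(f"${archive_tb_low * cost_per_tb_usd:,} to ${archive_tb_high * cost_per_tb_usd:,} in disk")
print(f"{monthly_growth_tb / loc_text_tb:.1f} Library of Congress text collections added per month")
```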

While the archive's initial focus is on the textual content of web pages, it is branching into multimedia content as well as nonweb content. For instance, the archive is experimenting with capturing television for on-demand streaming of old content (e.g., news programming) via the web.

Currently, only a limited number of scholars are granted access to the archive, via a proposal process. Kahle says about 200 scholars have been granted such access. Thus far, tools for navigating the archive are primitive. Kahle says the goal is to produce an interface whereby a scholar could 'dial back the clock' to a particular date and surf the archive's view of the web on that day.
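The 'dial back the clock' idea reduces to a simple lookup: for a given page, find the most recent capture made on or before the requested date. The sketch below shows that lookup under the assumption that captures for a URL are kept as a date-sorted list; it is an illustration of the concept, not a description of how the Internet Archive is actually built, and the sample dates and labels are invented.

```python
from bisect import bisect_right
from datetime import date

def snapshot_as_of(snapshots, when):
    """Return the latest (capture_date, content) pair captured on or before `when`.

    `snapshots` must be sorted by capture date. Returns None if nothing
    had been captured yet on that day.
    """
    dates = [captured for captured, _ in snapshots]
    index = bisect_right(dates, when)
    return snapshots[index - 1] if index else None

# Hypothetical capture history for one URL (dates and labels are invented).
history = [
    (date(1998, 6, 1), "capture A"),
    (date(2000, 11, 15), "capture B"),
    (date(2001, 2, 1), "capture C"),
]
print(snapshot_as_of(history, date(2001, 1, 19)))
```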

Born digital or digitized?
The same decline in the cost of disk storage also creates new applications that consume vast quantities of disk, such as digital video. Berkeley researchers Hal R. Varian and Peter Lyman have studied the question of how much content of all types the planet produces annually. They concluded, 'The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion GB of storage. This is the equivalent of 250 MB per person for each man, woman, and child on earth' (www.sims.berkeley.edu/how-much-info). Over 93 percent of new information produced is created in digital form--or 'born digital.' Lyman observes, 'Information produced by institutions (governments, corporations, universities) is far more likely to be preserved because institutions have an urgent need to create and preserve their own archival histories. But private individuals rarely have the motivation to preserve digital documents, and don't always have the resources or expertise to do so if they recognize the need.'
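The two figures in the Varian and Lyman estimate can be checked against each other. Dividing the yearly total by the per-person share gives the population the estimate implies; the calculation below assumes decimal units (1 GB = 1000 MB), which the quotation does not state.

```python
total_gb = 1.5e9          # "roughly 1.5 billion GB" per year
per_person_mb = 250       # "250 MB per person"

# Assumes decimal units (1 GB = 1000 MB).
implied_population = total_gb * 1000 / per_person_mb
print(f"Implied world population: {implied_population / 1e9:.1f} billion")
```

The result, about six billion, matches the world population at the time, so the two numbers are consistent.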

Even individuals or small organizations that recognize the need to preserve their digital content may find that the tools they choose fail them. Driveway.com, a provider of free web-based digital storage, had amassed a user base of some two million customers. On February 20, 2001, the company announced the demise of this service effective March 5. A customer on a two-week vacation would have returned to find precious files no longer online.

Whether for individuals or for institutions, however, the issue for custodians of digital content is which items and collections that are 'born digital' ought to be preserved. Not all content is worth the cost of preservation.

The copyright barrier
During the 2000 election campaign, the Internet Archive partnered with the Library of Congress to capture for posterity various campaign-related web sites. Yet the archive's Kahle indicates that questions about intellectual property rights cloud the ability to offer unfettered access to these materials, once freely available on the web. Rights management issues affect every digital preservation project. Elizabeth Yakel, a professor in the School of Information at the University of Michigan, observes:

In the print world, copyright legislation acknowledged that the institutions (research libraries) preserving the published heritage had to have some freedom and power to copy and reproduce materials in order to preserve the intellectual part of a work (particularly if the physical artifact could not be maintained). As we move into the digital world (both in terms of preservation of digitally born and digitally re-created documents), this dynamic has changed. Research libraries have legal restraints thrust upon them. Unfortunately, no other societal mechanism (the publishers, professional associations, etc.) now has the infrastructure in place to preserve e-books, e-journals, etc., in a structured way that would ensure that these materials are continually accessible in the future. What is at stake is the intellectual tradition of various fields and the ability for people in the future to understand our contemporary culture.

Is there hope?
While digital content can be ephemeral, digitization can also preserve content. Many projects to digitize paper or analog media incorporate preservation as a goal. For instance, the National Gallery of the Spoken Word (www.ngsw.org) seeks to digitize speech content from the Vincent Voice Library at Michigan State University, among other sources. Much of the Vincent collection is on old reel-to-reel audiotape that is brittle or decomposing. Once the recordings are converted to digital format, their prospects for preservation actually improve. Another example: working with the British Library, Kevin Kiernan of the University of Kentucky has produced scanned images of the only remaining original copy of Beowulf. The Electronic Beowulf thus preserves and extends access at the same time (www.uky.edu/~kiernan/eBeowulf/guide.htm).

Once content is in the digital realm, its custodians must fight the various modes of digital death. To solve the problem of media obsolescence, there is one simple approach: copy the digital content to new media as they evolve. While this is simple, it requires a concerted effort. No doubt many professors' offices contain vital data stored on obsolete media, and many library shelves offer software on media to which patrons no longer have access.

To combat format obsolescence, researchers propose several approaches:

  • Preserve the Software. This requires keeping the original software program used to view or edit digital content intact. The problem of preserving corresponding operating systems and hardware compounds the issue.
  • Refresh or migrate as appropriate.
  • Emulate. Provide software tools to emulate essential viewing applications or operating environments.
  • Encapsulate. When storing digital content, store not only the text, images, audio and video, etc., that end users consume, but also store the necessary information to interpret the content using new systems across time.
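Of the approaches listed above, encapsulation is perhaps the easiest to picture in code. The sketch below wraps a piece of content in a plain-text envelope that carries, alongside the bytes themselves, a statement of what they are, notes on how to interpret them, and a checksum. The field names and the use of JSON and Base64 are assumptions chosen for illustration; they do not represent any established preservation standard.

```python
import base64
import hashlib
import json

def encapsulate(payload, media_type, interpretation_notes):
    """Wrap raw bytes with the information a future reader needs to interpret them."""
    return json.dumps({
        "media_type": media_type,
        "interpretation_notes": interpretation_notes,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "payload_base64": base64.b64encode(payload).decode("ascii"),
    }, indent=2)

def unpack(envelope_text):
    """Recover the bytes and verify they match the recorded checksum."""
    envelope = json.loads(envelope_text)
    payload = base64.b64decode(envelope["payload_base64"])
    if hashlib.sha256(payload).hexdigest() != envelope["sha256"]:
        raise ValueError("payload does not match recorded checksum")
    return payload, envelope["media_type"], envelope["interpretation_notes"]
```

The envelope only helps if the interpretation notes are genuinely sufficient; describing a proprietary word-processing format well enough for some future system to read it is the hard part, and no wrapper solves that by itself.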

These strategies become progressively more theoretical. It is one thing to copy from a floppy disk to a CD-R; it is quite another to build a software emulator for WordStar under DOS or to store a Microsoft Word 2000 document in an encapsulated form understandable to some future software system.

No one would be foolish enough to assume the end of the march of technology, but there is some hope that media and format obsolescence may diminish. The adoption of the DVD (and soon DVD-RAM) as a widespread consumer format may provide a platform for data that will live for many years, if not decades.

The adoption of markup strategies such as XML could free us from proprietary formats that force the need for sophisticated techniques such as emulation and encapsulation. Advocates of markup methodologies such as SGML and XML have long argued that they provide the only viable solution for long-term reusability of content; if the old tool for processing a set of tags no longer exists, a new tool can be brought to bear on the problem easily.
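The claim that a new tool can easily be brought to bear on old markup can be illustrated with a generic XML parser. In the sketch below, the document and its tag names are invented; the point is that the code knows nothing about this particular tag set, yet can still recover the structure and the text. It uses Python's standard xml.etree.ElementTree module and lists only the top-level elements.

```python
import xml.etree.ElementTree as ET

# A hypothetical document whose original processing tool no longer exists.
old_document = """
<memo>
  <sender>Office of Records</sender>
  <topic>Retention schedule</topic>
  <body>Copy all files to new media by <deadline>March 5</deadline>.</body>
</memo>
"""

def outline(xml_text):
    """List each top-level element's tag and full text, with no knowledge of the tag set."""
    root = ET.fromstring(xml_text)
    return [(child.tag, "".join(child.itertext()).strip()) for child in root]

for tag, text in outline(old_document):
    print(f"{tag}: {text}")
```

What such a tool cannot recover is the meaning the original application attached to each tag, which is why advocates pair markup with documented schemas.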

These approaches address some of the technological aspects of digital preservation. The economic, social, and legal barriers may be the more daunting, however. Nevertheless, libraries and librarians have long had a mandate for preservation as a part of their mission. Jeanne Drewes, assistant director for access and preservation at Michigan State University Libraries, who consults with libraries that have suffered disasters, notes, 'Digital is the preservation issue now and for the foreseeable future. After many years of dealing with paper-based materials, libraries and archives have an understanding of the issues and the process for preserving materials.'


Richard Wiggins (wiggins@msu.edu; www.netfact.com/rww) is Senior Information Technologist at Michigan State University, East Lansing, and author of LJ's netConnect column.

 

Capturing Web Sites

There are two basic approaches to archiving a web site for posterity.

  • Preserve the content in the format in which it is stored on the server, whether flat-file HTML, Java or JavaScript, database records, etc. This assumes access to original content normally only granted to the webmaster or system administrator.
  • Capture the content as experienced by an end user browsing the site, employing a software tool. This can be done by anyone with access to such tools.

Archiving web content in its internal format assumes that a future reviewer will have all the tools needed to cope with the various forms of content on the site. Moreover, if the content is merely dumped to offline media in a native format, it will be necessary to restore not only the content but also the operating environment in which it was served. For instance, if a web site delivers database-driven content using ColdFusion or Active Server Pages, one would need to restore the database, the ColdFusion or ASP interface, the web server, and any supporting static HTML content in order to rebuild the site fully for perusal.

In order to archive a web site externally, software must play the role of an end user with a web browser. Microsoft Internet Explorer has a built-in feature to do limited archiving in this fashion. Other products, such as Offline Explorer Pro, provide more sophisticated control. External archiving faces its own challenges:

  • The software must translate URL links into a new, artificially constructed local document hierarchy that mirrors the structure of the target site. All hyperlinks must be translated into references to the archived version of the site.
  • The archiving software can only capture what it can simulate through HTML navigation. Some tools can handle image maps, but highly interactive navigation tools built in Java or JavaScript may inhibit archiving.
  • To archive a database, the tool must be able to enter items from a pick list as if a human were making selections. Otherwise, the external software will not be able to capture the content.
  • Tools vary in their ability to capture multimedia content. For instance, it may not be possible to capture local copies of streaming content, e.g., video clips in Real format.
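The first challenge in the list above, translating links into a local document hierarchy, can be sketched as follows. This is a simplified illustration and not how any named product works: the host/path layout of the mirror is an assumption, query strings and fragments are ignored, and only double-quoted href attributes are handled, where a real tool would use a proper HTML parser.

```python
from urllib.parse import urljoin, urlsplit
import posixpath
import re

def local_path(url):
    """Map an absolute URL onto a path in a local mirror (host/path layout is an assumption)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"          # give directory URLs a file name
    return parts.netloc + path

def rewrite_links(html, page_url, site_host):
    """Point href attributes for same-site links at the mirrored copies."""
    page_dir = posixpath.dirname(local_path(page_url))

    def replace(match):
        target = urljoin(page_url, match.group(2))
        if urlsplit(target).netloc != site_host:
            return match.group(0)     # leave off-site links untouched
        relative = posixpath.relpath(local_path(target), page_dir)
        return f"{match.group(1)}{relative}{match.group(3)}"

    return re.sub(r'(href=")([^"]*)(")', replace, html)
```

Because the rewritten links are relative, the mirrored copy can be moved to another disk or directory and still be browsed, which is the property an archive needs.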

Site-centric archiving of web content, undertaken by a webmaster, tends to be very complete but may require significant work to restore to a usable web experience. Client-centric archiving, done by an external party, tends to yield renditions of the site that can be navigated using any browser--but the rendition may be very incomplete.

©2008 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.