Saving Digital History

Archivists and librarians share similar challenges in preservation, says Jessamyn West

Mark Matienzo is a Washington, DC, archivist and the creator of ArchivesBlogs, a site that aggregates blogs by and for archivists. He uses technology to further the existing preservation and access goals of archivists and historians and jokes that when blogging software creators made the decision to name past entries “archives,” they made the job of looking for blogs about archives very, very difficult. “Many archivists either don't know or don't care to know enough about technology to think about putting it into practice,” Matienzo says. “[O]ne of the perennial problems for all archives is having a huge backlog, because archives usually accession [increase by addition] things pretty often. Since so many are understaffed or otherwise suffer from a lack of resources...the backlog gets bigger and bigger.” (See “Reassessing Backlogs,” pp. 8–9.)

Blogs and blogging tools are one way for archives to make digital content accessible to more people in more places. For instance, the Bentley Historical Library at the University of Michigan (UM) has put online the Polar Bear Expedition Digital Collections, a vast archive of materials relating to a U.S. military intervention in northern Russia at the end of World War I. Many of the soldiers in this action were from Michigan. The innovative online collection interface was created by professor Elizabeth Yakel and her students at UM's School of Information, who are attempting to create online finding aids that aren't solely electronic versions of paper documents. To this end, their archive allows and encourages commenting and cross-linking of files by site visitors. According to Matienzo, it allows UM to “unite physically separate collections intellectually without destroying provenance.” The main page of the site includes not only a description of the projects but also lists of recent new users and comments.

Talking into history

The social features of the Polar Bear site make it feel new and vital every time you look at it, not a usual feeling to encounter with archives. Many comments are left by people familiar with the subjects, and they provide additional facts or links to other online content with complementary information. One comment says, “Anthony Rataczak was my maternal uncle. He was born and raised in Poland and came to the United States with his father, John, and his mother, Frances, six sisters, Ann, Mary, Angela, Helen, Pauline, and Celia and one brother, Joseph, in 1905.” For archivists used to working with texts and documents of uncertain origin and content, this collaborative data discovery is thrilling. Another feature of this archive is its “link paths,” a shaded list of links appearing at the bottom of each page that points to other files in the collection with the caption, “Researchers who viewed this page also viewed.” This allows newer visitors to the site to gain knowledge about related materials before even becoming familiar with the collection. These collaborative archive 2.0 projects are very much the exception, even in digital archives. Matienzo says the shift toward decentralization of knowledge is coming up against an institution that is based, to a certain degree, on believing that knowledge could be held, preserved, and protected. “I think archivists are starting to recognize the importance of blogs and other electronic content,” he says. “They know that people are developing new uses for content, they know that technology is decentralizing knowledge. [How] archivists/repositories deal with that is a result of a lot of factors, particularly existing institutional culture.”
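The “Researchers who viewed this page also viewed” link paths described above can be approximated by counting which pages turn up together in the same visit. The article does not say how the Polar Bear site computes its lists, so what follows is only a minimal sketch of one plausible approach; the function name `also_viewed`, the session format, and the page names are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def also_viewed(sessions, top_n=3):
    """Map each page to the pages most often seen in the same visit.

    `sessions` is a hypothetical log format: one list of page names
    per visitor session. It is not taken from the Polar Bear site.
    """
    counts = defaultdict(Counter)
    for pages in sessions:
        # Count every ordered pair of distinct pages seen in one session.
        for page, other in permutations(set(pages), 2):
            counts[page][other] += 1
    return {
        page: [other for other, _ in tally.most_common(top_n)]
        for page, tally in counts.items()
    }


if __name__ == "__main__":
    log = [
        ["diary", "map", "roster"],
        ["diary", "map"],
        ["diary", "photo"],
    ]
    print(also_viewed(log, top_n=1)["diary"])
```

A real finding aid would probably also want to discount very popular pages, which otherwise show up in nearly every list.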

What librarians know

Archivists are facing the same problems librarians have been dealing with for years, perhaps even decades. As a larger percentage of library content becomes digital and licensing becomes a greater hurdle to maintaining “content” than it is to preservation, how do we keep information from disappearing into a digital black hole? How does the shift in formats translate into a shift in procedures and policies and expectations? Librarians get it: the content we steward is shifting from print to digital. Our libraries require more hard drive space in addition to more shelf space. Patrons need to know how to click and type as well as how to read. And, yet, what of posterity? How will our paths and trackings through the digital realm be accumulated, organized, even archived? This question becomes further complicated by the webby-ness of our online interactions and content production. Content is still being generated in static letter, essay, and book formats, but it's also arriving online, prelinked and connected. While the correspondence between Freud and Jung has been collected, trying to track and save the hyperlinkedness of blogs, comments, IMs, and emails is much more complex. As a blogger, I write and link to other things online, and it's become increasingly difficult to write essays without using hyperlinks. At the 2006 Society of American Archivists conference, I was pleasantly surprised by what I heard, though I became concerned for the future of preserving digital information. As archivist Thomas Lannon said, “This 'unfixedness' of blogging in its electric form is what gives the technology the power of immediacy but also its weakness in impermanence.”

Scope notes

Peter Lyman, professor at the School of Information Management and Systems (SIMS) at University of California–Berkeley, describes four issues in archiving digital media: cultural, legal, economic, and technical. Legal and economic issues are what they've always been. Lyman explains the cultural problem of archiving digital content as one of scope and immediacy. “All documents follow a life cycle from valuable to outdated, but then, perhaps, some become historically important,” he says. “[T]he web is not stored in attics; it just disappears. For this reason, conscious efforts at preservation are urgent. The hard questions are how much to save, what to save, and how to save it.” The Library of Congress (LC) comments in a Council on Library and Information Resources report about the ebb and flow of availability of web content: “The web is growing steadily and at the same time is continually disappearing. The average life of a web page is only 44 days; 44 percent of the web sites found in 1998 could not be found in 1999.” This cultural problem of ephemeral content ties in to the technical problem. “First, information must be continuously collected, since it is so ephemeral,” Lyman explains. “Second, information on the web is not discrete; it is linked. Consequently, the boundaries of the object to be preserved are ambiguous.”

What is a blog, really?

That boundaries problem is the crux of the trouble with blogs. We still talk about them as if they were discrete, as if all blogs had similar purposes and presentation. People say they like or dislike blogs, which is like saying you like or dislike the mail, or magazines. A blog is just a method of communicating, online. A blog is usually defined as a web page with rotating content—the newest material enters at the top of the page, and the older material shuttles off the bottom to the archives. Many blogs have multiple authors, and some group blogs have several thousand contributors. Most are created with a content management system, which is blogging software that provides an easy-to-use front-end interface to what is essentially a small relational database. What you are looking at when you read a blog is a query against that database, which may or may not generate static HTML pages.
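To make the "query against a database" idea concrete, here is a small sketch using Python's built-in sqlite3 module. The table layout, column names, and function names are assumptions chosen for illustration; real blogging software uses its own, more elaborate schema.

```python
import sqlite3


def build_blog(posts):
    """Load (posted_on, title, body) rows into an in-memory database.

    The single `posts` table is a hypothetical, simplified schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts ("
        "id INTEGER PRIMARY KEY, posted_on TEXT, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts (posted_on, title, body) VALUES (?, ?, ?)", posts
    )
    return conn


def front_page(conn, limit=2):
    """Newest posts first: what a reader sees at the top of the blog."""
    rows = conn.execute(
        "SELECT posted_on, title FROM posts ORDER BY posted_on DESC LIMIT ?",
        (limit,),
    )
    return rows.fetchall()


def archive_page(conn, limit=2):
    """Older posts that have shuttled off the front page."""
    rows = conn.execute(
        "SELECT posted_on, title FROM posts "
        "ORDER BY posted_on DESC LIMIT -1 OFFSET ?",
        (limit,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    blog = build_blog([
        ("2006-11-01", "First post", "..."),
        ("2006-11-13", "Third post", "..."),
        ("2006-11-05", "Second post", "..."),
    ])
    print(front_page(blog))
    print(archive_page(blog))
```

The same rows can be presented by date, by tag, or as a feed, which is part of why a single "copy" of a blog is hard to pin down.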

History via conversation

Much of our cultural history is now taking place online, on blogs. Making sense of our collective past, especially our recent past, means taking the digital narratives of events into account. Blogs are the diaries and letters of the present day. At the same time, they often refer to other blogs, news reports, digital images, videos, and sound recordings. Understanding a point in the past is made simpler with blogs because readers can easily find contemporaneous informal stories recounting the same events. Yet this creates a problem: figuring out how to store this linked set of web sites and media for later retrieval and/or reassembly.

The Cluetrain Manifesto: The End of Business as Usual (Perseus) by Rick Levine et al. asserts that markets are conversations. What is becoming more clear is that history is also a conversation. The truism that “history is written by the winners” is no longer quite as relevant in an age where barriers to publishing are low and benefits of participation are high.

I was first made aware of the value of blogs in reporting after an earthquake I experienced in Seattle in 2001. It was brief, and as soon as it was over, I jumped online to see what had happened elsewhere in the city. The local news outlets had nothing on their web sites yet. This was a significant event that literally had not happened yet online. I checked MetaFilter.com, a large group blog that I participate in, and people were reporting what had happened from all over the Pacific Northwest. As news outlets began to get reports online, links were included in the thread nearly in the order they arrived online. Many of those links, though now defunct, still provide interesting insight into how news of a small-scale disaster propagates.

Blog proliferation

The past few years have seen media outlets start their own blogs, further blurring the line between the informality of individual blogs and the authoritativeness of major media outlets. In “TalkLeft, Boing Boing, and Scrappleface: The Phenomenon of Weblogs and Their Impact on Library Technical Services,” Paul Moller and Nathan Rupp write, “Mainstream media outlets such as the New York Times have begun to maintain blogs, which often link to stories in other newspapers.... Other media outlets, from the Boulder, Colorado, Daily Camera to CNN, have begun employing blogs as a means of further connecting with their audience.” Community-blogged archives exist for other major events such as Hurricane Katrina and the Indian Ocean earthquake. The actual understanding of what happened on 9/11 changed minute to minute—reassembling the history of that day is impossible if you rely on traditionally published media that report facts after they have become understood and codified. This attempt to catch the sense of newsworthy events while they are happening is a boon for historians and a challenge for archivists. In the overview description of the MINERVA project's September 11th Web Archive, LC states, “With the growing role of the web as an influential medium, records of historic events could be considered incomplete without materials that were born digital and never printed on paper.” While we have grown accustomed to news and media outlets maintaining their own recent archives, most blogs live on servers completely at the whim of the owner of the site. The content seems permanent until, in Lyman's words, “it just disappears,” as has happened many times to community blog and social software sites. For instance, the hard drive crash of diary-x.com obliterated 120,000 online diaries and blogs overnight. When Couchsurfing.com experienced multiple database crashes with insufficient backups, 90,000 people lost their personal profiles, email contact lists, and trip diaries. 
Both sites were able to reconstruct some of their content and structure from RSS feeds, Google caches, and less-recent backups, but much of that digital content was never replaced.

Where do the bytes go?

The first question is not what but how to archive. A blog looks different on different days, but it may even look different to different readers. People viewing a blog via its RSS feed will miss design changes, just as people who read a blog sorted by tag or category may not grok how posts relate to one another on a time line. Firefox even allows us to set our own stylesheets and page presentation via an add-on called Greasemonkey, which means that the blog version that I see may be different from everyone else's version, my own idioblog. As we become more comfortable with bibliographic citations that include indicators such as “Retrieved on November 13, 2006,” how exactly do we retrieve that? How do we guarantee that we see what that person saw? The Internet Archive has been creating and maintaining an archive of web content since 1996. It has been trying to create “a three-dimensional index that allows browsing of web documents over multiple time periods.” The Wayback Machine is its archive, accessible online via a URL search. It has a massive server farm containing two petabytes of data, expanding at a rate of 20 terabytes per month. These bytes, if turned into printed text, would vastly exceed the contents of the Library of Congress. Even this project has limitations. Since the data is obtained by crawling—following hyperlinks from page to page and grabbing the text and images on each page—content that is generated dynamically can be difficult or impossible to archive. If you have your blog set up to generate pages on the fly (the default for WordPress systems), the Internet Archive may not be able to capture your blog at all. If archiving blogs becomes a priority, what has to shift to make this equation work? The PANDORA Project of the National Library of Australia also began archiving (and cataloging!) selected Australian web sites in 1996. 
More recently, it enlisted the help of the Internet Archive to “harvest” every site in the .au domain over a six-week period during June and July 2005. This netted “185 million unique documents...from 811,523 hosts, and 6.69 terabytes of raw data.” However, owing to the uncertain legal status of those collected documents, they are not currently being made available online. Do archivists need to find a way to improve on this, or are we already looking at best practices? The Internet Archive's process for harvesting web content is clearly becoming the standard method as more state library systems contract with it to store and preserve digital records.
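Crawling, as described above, means starting at one page and following its hyperlinks outward. The sketch below shows the general technique only; it is not the Internet Archive's or PANDORA's actual software. The `fetch` callable and the example.org pages are hypothetical stand-ins so that the example runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl from start_url.

    `fetch` is a hypothetical callable that returns a page's HTML as a
    string, or None if the page cannot be retrieved.
    """
    seen = {start_url}
    queue = deque([start_url])
    captured = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        captured[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return captured


if __name__ == "__main__":
    # A made-up site: the search page exists but nothing links to it.
    site = {
        "http://example.org/": '<a href="/post1">one</a> <a href="post2">two</a>',
        "http://example.org/post1": '<a href="/">home</a>',
        "http://example.org/post2": "no links here",
        "http://example.org/search?q=archives": "generated on request",
    }
    print(sorted(crawl("http://example.org/", site.get)))
```

The search page in the made-up site is never captured because no link points to it, which is the kind of gap the paragraph above attributes to dynamically generated content.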

“This matters”

Institutional culture is the wall to scale to provide open access and inroads to our cultural history. Small projects, as well as larger initiatives like the Internet Archive, are making a serious, positive dent in how we look at what we know. This is true both in the literal sense of how the content appears and in the more metaphorical sense of how it feels to us. Librarians and archivists are already surrounded by history and constantly share what we know with those who are searching. We have new opportunities to open the doors to our collections wider and to share the work we do as well as the thrill of discovery. The more we say “this matters,” and the more we say “we can do this,” the more we find people to help us. Archiving and making digital content available is going to become an even larger part of our jobs. Let's start with blogs.

Life Archive

Many libraries, including the Greenwich Public Library, CT, have oral history collections of residents, famous and not so famous. As the U.S. population ages, people are starting to wonder if what they're creating online will survive them. Libraries have always kept some kind of vertical file for local residents. The DeKalb Public Library, IL, has a file on author Richard Powers, which proved recently valuable when The Echo Maker won the National Book Award. Perhaps it's time for libraries to run their own blog aggregators, so that the next Richard Powers's juvenilia can be preserved for posterity. Open source aggregators exist, from Gregarius (PHP) to Planet (Python) to Plagger (Perl). Jon Udell, Microsoft technology evangelist and pioneer of LibraryLookup, has been thinking about this issue. “I have ventured into this confusing landscape because I think that the issues that libraries and academic publishers are wrestling with—persistent long-term storage, permanent URLs, reliable citation indexing and analysis—are ones that will matter to many businesses and individuals,” he writes. “As we project our corporate, professional, and personal identities onto the web, we'll start to see that the long-term stability of those projections is valuable and worth paying for.” Udell interviewed librarian Daniel Chudnov in a podcast and discussed possibilities. Chudnov went on to articulate a prototype vision of what a library project dedicated to archiving web logs could look like. This service, which is similar to the design of the journal archive tool LOCKSS (Lots of Copies Keep Stuff Safe), holds promise for keeping electronic content from falling into what Jessamyn West calls “a digital black hole.”—Jay Datema
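For a sense of what running a blog aggregator involves, the sketch below merges entries from several RSS 2.0 feeds into a single newest-first list. It is an illustration using only Python's standard library and does not show how Gregarius, Planet, or Plagger work internally. The function names and sample feed are assumptions, and the code assumes every item carries a pubDate with an explicit time zone.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_feed(rss_xml):
    """Extract title, link, and publication date from an RSS 2.0 string.

    Assumes each <item> has a <pubDate>; a missing one raises an error.
    """
    root = ET.fromstring(rss_xml)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return entries


def aggregate(feeds):
    """Merge several feeds, keep one entry per link, newest first."""
    merged = {}
    for rss_xml in feeds:
        for entry in parse_feed(rss_xml):
            merged.setdefault(entry["link"], entry)
    return sorted(merged.values(), key=lambda e: e["published"], reverse=True)


if __name__ == "__main__":
    sample = """<rss version="2.0"><channel><title>Local author</title>
    <item><title>Early draft</title><link>http://example.org/draft</link>
    <pubDate>Wed, 01 Nov 2006 09:00:00 +0000</pubDate></item>
    </channel></rss>"""
    for entry in aggregate([sample]):
        print(entry["published"].date(), entry["title"])
```

An aggregator that is meant to preserve content would also need to store each entry's full text somewhere durable, since the feed only shows what the blog is publishing at the moment.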

The New Vertical File

How do we follow individuals as they move into digital social spaces? Already, people from public spheres are venturing dramatically into the digital realm. A few examples:
  • The campaign blog of Barack Obama when he was making the move from Illinois state senator to being the junior U.S. Senator from Illinois. Only available now at the Internet Archive.
  • William Gibson's blog postings from the tour for his book Pattern Recognition.
  • Actress and comedian Rosie O'Donnell's Flickr photostream.
Historians of any of these personalities would have to take their digital ephemera into account, and yet we still don't quite have the tools—including perhaps the trained staff—to do this sort of thing well. As we think about “blogs as content,” what about “blogs as tools”? The basic functionality of most blogging software, such as commenting, next and back indicators, and categories and tagging, seems to be a natural fit for presenting information digitally in ways that encourage browsing and participation. A few recent digital content projects have used these tools to good effect. Beyond Brown Paper is a collection of digital photographs “document[ing] much of the history of the Brown Paper Company of Berlin, NH, from the late nineteenth century through the mid-1960s.” Images have been scanned, categorized, tagged using Flickr and uploaded into an online browsable archive using Scriblio running on top of WordPress. The project allows telephone comments, an innovative addition. Western Springs History archives recent and older photos and information about houses in Western Springs, IL. The site employs a clickable Google map of house locations and collects comments from visitors on each individual house's post. The site was originally built in Movable Type and moved to WordPress.—Jessamyn West

Author Information
Jessamyn West maintains the blog librarian.net and works as a community technologist in central Vermont.