Government Website Harvest Enlists Librarians, Educators, Students


End of Term Harvest at New York Academy of Medicine
Photo credit: Debbie Rabina

As the United States and the world prepare for the January 20, 2017 presidential inauguration, libraries, institutions, and citizens are joining forces to identify federal government websites to be captured and saved in the End of Term (EOT) Web Archive. The archive currently holds government web content from the administration changes of 2008 and 2012, and in July resumed collection efforts for EOT 2016 content.

Government document and subject experts have been joined by librarians, academics, political and social science researchers, educators and their students, and other volunteer nominators in semester-long efforts and all-day “nominatathons” to identify URLs that are then submitted for inclusion in the EOT Archive. Those that are in-scope and not duplicates are assigned a weighted score by project specialists and given a priority level for web crawling.

A collaboration between the Library of Congress (LC), California Digital Library (CDL), University of North Texas (UNT) Libraries, Internet Archive (IA), George Washington University Libraries, Stanford University Libraries, and the U.S. Government Publishing Office (GPO), the EOT Presidential Harvest 2016 preserves federal government websites (.gov, .mil, etc.) from the legislative, executive, and judicial branches of government. According to the EOT Harvest website, the archive is “intended to document federal agencies’ presence on the World Wide Web during the transition of Presidential administrations and to enhance the existing collections of the partner institutions.” The public access copy of the archive is kept at IA; LC holds a preservation copy, and an additional copy is held at UNT for data analysis.

COLLABORATIVE WEB PRESERVATION

The idea for the partnership was born at the International Internet Preservation Consortium (IIPC) meeting in Canberra, Australia, in summer 2008. “We had just found out that the National Archives was not going to do their dot-gov crawl that they had done in 2004,” explained LC digital library project manager Abbie Grotke. “A number of us sitting around the room at the IIPC meeting who were already collecting government material in one way or another at our own institutions said, ‘Well, let’s do this together collaboratively.’ And one of the big goals of that was to share a copy of the data among all the partners.”

The CDL, IA, LC, UNT, and GPO, all members of IIPC and partners in the National Digital Information Infrastructure and Preservation Program (NDIIPP), decided to join forces to document changes to government websites during the administration change from George W. Bush to Barack Obama. “Digital government information is considered at-risk, with an estimated life span of 44 days for a website,” noted NDIIPP director of program management Martha Anderson in a press release at the time. “This collection will provide an historical record of value to the American people.”

Several of the organizations were already active in preserving government web content. LC has preserved congressional websites on a monthly basis since December 2003. UNT Libraries, as part of the GPO’s Federal Depository Library Program (FDLP), created its CyberCemetery in 1997 to capture and provide access to the websites and publications of defunct U.S. government agencies and commissions. While organizations such as FDLP have focused on collecting, preserving, and providing access to printed publications, they do not have the infrastructure to archive digital material. Government document librarians across the country had for some time been aware of the need for an organized web collection effort.

“I’ve been really active for ten years at least trying to move the documents community towards collecting and preserving digital government information,” Stanford University U.S. government information librarian James R. Jacobs, one of the original participants, told LJ. Jacobs’s website, Free Government Information, has been supporting the preservation of digital government material for more than a dozen years.

HARVESTING HISTORY

Each partner contributed to aspects of the new project, from organization to application development. The nomination tool, a simple front-end interface designed to identify, prioritize, and describe the thousands of government web hosts, was built by UNT. The content collection was performed with the open-source Heritrix web crawler, developed by IA with support from IIPC. In order to aggregate EOT content, LC developed BagIt Library, an open-source Java tool for large-scale data transfer, as well as a desktop version, Bagger.

In August 2008, IA began a broad crawl of government sites, supplemented with crawls by the other project partners. The URLs were collected in December 2008, and again after the January 2009 inauguration. Final comprehensive crawls were performed in spring and fall 2009 to document any final changes. Ultimately, each partner transferred their collected content to a single consolidated archive. Metadata and thumbnail images were generated by IA’s in-house tools, with CDL providing input on Dublin Core format. Once the data transfer was complete, in mid-2010, a total of 15.9 terabytes of data had been collected.

In November 2011, the EOT Harvest resumed to document changes between Obama’s two terms, this time with the help of LIS students. After reading a post about the 2012 EOT Harvest on LC’s blog, Debbie Rabina, a professor at New York’s Pratt Institute School of Information, thought that it would be a good project for students in her Government Information Sources class. She contacted Grotke, and the two developed a plan for Rabina’s students to identify government social media sites as a semester-long project. The class used government directories, such as the U.S. Government Manual and the A-Z agency list available on USA.gov, to search each agency for social media accounts. They eventually nominated some 1,500 accounts found on Facebook, Twitter, YouTube, Flickr, Pinterest, GitHub, Foursquare, and others. In all, the 2012 EOT Harvest partners captured some 21 terabytes of data.

MOBILIZING IN 2016

Four years later, in the wake of the 2016 election, Rabina felt the need to further expand the harvesting efforts’ reach. “I was trying to think about what I could do as a librarian,” she told LJ. Drawing on her previous experience with the EOT Harvest, she said, “I thought this would be a good way for me to do something.”

Rabina reached out to Grotke and Jacobs, as well as librarians from local New York organizations. The New York Academy of Medicine (NYAM) was to host the 18th International Conference on Grey Literature (published material produced outside of commercial or academic publishing, which is often not easily accessible) on November 28–29, and Rabina proposed that it also host an EOT nominating session that week.

On December 1, ten volunteers gathered at the NYAM library to spend an afternoon identifying URLs. Rabina provided a handout with instructions for the nomination process. Although the process is relatively straightforward, certain kinds of content, such as PDFs or FTP (file transfer protocol), are not readily crawlable by Heritrix and need to be traced back to an originating http or https URL. Rabina’s handout also identified areas for participants to explore, mainly subdomains of science.gov, which may be most at risk of changing after the transition; when new reports are commissioned, for example, there is no requirement to save older ones.

“The law has different levels of requirements for preservation, and for maintaining versions, for different types of government information,” explained Rabina. “The ones that are afforded the most protections for preservation and retention are things from the legislative branch, like our laws and bills and budget. But the stuff that comes from the agencies… like EPA reports about the level of water toxins in New Jersey, a lot of that just isn’t retained and gets lost.” Especially with new agency heads in a new administration, she added, “their own vision can be anything from ‘I don’t believe in global warming’ to ‘I just want to update this website because it’s ugly, so let’s throw it all out.’”

In addition, noted Jacobs, “Government agencies change their content management systems all the time. You might have a link to a document or webpage that you like or want to use for your research or have an interest in, and that URL could change.”

THIS YEAR’S CROP

The 2016 harvest is projected to be the largest yet. “The first thing to stress is that it’s not coming out of any sort of paranoia about the new administration, necessarily,” Jacobs told LJ. “This is our third go-around. We’re basically focusing on the dot-gov, dot-mil internet domain, and trying to collect as much information from those domains as we can in order to put a marker there for every four years.”

Groups at other institutions, including Simmons College and Brandeis University in Massachusetts and the University of Toronto, have also expressed interest in convening nominatathons. Rabina has put together a new handout for interested participants, with an eye to the Bureau of Labor Statistics, OSHA, NASA, and the Transportation Research Board. “We’re focusing on areas where we feel that there will be leaders who will bring another vision,” Rabina told LJ. “All of the science and the environment stuff. As far as I’m concerned, Department of Defense is kind of my last priority to get to.”

The EOT Archive has asked for particular assistance in identifying Judicial Branch websites; important content or subdomains on very large websites, such as NASA.gov, that might be related to current presidential policies; and government content on non-government domains, such as .com or .edu.

The nomination tool allows users to submit URLs for inclusion and lets the EOT Archive filter out duplicates. In-scope web pages include those of federal government websites and social media accounts, particularly those that may change significantly or disappear during the transition. Local or state government websites, and any non-government sites including news sites and those documenting the U.S. elections, are out of the EOT Harvest scope. Each URL submitted is assigned a weighted score by project specialists, according to whether it is in or out of scope and its priority for crawling.

Volunteers are asked to submit some simple metadata, including the nominated site’s title, agency, branch, and comments. While these are not required, they help identify resources for future reference. A bookmarklet is available for Firefox, Google Chrome, and Internet Explorer on the EOT nomination tool website. Nominators can also submit via a simple Google form.

The Internet Archive will perform a comprehensive crawl across the entire .gov domain, supplemented by in-depth crawls by partners and volunteers based on the submitted lists of URLs. “We also ramp up our own collecting of government websites during this time to share with the project,” noted LC’s Grotke. “We’re already collecting house and senate sites, legislative branch content, some executive branch content…. We’re getting a little bit more in-depth coverage before and after inauguration day.”
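
For readers curious how a first-pass screen of nominated URLs might look in practice, here is a minimal Python sketch. It is not the EOT nomination tool’s actual code, which this article does not describe; the function names, the suffix list, and the normalization rules are all assumptions made for illustration. It keeps only http or https URLs on .gov or .mil hosts and drops repeats that differ only by scheme, letter case of the host, a leading “www.”, or a trailing slash.

```python
from urllib.parse import urlsplit

# Assumption: host suffixes treated as likely federal, based on the
# ".gov, .mil" domains named in the article. Not an official list.
FEDERAL_SUFFIXES = (".gov", ".mil")


def is_candidate(url: str) -> bool:
    """Return True for http/https URLs whose host ends in .gov or .mil."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    return parts.scheme in ("http", "https") and host.endswith(FEDERAL_SUFFIXES)


def normalize(url: str) -> str:
    """Build a comparison key: host without 'www.', plus path and query."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def dedupe_nominations(urls):
    """Keep the first occurrence of each candidate URL, in input order."""
    seen = set()
    kept = []
    for url in urls:
        if not is_candidate(url):
            continue  # e.g. ftp:// links or non-.gov/.mil hosts
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            kept.append(url.strip())
    return kept


if __name__ == "__main__":
    nominations = [
        "https://www.epa.gov/",
        "http://epa.gov",
        "https://nepis.epa.gov/",
        "ftp://ftp.example.gov/reports",
        "https://www.nytimes.com/politics",
    ]
    for url in dedupe_nominations(nominations):
        print(url)
```

A suffix check like this is only a rough filter. State sites that also use .gov would pass it, and federal content hosted on .com or .edu domains would fail it, so both would still need review by a person, which is the role the article assigns to project specialists.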

NOMINATIONS WELCOME

Anyone interested in helping nominate websites for collection can email the EOT team at uc3@ucop.edu, or consult the EOT Web Archive site for more information.

“We would welcome any nominations of federal government websites,” said Grotke. “Nominate sites you feel are important or most at risk of disappearing or changing. We recommend including both top level (e.g. epa.gov) as well as subdomains (nepis.epa.gov). You might want to pick a topic to focus on, but we’re happy to accept any and all nominations you come up with. One way you could do this is to do searches for topic(s) of interest and include the .gov search parameter (‘environment site:*.gov’). That will only search .gov domain for that keyword and you’ll quickly find the government sites of interest to you. Don’t worry about whether your nominated site has already been nominated. We’ll de-duplicate our list of seeds.”

“This time around we’re really excited by all the community engagement like [Rabina’s] events she’s holding in New York...and also there have been these self-organizing groups,” Grotke told LJ. “They’re just sort of emerging with communities...that are concerned about the subject matter or just interested in the project and nominating websites.”