CMU Receives NEH Grant To Develop Bibliographic Tools


[Image: rows of slightly different capital A's and O's. Caption: Examples of distinctive letters from CMU's Digital Library of Distinctive Type]

Before the First Amendment and the freedom of the press protections in place today, printers often chose not to attach their names to politically sensitive books and pamphlets for fear of persecution or punishment. This has proved a challenge for contemporary scholars who want to trace the origins of those works.

“Freedom and the Press before Freedom of the Press,” a digital humanities project based at Carnegie Mellon University (CMU), Pittsburgh, has received a $324,931 National Endowment for the Humanities (NEH) Digital Humanities Advancement grant to develop a set of digital tools to analyze type and paper used in late 17th- and 18th-century English-language works. The Digital Library of Distinctive Type and the Coloring Book Paper Analysis Tool will compile distinctive identifiers of 241 late 17th-century British printers, helping researchers, scholars, and students connect unattributed works to the printers who published them.

“There’s a surprisingly large ratio—something like a third to a half of the books that we have from this period, we don’t actually know who printed them,” project co-lead Matthew Lincoln, currently senior software engineer for text and data mining at JSTOR Labs, told LJ. (Lincoln was formerly at CMU; the CMU project is unaffiliated with JSTOR.) “We know who the author is, but the actual person who had the printing press, who was in charge of creating the paper copies that we have today, is unknown. This large swath of information is missing from how we think about these books.”

The project is led by Christopher Warren, associate professor of English and history and associate department head of English; Max G’Sell, assistant professor of statistics and data science; Samuel Lemley, curator of special collections at CMU Libraries; and Lincoln. Team members include CMU history PhD student Kari Thomas; Simmons University library and information science graduate student D.J. Schuldt, who has a PhD from CMU; longtime collaborators Assistant Professor Taylor Berg-Kirkpatrick and graduate student Nikolai Vogler, both at the University of California, San Diego (Vogler earned a master’s degree at CMU); and Kartik Goyal, a research assistant professor at Toyota Technological Institute at Chicago, who received his PhD from CMU.



Speaking out against the state or church in the late 17th and early 18th centuries carried a high degree of risk. Writers and printers who challenged the primacy of the monarchy, or raised questions about religious orthodoxy, were subject to prosecution for treason or libel. Printers obscured their origins in a number of ways, such as publishing anonymously or listing a different city from the one where content was printed.

But the pieces of movable type used at the time, cast from alloys of lead, antimony, and tin, were subject to damage and degradation over time, and this damage provides critical clues. Bibliographers tracing the identity of clandestinely printed texts have traditionally done the work by examining multiple copies of a book or pamphlet, painstakingly comparing the letter forms to find matching faults in individual letters that can then be associated with a particular printer.

“That kind of work is incredibly labor and time intensive,” noted Lemley. “Before digitization, it required that the researcher travel to multiple institutions and collections to look at copies of the books they were working on.” When scholar Charlton Hinman prepared the Norton facsimile of Shakespeare’s 1623 First Folio in the 1950s, he spent years sorting through copies looking for minutely damaged typographic characters, to determine the sequence in which copies were printed.

Bibliographic detective work “was a process of looking very, very closely at two books and comparing character to character across thousands of characters,” explained Lincoln. “The initial question was, is there some way that we could leverage computer vision—not to replace the human being in this process, but to help them radically cut down the number of characters and the number of pages that they have to look at?”



The first phase of the project involved National Science Foundation (NSF)–funded computational and statistical work to develop methods for analyzing large quantities of letter forms from high-resolution scans, sorting them into sets of anomalous characters within the same font face. A team from the Carnegie Mellon Language Technologies Institute refined the optical character recognition (OCR) methods needed. Scanning manually typeset documents involves a different process from scanning contemporary text, Lincoln explained: “They’re dealing with discolored pages, and ink spreading from heavily inked characters, or characters that don’t have enough ink and are blotchy and patchy.”

The scans come from CMU’s and other rare books collections worldwide, such as the British Library and the Folger Shakespeare Library. The grant includes funding to digitize more manuscripts and books, which will make a new body of work accessible to the general public as well as adding to the project’s corpus. Even existing scans created by Google as part of its digitization project 10 years ago are now outdated—the colors are inconsistent and the resolution is inadequate—and should be redone, Lemley told LJ. There are also funds for CMU to purchase manuscripts that the project team can use.

The project also serves to link analysis of physical texts with the increasing emphasis on digital material. “It’s doing traditional bibliographical work, but on computational scales and at computational speeds,” said Lemley. “The project is acknowledging that by working with libraries to get better and new kinds of scans done [people can] ask better and new kinds of questions of typographic and bibliographical evidence.”

The second phase compiled statistical models, assembling searchable clusters of damaged type pieces. For example, “the printer may have a couple of dozen lowercase A’s in their typeset. What is the average shape of an A in this particular typeface, for this particular font?” said Lincoln. “Once we’ve calculated that sort of ideal A character, then we go back through the individual character images to find which A’s are significantly different from this one”—worn or damaged characters hold ink differently from intact type. “That gives us a kind of fingerprint.”
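As a rough illustration of the idea Lincoln describes, the following Python sketch averages a set of same-letter glyph images into a template and flags the glyphs that sit unusually far from it. It is not the project's code: the function names, the pixel-wise mean, the Euclidean distance, and the z-score cutoff of 2.0 are all assumptions chosen for simplicity, and the team's actual statistical models may work quite differently.

```python
from statistics import mean, pstdev


def mean_glyph(glyphs):
    """Pixel-wise average of equally sized grayscale glyph images (lists of rows)."""
    rows, cols = len(glyphs[0]), len(glyphs[0][0])
    return [[mean(g[r][c] for g in glyphs) for c in range(cols)] for r in range(rows)]


def distance(glyph, template):
    """Euclidean distance between one glyph and the averaged template."""
    return sum(
        (p - t) ** 2
        for glyph_row, template_row in zip(glyph, template)
        for p, t in zip(glyph_row, template_row)
    ) ** 0.5


def distinctive_glyphs(glyphs, z_threshold=2.0):
    """Return indices of glyphs whose distance from the template is an outlier.

    The z-score cutoff is a hypothetical parameter, not a value from the project.
    """
    template = mean_glyph(glyphs)
    dists = [distance(g, template) for g in glyphs]
    mu, sigma = mean(dists), pstdev(dists)
    if sigma == 0:  # every glyph is equally far from the template
        return []
    return [i for i, d in enumerate(dists) if (d - mu) / sigma > z_threshold]


# Toy 3x3 "A" shapes: 1 = ink, 0 = blank paper.
intact = [[0, 1, 0], [1, 1, 1], [1, 0, 1]]
damaged = [[0, 1, 0], [1, 0, 1], [1, 0, 1]]  # crossbar has broken away

print(distinctive_glyphs([intact] * 9 + [damaged]))  # [9]
```

In this toy input, only the glyph with the broken crossbar stands out; real page scans would first need the glyphs segmented, aligned, and scaled to a common size.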

With the help of the NEH grant, the team will be able to further refine the accuracy of the OCR work by bringing in human eyes. Rare books librarians, English and history faculty, and graduate students will work with the subsets of characters that have been selected computationally and look for patterns among books and known printers.

“It’s a spectrum that goes from the computer doing an enormous amount of the work to the end where we’ve winnowed down the interesting [text],” said Lincoln. “That’s where we focus our human domain experts who can bring in that historical context, who can bring everything else that they know about the book that the computer doesn’t know.”



Similarly, the Coloring Book Paper Analysis Tool identifies paper color profiles via high-resolution sample scans, tracking minute differences in color and fiber density.

Books at the time were printed on sheets of paper that often came from different sources, places, or years. “It matters if we’re able to look at a particular book and say, ‘Okay, the first 15 sheets in this book were printed on one stock of paper, and the last 40 sheets were printed on another,’” said Lemley.

The tool can also be used to detect forgeries. In the early 20th century, Lemley explained, if a rare book was missing pages, an unscrupulous bookseller might insert those pages from another copy and sell it as a complete work, or make facsimiles on different paper from the period.

Users scan a text to get a histogram (a reading of color on each page), and the tool shows where differences appear throughout the book. “It’s something that’s indistinguishable to the naked eye,” said Lemley. “This is the advantage of using computer vision.” The tool doesn’t replace the value of looking at watermarks or chain lines (the grid of vertical lines formed by the paper mold), but that information is often invisible on scanned documents, so the tool adds another data point when identifying or dating paper.
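The per-page comparison can be pictured with a short Python sketch. This is an assumption-laden toy, not the Coloring Book tool itself: it reduces each page to a list of grayscale brightness values rather than full color, and the function names, the eight-bin histogram, and the 0.5 threshold are all hypothetical.

```python
def gray_histogram(pixels, bins=8):
    """Normalized histogram of grayscale values (0-255) for one scanned page."""
    if not pixels:
        raise ValueError("page has no pixels")
    counts = [0] * bins
    for value in pixels:
        counts[min(value * bins // 256, bins - 1)] += 1
    return [c / len(pixels) for c in counts]


def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms (0 = identical, 2 = disjoint)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def stock_changes(pages, threshold=0.5, bins=8):
    """Return each page index whose histogram differs sharply from the previous page."""
    hists = [gray_histogram(p, bins) for p in pages]
    return [
        i
        for i in range(1, len(hists))
        if histogram_distance(hists[i - 1], hists[i]) > threshold
    ]


# Four toy "pages" of 100 pixels each: two on lighter paper, two on darker paper.
pages = [[200] * 100, [205] * 100, [120] * 100, [125] * 100]
print(stock_changes(pages))  # [2]
```

Here the jump is reported at the third page (index 2), where the simulated paper stock changes. A real scan would have noisier histograms, so the threshold would need tuning against known examples.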



Ultimately, the project will result in a website with a curated set of distinctive type images that rare books librarians, bibliographers, historians, archivists, scholars, and students can consult, along with other bibliographical reference works. “At the end of the day, [we’ll] basically produce a compendium of distinctive type that regular non–computational experts, historians and bibliographers and special collections librarians, can use” for their own collections, said Lincoln.

The team will also share its tools; the OCR engine is open source, as is the sorting interface.

The end goal is a sustainable reference toolkit that will live on beyond the three-year grant period. “If you wanted to look up a particular printer, say, Robert Everingham, a printer in the 17th century, you could go to this resource, type in his name, and you’d get basically a grid of 100 letters that are distinctively damaged,” said Lemley. “Then you could search for those letters in whatever book it is you’re working with.”

Next year will mark the 400th anniversary of Shakespeare’s First Folio, and CMU Special Collections will be building an exhibition around its copy in partnership with the Frick Pittsburgh museum. The team, which will eventually use its tools to attribute the printing of the Third and Fourth Folios, hopes that the exhibition can feature its findings.

While the CMU team is studying printing in Restoration England, they hope that other digital humanists will use the tools to explore other regions and periods of printing history. These questions of provenance are not only about the printing industry itself, Lincoln noted, but also about the larger questions behind the texts and where they were circulated.

“We study a lot of these works by great authors, works that we know were very important historically, culturally, politically, and socially. How did those words get to be important? What were the mechanisms behind that? It’s never just one single person who is the seat of all the change that a great work can create,” he said. “What was their platform, what was the venue? Where were they allowed to speak? Where could they not disseminate those ideas? Where were those ideas being barred from the public? I think if there is a larger project here, it’s getting people to attend to kind of those material realities, and that history, just as much as they attend to the individual words that this person was writing.”

Lisa Peet

Lisa Peet is Executive Editor for Library Journal.
