Next Gen OCR Project Reaches Back into Early English History (and Databases)

OCR works great for paperbacks—but what about 15th Century texts set by hand?

Texas A&M’s Early Modern OCR Project (eMOP), which trains software to recognize texts from the 15th through 17th centuries, has received two significant boosts in recent weeks. At the end of September, the project received a two-year, $734,000 grant from the Andrew W. Mellon Foundation. And, earlier this month, ProQuest began providing the project with access to its Early English Books Online and Early European Books databases.

The eMOP project will be using page images from these and other sources, including Eighteenth Century Collections Online, from Gale Cengage, to build a database of early typefaces used in English books and documents, and then train optical character recognition (OCR) software to read these documents. According to the current project timeline, the tools needed to convert most English texts from this historical period into fully searchable digital documents should be available worldwide within the next two years.

The project is a response to what many scholars view as an emerging problem for books published during this period.

“Collectively, the US, the UK, and scholars around the globe face a problem: rare books and pamphlets from the early modern era which have not yet been made available digitally threaten to become invisible to future scholars,” Laura Mandell, director of Texas A&M’s Initiative for Digital Humanities, Media, and Culture and primary investigator for eMOP, wrote in a grant proposal to the Mellon Foundation.

The problem, she told LJ, is that digital preservation efforts have often focused on preserving an image of a document and including basic metadata such as the title, subject, author, date, and publisher to make it discoverable.

With these documents, many of which were digitized from microfilm copies, “there’s a very slim amount of metadata that’s available,” said Mary Sauer-Games, ProQuest’s vice-president of scientific, technical, medical (STM) and humanities publishing. “There’s the original cataloging records, which have been enhanced with the English Short Title Catalogue. And our catalogers have done some additional coding, but you still don’t have the full text that you can search.”

Sea of Searchable Documents

Meanwhile, the volume of all types of fully searchable digital texts continues growing at an exponential pace. Documents from the early era of printing now risk getting lost in the shuffle.

“People could imagine that they’ve preserved something if they’ve just imaged it,” Mandell said. “And if it’s preserved in that way, and if researchers … think that getting online and searching is good enough, there are things that will never be found again.”

The medial s, which resembles a cursive “f,” is one of the quirks of early printing that now causes problems for standard OCR engines.

Discoverability isn’t the only problem. The less complete the collection from this period, the greater the possibility that it will lead to incorrect assumptions and mistakes by future digital researchers.

Mandell pointed to the Google Books Ngram Viewer for a simple example. Search for the word “presumption” between 1700 and 1900, and a user will find that usage peaked in the years following the publication of Mary Shelley’s Frankenstein. Based on that data, one might assume that this increase was spurred at least partly by the popularity of a novel about a scientist who presumed to create life.

This would be a tenuous hypothesis at best. During the late 1700s and early 1800s, printing presses around the world were eliminating the use of the medial, or long “s.” Search for terms like “prefumption” or “curiofity,” and you’ll find an overlapping decrease in the use of these words.

In a case like this, “what you get is a map of the history of typography, not a history of what’s going on culturally. That’s the problem,” Mandell said.
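To make the typographic effect concrete, here is a minimal Python sketch of how OCR tokens in which the medial s was read as “f” might be folded back into their modern spellings before counting. It is an illustration only, not eMOP’s or Google’s method. The tiny word list and the function names (LEXICON, long_s_candidates, normalize, merge_counts) are hypothetical, and a real project would need a full period lexicon.

```python
from itertools import product

# Hypothetical mini-lexicon of modern spellings (assumption for illustration);
# a real pipeline would use a full word list.
LEXICON = {"presumption", "curiosity", "fashion", "offer"}


def long_s_candidates(token):
    """Yield spellings obtained by reading some non-final 'f' characters as 's'."""
    # The long s was not used in word-final position, so the last letter is left alone.
    positions = [i for i, ch in enumerate(token[:-1]) if ch == "f"]
    for choice in product((False, True), repeat=len(positions)):
        chars = list(token)
        for pos, swap in zip(positions, choice):
            if swap:
                chars[pos] = "s"
        yield "".join(chars)


def normalize(token, lexicon=LEXICON):
    """Return a lexicon word for an OCR token, or the token unchanged if none matches."""
    if token in lexicon:
        return token  # genuine words such as "offer" are never rewritten
    for candidate in long_s_candidates(token):
        if candidate in lexicon:
            return candidate
    return token


def merge_counts(counts, lexicon=LEXICON):
    """Combine counts of long-s misreadings with counts of the modern spelling."""
    merged = {}
    for token, n in counts.items():
        key = normalize(token, lexicon)
        merged[key] = merged.get(key, 0) + n
    return merged
```

Called on a pair of made-up counts such as {"prefumption": 3, "presumption": 2}, merge_counts returns a single entry for “presumption” with the two counts added together. Checking the lexicon first is a deliberate choice: it keeps real words that contain an “f” from being altered.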

Printer Problems

The medial s is also one of the peculiarities of early modern printing that causes problems with current OCR technology. Automated digitizing of works from this era has always posed a challenge, since OCR engines have a difficult time recognizing and adjusting for the typefaces, irregular spacing, and other quirks from early printing presses.

“The fonts are so old and the typefaces used are so old, and the methodology with which they were printing these books was so different,” noted Sauer-Games. “We’ve run some OCR engines over them, and you just get garbage. It’s nothing you can really read.”

Printing from this era was very inconsistent, Sauer-Games explained.

“You think about the original printing presses—they weren’t exactly precise,” she said. “Somebody was actually sitting there, placing letters across the printer. There might be gaps that weren’t supposed to be there, or maybe the ink wasn’t evenly applied. And the books, over time, the ink may have faded. You’ve got large fonts mixed with smaller font sizes. And you think about all of these small, individual presses that were creating their own typefaces to print these books. It wasn’t consistent like it is today.”

Nothing troubles a computer program more than pervasive inconsistencies, but automated OCR will be crucial if scholars hope to make most of the documents from this era full-text searchable. Between Early English Books Online and Early European Books from ProQuest, and Eighteenth Century Collections Online from Gale Cengage, there are a combined 307,000 books and other documents from 1473 to 1800, totaling 45 million pages, Mandell said.

Projects like the Text Creation Partnership, led by the University of Michigan Libraries, have already made an admirable dent in this corpus by manually keying in the entirety of more than 40,000 texts. But that process is time-consuming and can be expensive. eMOP is hoping that a combination of specialized OCR software and crowdsourced corrections will offer a faster way to preserve a more comprehensive collection.

In-House Expertise

The eMOP project brings together experts from Texas A&M’s English literature department, its computer science department, its Initiative for Digital Humanities, Media, and Culture, and from groups including IMPACT at KB National Library of the Netherlands, The Software Environment for the Advancement of Scholarly Research, Performant Software Solutions, and PRImA [Pattern Recognition and Image Analysis].

Librarians at Texas A&M’s Cushing Memorial Library teach an annual intensive course on early printing techniques and are assisting with eMOP.

And for expertise on early printing techniques and typography, they certainly didn’t need to look far. Texas A&M’s Cushing Memorial Library and Archives hosts the annual Book History Workshop, an intensive five-day course where students learn how printing was done in the hand-press era, even spending lab time learning how to cast type using early methods.

“In addition to setting a lot of type and correcting it, and putting it on the bed of a common press and printing, we do an introduction to how typography worked,” said Todd Samuelson, director of the workshop, printer in residence, and curator of Rare Books and Manuscripts at the Cushing Memorial Library. Attendees “cast type from a matrix in a hand mould using 500 degree lead alloy and see how that process works… When Laura [Mandell] came over to look at our collections and saw how these questions of historical typography were an ongoing project for us, she thought we could have a really productive collaboration.”

Samuelson is now working with post-doctoral research fellow Jacob Heil to put together a database of common typefaces for the project.

“We’re trying to determine some of the most common, most significant type-faces in English typography from 1476 to 1820, and then [Mandell]’s team is trying to teach the OCR engines to read them. We’re going to see what kind of tolerances we can give [the system] in different type families.”

As a rare books librarian, Samuelson acknowledged that research is trending toward the convenience and accessibility of digitized, searchable text. An increase in searchable texts from the period will also have exciting implications for several fields in the humanities, including linguistics. But preservation remains important. For now, there are some things that full-text digital copies simply can’t communicate.

“There are types of evidence that can’t be translated through digital simulacra. The material object does have ways in which it signals to us what a book is about,” Samuelson said. For example, “a 17th century book of poetry, if it were published in a small, duodecimo volume, might signal something to its readers, whether it’s devotional or intimate in some way—the same way that you can look at a paperback and get a sense of the genre … These are the sorts of tangible qualities of the book that, at least in the near future, won’t be available in digital formats.”
