Adam Matthew Enables Full-Text Search of Handwritten Manuscripts

Adam Matthew Digital last month announced the launch of Handwritten Text Recognition (HTR), an artificial intelligence (AI) technology that enables full-text searching of digitized, handwritten manuscript collections.
“It continues to return really remarkable results on even poor quality handwriting,” Glyn Porritt, head of Technical for Adam Matthew, an independent subsidiary of SAGE, told LJ. “We have undertaken research on samples of material and our estimates are an equivalence of 90 percent accuracy.”
Handwriting recognition technology has been used for more than two decades for purposes such as signature verification at banks and mail sorting at post offices. However, the handwritten address interpretation systems used to sort mail can narrow and validate results by relying on zip code directories and databases of known addresses. Modern note-taking apps on tablets and smartphones use machine learning to adapt to the handwriting of a device’s most frequent user. HTR has neither of these advantages: the system is working with collections of source material written in a variety of scripts, with no preexisting databases or transcriptions to cross-check against.
Adam Matthew has been investigating the possibility of using HTR for handwritten primary source collections for a number of years, Porritt said. “However, it is in the last few years that they have seen significant progress in the development of AI technologies in this area.”
Standard Optical Character Recognition (OCR) software can fail to decipher text even in printed documents if they have uncommon typefaces, unusual spacing, stains or water damage, or fading. Even with manuscripts written in a legible, consistent hand, HTR must contend with some combination of these issues.
“Over the years, we have certainly faced a variety of challenges regarding age and quality of typesetting having an impact on the quality of full text OCR search results,” Porritt said. “We have always invested in high quality scans, cleaned up text where necessary, and in recent years we have found software solutions to 18th century fonts and Gothic texts…. Handwriting takes this challenge to a whole new level. This is especially the case in our circumstance of working with very large volumes of manuscript material in multiple hands.”
Porritt said that Adam Matthew “had researched the prospect of providing additional [OCR] training for a certain style of handwriting or support from keyed transcriptions. However, this technology delivers search results without such additional requirements, and as a result has dramatically broken down barriers to deliver HTR for large primary source collections.”
HTR uses neural networks trained to recognize a wide variety of handwritten characters in their linguistic context, Porritt said. The system does not, however, generate transcripts of the source documents. Instead, search results are supported by algorithms that assess the probability that characters on the page match the words in a user’s search. Results are displayed as snippets from the manuscript; selecting a snippet takes the user to the manuscript page where the match appears.
Adam Matthew launched HTR last month with Colonial America, Module III: The American Revolution, which includes “intercepted letters between colonists, the military correspondence of the British commanders in the field, as well as two copies of the ‘Dunlap’ edition of the Declaration of Independence printed on the night of the 4th–5th July 1776,” Porritt said.
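For readers curious how a search can work without a transcript, the probability-based matching Porritt describes can be illustrated with a toy sketch. This is not Adam Matthew's implementation, which the company has not detailed; it simply assumes that a recognizer outputs, for each character position on a line, a set of candidate characters with probabilities. The function name `keyword_score`, the `floor` parameter, and the sample data are all hypothetical.

```python
import math

def keyword_score(char_probs, query, floor=1e-6):
    """Return (best_probability, offset) for `query` within one line.

    `char_probs` is a hypothetical recognizer output: one dict per
    character position, mapping candidate characters to probabilities.
    Characters the recognizer did not propose get the small `floor`
    probability so the logarithm stays defined.
    """
    query = query.lower()
    best = (0.0, -1)
    # Slide the query across every possible starting position.
    for start in range(len(char_probs) - len(query) + 1):
        log_p = 0.0
        for i, ch in enumerate(query):
            p_char = char_probs[start + i].get(ch, 0.0)
            log_p += math.log(max(p_char, floor))
        p = math.exp(log_p)
        if p > best[0]:
            best = (p, start)
    return best

# Toy line of three positions with an ambiguous reading at each one.
line = [
    {"t": 0.9, "f": 0.1},
    {"e": 0.6, "c": 0.4},
    {"a": 0.8, "o": 0.2},
]
score, offset = keyword_score(line, "tea")
```

In this sketch, a search for "tea" scores the product of the three matching probabilities, and a less likely reading of the same strokes still receives a small, nonzero score. A search system built this way could rank snippets by score and show only those above a chosen threshold, which is one plausible reason no single fixed transcript is needed.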
The complete Colonial America collection, once all five modules are released, will consist of over 750,000 pages and 160 million words of original correspondence between the British government and the governments of the American colonies, 1606–1822 (CO 5 series from The National Archives, UK), making HTR a vital tool for navigating this content. “Manuscript volumes rarely have indexes,” Porritt noted. “Keywords and metadata have traditionally brought the researcher towards the relevant document but they then have to find pertinent areas of that work themselves. With HTR technology, the user can be taken straight to a highlighted word or words.”
The team at Adam Matthew has also begun experimenting with automated keyword lookups to flag the frequency of different terms used in the collection. “We think this is just the start of opening up a range of data mining opportunities that will continue to increase in the future as we continue to develop the great potential of this technology,” Porritt said. “There is no doubt that it has an exciting future.”
In addition to Colonial America, Florence Nightingale correspondence in Adam Matthew’s Medical Services and Warfare collection is now HTR searchable, and the company is in the process of indexing over one million pages of content for its East India Company collection, which is scheduled for release in early 2018.
“Given the enthusiastic response to the HTR functionality in Colonial America, we are keen to follow this up and further enhance the research opportunities of additional manuscript content,” Porritt said. “We will be reviewing suitable content for future collections as a priority during the rest of 2018. The Mass Observation archive, for example, is one we have noted would see great benefits from HTR searchability.”
