We have a problem that only human eyes can solve. Yours can help.
Here’s some background. In Discovery, Technology and Publishing, we use optical character recognition (OCR) software to extract text from document images in order to make them machine-readable and searchable. In simple terms, the OCR process works through a bit of binary “yes/no” logic – either ink exists in a given place, or it doesn’t. No matter what kind of image you put into the software (color, grayscale, whatever), the application first creates a temporary black-and-white version. That is the version to which the “yes/no” operation is applied – the resulting pixel patterns in the image are compared against “known character” patterns. Different software packages use different logic, but in the end all those “known characters” get put together and output to a text file – or something similar.
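To make that “yes/no” step concrete, here’s a minimal sketch of the binarization idea in Python. The tiny 4x4 grid of grayscale values and the fixed cutoff of 128 are illustrative assumptions – real OCR software works on full scanned pages and usually picks its threshold adaptively:

```python
THRESHOLD = 128  # assumed fixed cutoff; real OCR engines choose this adaptively

def binarize(gray_image):
    """Convert grayscale pixel values (0 = black, 255 = white) into a
    binary yes/no grid: 1 means 'ink is present here', 0 means 'background'."""
    return [[1 if pixel < THRESHOLD else 0 for pixel in row]
            for row in gray_image]

# A hypothetical 4x4 patch of a scanned page.
image = [
    [250, 240,  30, 245],
    [248,  20,  25, 240],
    [ 30, 240, 245,  35],
    [245, 250, 240, 248],
]

print(binarize(image))
```

The pattern of 1s that comes out is what gets compared against the software’s library of known character shapes – and you can see how a smudge or a faded page would flip some of those pixels the wrong way.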
In the past we’ve done a variety of things with these files – from loading the pure text content into searchable database fields (as in a previous implementation of our America at War collection), to embedding the text within image files (the Student Research portions of the UR Scholarship Repository), to applying extensive XML markup to historical documents, enabling customized searching and manipulation of information (see our site focused on the published Proceedings of the Virginia Secession Convention). For folks who are dedicated to going paperless, there are plenty of OCR applications available for mobile devices, too.
OCR is a great tool, but the technology has limitations. Depending on the printing process that created an original document, a capital S might look a bit like the numeral 5 as a result of artifacts on the paper, a smudge of ink, or damaged type. The type of original materials we’re working with makes a difference, too: the high-resolution camera we use to digitize rare materials at Boatwright results in fantastic images, but the best camera on the planet can’t change the fact that microfilm is, well, microfilm. It’s a great format for preserving content, but a lousy medium from which to digitize. Occasionally, microfilm is all we have to work from.
Take our Collegian collection, for example. As part of UR’s 175th anniversary about 10 years ago, the full run of the student newspaper, the Collegian, was digitized. Most of these issues existed only on seldom-used reels of microfilm rather than paper, and, as a result of the age of the papers when they were initially microfilmed, many of the resulting images were not ideal for OCR purposes. The software knew that there were characters in the images provided, but oftentimes the resulting text was way off base. If you’ve ever tried to identify long-passed family members in old, faded photographs, you have an understanding of what the OCR software is going through: you know that the person you’re looking for is there – recognizing them among the crowd is the issue. Take that one step further by attempting to identify every individual, and you’ll have an idea of the computational difficulty that the OCR process can sometimes face.
Fast-forward to 2014, and our Collegian collection is still online – in fact, among our digital collections, the Collegian regularly receives the highest volume of traffic. The difficulty with OCR remains, though we’ve recently incorporated a mechanism that allows users to correct the text output of the OCR process. The changes made to the underlying text files are reindexed and searchable immediately upon saving – talk about instant gratification.
So if you’re someone who is interested in the history of the University of Richmond from the students’ perspective, I invite you to contribute a little bit of time to enhance this collection. Simply click the image below, then the “Register” link at the top of the collection home page to get started.