Optical Character Recognition, or OCR, is the electronic conversion of visual text (typed, handwritten or machine-printed) into “machine-encoded” text. The intent of this conversion is to leverage the resulting information using other computer facilities such as machine translation, full-text searching, text-to-speech conversion, automatic document classification and metadata tagging, data mining and myriad other capabilities. I believe OCR to be one of the earliest forms of an artificial intelligence application. OCR was a ‘game-changer’ in computer technology: it allowed businesses to convert their office documents and records into searchable text, to digitize paper for publishing on the web, and to give persons with sight or cognitive disabilities access to a rich assortment of technical and literary materials previously unavailable to them.
In this article, I will try to take some of the mystery out of OCR technology, offer some insight into how it can be used, and share a few tips on how to tap its power while avoiding some of the technical and implementation pitfalls. Books can be (and have been) written on these topics, and this article is not the forum for such a detailed discussion. However, feel free to contact us at CDI or check out our blog, website or Facebook page for more information, or if you have questions!
1. New Technology? Nope! – OCR technology has been around and constantly evolving for over 80 years. In 1950, David Shepard built and patented the first machine to perform OCR. Shepard was a cryptanalyst for the agency that later became the National Security Agency (NSA), and he subsequently founded the Intelligent Machines Research (IMR) company. Since then, OCR has been taken up by the likes of IRIS, ABBYY and many others, bringing a diverse set of technologies and solutions to bear on paper-based text.
2. How it works (in its simplest form) – OCR combines machine vision, pattern recognition and artificial intelligence in a comprehensive analysis of written text displayed within an image file. To begin, the documents to be processed need to be converted to digital images. Typically this is done by scanning the document to produce an image file; a picture, if you will. In essence, you start with a digital image of your hardcopy source document and run an OCR process (software) against it, which outputs text to a file of some kind. Once converted to machine text, you can do virtually anything with it, such as post it to the web, produce speech as in audio books, and search very large documents or groups of documents for key words. The software itself ‘looks’ at each character and compares that character against the known characters in its database. Typically the candidate characters are ranked by probability, depending on how confident the software is about what it thinks it sees, and the software then assigns a value to that character. Hence, the image of a C becomes the letter C in machine text. As stated, this is a very simplistic description of what occurs, and each software package implements and analyzes differently, but the basics are the same and the goal is always the highest possible confidence in the accuracy of the result.
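To make the "compare and rank" idea concrete, here is a minimal sketch in Python. It is a toy of my own making, not how any commercial engine works: the 3x3 bitmaps, the TEMPLATES table and the function names are all assumptions made for illustration, and real engines use far richer features than pixel agreement.

```python
# Toy "database" of known characters as 3x3 bitmaps ('#' = ink, '.' = blank).
# These templates are hypothetical; real engines store much richer models.
TEMPLATES = {
    "C": ("###",
          "#..",
          "###"),
    "L": ("#..",
          "#..",
          "###"),
    "T": ("###",
          ".#.",
          ".#."),
}


def similarity(glyph, template):
    """Return the fraction of pixels that agree between two equal-sized bitmaps."""
    pairs = [(g, t)
             for glyph_row, template_row in zip(glyph, template)
             for g, t in zip(glyph_row, template_row)]
    return sum(g == t for g, t in pairs) / len(pairs)


def rank_candidates(glyph):
    """Score the glyph against every known character, most confident first."""
    scores = [(char, similarity(glyph, tpl)) for char, tpl in TEMPLATES.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def recognize(glyph):
    """Return the best (character, confidence) pair for the glyph."""
    return rank_candidates(glyph)[0]


# A slightly noisy scan of a "C": one pixel in the bottom row dropped out.
noisy_c = ("###",
           "#..",
           "##.")

if __name__ == "__main__":
    for char, confidence in rank_candidates(noisy_c):
        print(f"{char}: {confidence:.2f}")
```

Even with a damaged pixel, the "C" template still agrees with the scan on 8 of 9 pixels and outranks the other candidates, which is the sense in which the image of a C becomes the letter C.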
3. Accuracy – The first thing to say here is that currently there is NO software/hardware combination that produces 100% accurate text from an image. Flawless OCR results are frequently approached but rarely attained. This is probably one of the biggest misconceptions users have about the technology before buying into it, though not one that persists after its first use. Uninitiated end users of OCR technology can be disappointed in the results they get. But let’s face it: after decades of research and development by some really smart folks, the best we can hope for currently ranges from 78-98% accuracy when converting text from image. Attaining one hundred percent accuracy requires an iterative human component to be part of the quality process.
Image quality is the Number 1 factor impacting the recognition accuracy of OCR software. Think of it this way: if you have trouble reading the text on the screen after scanning, the software will be less likely to translate it correctly to text. The closer your page is to pristine text with good contrast and low noise, the higher your recognition rates will be. If the majority of your collection is copies of faxes, you’re likely wasting your time trying to OCR it.
4. Handwriting: the bane of ICR! – For the folks new to this, ICR stands for Intelligent Character Recognition and generally refers to the conversion of handwriting to machine text. Everything in section 3 above refers to typewritten text and does not apply to handwriting. Converting handwriting to machine text is a tough nut to crack, and although there have been great strides in recognizing handwriting, the success rates are generally atrocious and can vary wildly even within the same collection.
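If you want to put a number on accuracy for your own collection, one common approach is to hand-correct a sample page and compare it to the OCR output character by character. The sketch below assumes that approach; the function names and the sample strings are mine, and vendors may well calculate their published accuracy figures differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions,
    deletions or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def character_accuracy(truth, ocr_text):
    """Return 1 minus the character error rate, floored at zero."""
    if not truth:
        raise ValueError("the hand-corrected text must not be empty")
    errors = edit_distance(truth, ocr_text)
    return max(0.0, 1 - errors / len(truth))


if __name__ == "__main__":
    truth = "The quick brown fox"   # hand-corrected sample (made up)
    ocr = "Tbe qu1ck brown f0x"     # hypothetical OCR output
    print(f"{character_accuracy(truth, ocr):.1%}")
```

It also helps to turn the percentages into error counts. Assuming a page of roughly 2,000 characters, 98% accuracy still leaves about 40 wrong characters on that page, and 78% leaves about 440.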
5. So, why bother? – You may be asking yourself “If OCR is not completely reliable and can be so much work, why would I want to do it?” This is probably the BEST question you can ask before you commit to the project. As I said from the beginning, having machine text of your documents is potentially a very powerful tool that can reduce workload and improve efficiencies and save or make money for your organization. Let’s look at the potential positive aspects:
a. Full-text Search Capability – The ability to search across the body of your content can add precision to your searches. It is important to note that full-text search alone is not very helpful in finding what you are looking for, for two reasons: errors introduced into the text during OCR, and the potential that your search string is pervasive throughout the document collection. While great for finding formal names or words that have few occurrences, FTS works best when used in conjunction with a fielded search using metadata attributed at the document level.
b. Text-to-speech, language translation – The obvious accessibility benefits of these two capabilities open a whole world of information and education to multitudes of people. Enough said…
c. Automatic Metadata tagging and classification – When you have a large collection of documents that require indexing, this technology can pay for itself fairly rapidly and shorten the lead time before your content becomes accessible. Essentially, this software uses the OCR’d textual content to attribute metadata (index information) at the document level and/or uses the content to classify the document into a user-defined hierarchy (think folder structure).
d. Data mining/analysis – Most integrators tout search and retrieval as the premier benefit of OCR. In many instances, I would disagree. We are being buried in information to the point where we can no longer make sense of what is relevant and what isn’t. Nor can we develop interrelationships between documents in very large collections. Many of my peers neglect to tout the analytic capabilities that become available with textual content. Using a bit of imagination, I am sure you could find such a use in your organization. This technology is successfully being used by law enforcement and intelligence agencies to find evidence and to identify relationships between people, places and things. While an excellent tool in the battle against terrorism and white-collar crime, it can be used in many commercial industries such as banking and finance, service organizations, local governments, etc.
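To illustrate the point in item (a) about pairing full-text search with fielded search, here is a small Python sketch. Everything in it is hypothetical: the three sample documents, the metadata field names (doc_type, year) and the helper functions are assumptions for illustration, not any product's API.

```python
from collections import defaultdict

# Hypothetical collection: OCR'd text plus document-level metadata.
DOCUMENTS = [
    {"id": 1, "doc_type": "invoice", "year": 2011,
     "text": "Invoice for Acme Corp. payment due"},
    {"id": 2, "doc_type": "memo", "year": 2011,
     "text": "Memo regarding Acme Corp. payment dispute"},
    {"id": 3, "doc_type": "invoice", "year": 2012,
     "text": "Invoice for Globex payment received"},
]


def build_index(documents):
    """Build a simple inverted index: word -> set of document ids."""
    index = defaultdict(set)
    for doc in documents:
        for word in doc["text"].lower().split():
            index[word.strip(".,;:")].add(doc["id"])
    return index


def search(index, documents, term, **fields):
    """Full-text lookup of one term, narrowed by exact metadata matches."""
    hits = index.get(term.lower(), set())
    return sorted(
        doc["id"] for doc in documents
        if doc["id"] in hits
        and all(doc.get(name) == value for name, value in fields.items())
    )


if __name__ == "__main__":
    index = build_index(DOCUMENTS)
    print(search(index, DOCUMENTS, "payment"))
    print(search(index, DOCUMENTS, "payment", doc_type="invoice", year=2011))
```

On its own, "payment" matches every document in this tiny collection, which is the "pervasive search string" problem. Adding the two metadata fields narrows the result to the single document you were probably after.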
6. Miscellaneous Considerations – OK, so I have perhaps convinced you to leverage the power of OCR, or dissuaded you from it. If either is true, then this article has done at least part of what I intended. The most important thing to remember is that the technology is not bulletproof (yet), so do not use it as your sole means of search and retrieval. Before I go, here are a few other things to think about.
a. Speed – While one may think that the few seconds it takes for an OCR engine to process a page couldn’t possibly be a show-stopper, consider this: I have implemented solutions where it has taken over six months to convert images to text. The sheer volume of some collections can create this type of bottleneck. Keep that in mind during design, and take into account your computers’ processing power (processor, memory, disk speed and capacity) and network bandwidth. Some OCR engines are much faster than others. But keep in mind that in general you trade speed for accuracy.
b. Format – There have been many improvements in the software’s ability to replicate a page’s format. Tables and columns can be challenging for some software. When deciding on a product, remember that some are better than others and nobody’s perfect.
c. Cost – You can spend $49 for an inexpensive software package online or at your local Wally World and you can spend tens or even hundreds of thousands of dollars on software capabilities consistent with what you need. If your needs have any level of complexity, contact an integrator for advice. Do your homework!
d. Related and Emerging Technologies – If check boxes, bubbles and such disturb your sleep, you can relax. Optical Mark Recognition (OMR) is a fully robust capability (meaning it works) and is available in many software packages. And as I said before, ICR can be used to leverage handwriting, but I consider the technology limited. It works best when the handwriting is ‘constrained’, such as when you fill in those little boxes with the individual letters of your name on some form. Other technologies are emerging, such as Intelligent Word Recognition or IWR: “Intelligent Word Recognition is the recognition of unconstrained handwritten words. IWR recognizes entire handwritten words or phrases instead of character-by-character, like its predecessor. IWR technology matches handwritten or printed words to a user-defined dictionary, significantly reducing character errors encountered in typical character-based recognition engines.” Suffice it to say that the technology continues to improve, but make sure to align your expectations with actual capabilities given your content and environment.
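The dictionary-matching idea in the IWR description can be illustrated with a few lines of Python. To be clear, this is not how an IWR engine works internally (those operate on the word image itself); it is only a text-level sketch of why matching whole words against a user-defined dictionary cuts down on character errors. The word list, the cutoff value and the function name are my own assumptions. It uses difflib from the Python standard library.

```python
import difflib

# Hypothetical user-defined dictionary of words expected on a form.
LEXICON = ["invoice", "payment", "received", "balance"]


def match_word(raw_word, lexicon=LEXICON, cutoff=0.6):
    """Return the closest dictionary word, or None if nothing is close enough.

    The cutoff is a similarity ratio between 0 and 1; 0.6 is simply
    difflib's default, not a tuned value.
    """
    matches = difflib.get_close_matches(raw_word.lower(), lexicon,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for raw in ["invo1ce", "paymemt", "zzzz"]:
        print(raw, "->", match_word(raw))
```

A word with one misread character still lands on the right dictionary entry, while a string that resembles nothing in the dictionary is rejected rather than guessed at. The trade-off is that anything missing from your dictionary cannot be recognized at all.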
I hope that this article has been helpful to you. Should you have any questions, feel free to contact me. No cost, no obligation. Thanks for reading!
Director of Operations, Federal & Mid-Atlantic