Pulling text out of images has never been easier than it is today thanks to optical character recognition (OCR) technology.
OCR allows us to do all kinds of useful things, like searching for images using text queries, reproducing documents without typing them out by hand, and even converting handwritten text to digital text.
But what is optical character recognition? How does it actually work? It may seem like black magic to you, but by the end of this article, you’ll have a solid understanding of how computers can recognize letters and words.
How Optical Character Recognition Works
To understand how text gets extracted from an image, we first have to understand what images are and how they’re stored on computers.
A pixel is a single dot of a particular color. An image is essentially a collection of pixels. The more pixels in an image, the higher its resolution. A computer doesn’t know that an image of a signpost is really a signpost—it just knows that the first pixel is this color, the next pixel is that color, and displays all of its pixels for you to see.
This means text and non-text are no different to a computer, and that’s why optical character recognition is so difficult. With that in mind, here’s how it works.
Step 1: Pre-Processing the Image
Before text can be pulled, the image needs to be massaged in certain ways to make extraction easier and more likely to succeed. This is called pre-processing, and different software solutions use different combinations of techniques.
The more common pre-processing techniques include:
Binarization: Every single pixel in the image is converted to either black or white. The goal is to make it clear which pixels belong to text and which belong to the background, which speeds up the actual OCR process.
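To make this concrete, here's a minimal sketch of binarization in Python. The fixed threshold is a made-up constant for illustration; real engines usually pick the cutoff adaptively (Otsu's method is a common choice):

```python
THRESHOLD = 128  # assumed cutoff on a 0-255 grayscale, purely illustrative

def binarize(gray):
    """Map a 2D grid of 0-255 grayscale values to 1 (background) / 0 (text)."""
    return [[1 if px > THRESHOLD else 0 for px in row] for row in gray]

# A tiny fake scan: bright background with a few dark "ink" pixels.
scan = [
    [250, 240,  30, 245],
    [248,  25,  20, 250],
]
print(binarize(scan))  # → [[1, 1, 0, 1], [1, 0, 0, 1]]
```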
De-skewing: Since documents are rarely scanned with perfect alignment, characters may end up slanted or even upside-down. The goal here is to identify horizontal text lines and then rotate the image so that those lines are actually horizontal.
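Here's a toy sketch of one way skew can be estimated: try candidate angles, shear the ink pixels by each one, and keep the angle where pixels pile up most sharply into rows. The pixel coordinates and search range are invented for the example:

```python
import math
from collections import Counter

def skew_score(ink_pixels, angle_deg):
    """Shear each ink pixel vertically by x*tan(angle) and measure how
    sharply the pixels concentrate into rows (higher = better aligned)."""
    t = math.tan(math.radians(angle_deg))
    rows = Counter(round(y - x * t) for x, y in ink_pixels)
    return sum(n * n for n in rows.values())

def estimate_skew(ink_pixels, search=range(-10, 11)):
    """Try each candidate angle and return the one with the sharpest profile."""
    return max(search, key=lambda a: skew_score(ink_pixels, a))

# A fake "text line" of ink pixels drawn at roughly a 5-degree slant:
line = [(x, round(x * math.tan(math.radians(5)))) for x in range(60)]
print(estimate_skew(line))  # → 5
```

A real engine would then rotate the image by the negated angle; proper rotation with interpolation is best left to an image library.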
Despeckling: Whether or not the image has been binarized, there may be noise that can interfere with the identification of characters. Despeckling gets rid of that noise and tries to smooth out the image.
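One common despeckling approach is a median filter; here's a toy version for a binary image (the noisy grid is made up for illustration):

```python
def despeckle(img):
    """3x3 median filter on a binary image: each interior pixel becomes
    the majority value of its neighborhood, wiping out isolated specks."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            neighborhood = [img[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = sorted(neighborhood)[4]  # median of 9 values
    return out

noisy = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],   # a lone speck of noise
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(despeckle(noisy)[1][1])  # → 0: the speck is gone
```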
Line removal: This step identifies all lines and markings that likely aren't characters, then removes them so the actual OCR process doesn't get confused. It's especially important when scanning documents with tables and boxes.
Zoning: This step separates the image into distinct chunks of text, such as identifying columns in a multi-column document.
Step 2: Processing the Image
First things first, the OCR process tries to establish the baseline for every line of text in the image (or, if it was zoned in pre-processing, it works through each zone one at a time). Each identified line of characters is then handled one by one.
For each line of characters, the OCR software identifies the spacing between characters by looking for vertical columns that contain no text pixels (which should be obvious after proper binarization). Each chunk of pixels between these empty columns is marked as a “token” that represents one character. Hence, this step is called tokenization.
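A minimal sketch of that tokenization step, assuming a binarized line where 0 is background and 1 is ink:

```python
def tokenize(line_img):
    """Split a binarized text line (0 = background, 1 = ink) into tokens
    by cutting at columns that contain no ink at all."""
    w = len(line_img[0])
    ink_cols = [any(row[x] for row in line_img) for x in range(w)]
    tokens, start = [], None
    for x, has_ink in enumerate(ink_cols):
        if has_ink and start is None:
            start = x                      # a new character begins
        elif not has_ink and start is not None:
            tokens.append((start, x))      # the character ended at the gap
            start = None
    if start is not None:
        tokens.append((start, w))
    return tokens

# Two "characters" separated by one blank column:
line = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
]
print(tokenize(line))  # → [(0, 2), (3, 4)]
```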
Once all of the potential characters in the image are tokenized, the OCR software can use two different techniques to identify what characters those tokens actually are:
Pattern recognition: Each token is compared pixel-by-pixel against an entire set of known glyphs—including numbers, punctuation, and other special symbols—and the closest match is picked. This technique is also known as matrix matching.
There are several drawbacks here. First, the tokens and glyphs need to be of similar size or else none of them will match. Second, the tokens need to be in a similar font as the glyphs, which rules out handwriting. But if the token’s font is known, pattern recognition can be fast and accurate.
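Here's matrix matching in miniature. The 3x3 glyph templates are made up for illustration; a real engine stores much larger templates, often one set per supported font:

```python
GLYPHS = {  # tiny invented templates, purely illustrative
    "I": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "L": [[1, 0, 0],
          [1, 0, 0],
          [1, 1, 1]],
}

def match_glyph(token):
    """Score each known glyph by how many pixels agree with the token
    and return the best match (matrix matching)."""
    def score(glyph):
        return sum(t == g
                   for trow, grow in zip(token, glyph)
                   for t, g in zip(trow, grow))
    return max(GLYPHS, key=lambda name: score(GLYPHS[name]))

noisy_L = [[1, 0, 0],
           [1, 0, 0],
           [1, 1, 0]]   # an "L" with one pixel missing
print(match_glyph(noisy_L))  # → L
```

Notice why the size and font caveats matter: if the token weren't 3x3, or were drawn in a very different style, the pixel-agreement scores would be meaningless.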
Feature extraction: Each token is compared against a set of rules that describe what kind of character it might be. For example, two equal-height vertical lines connected by a single horizontal line is likely to be a capital H.
This technique is useful because it isn’t limited to certain fonts or sizes. It can also be more nuanced in recognizing the subtle differences between a capital I, lowercase L, and the number 1. The downside? Programming the rules is much more complex than simply comparing the pixels in a token to the pixels in a glyph.
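A toy sketch of the rule-based idea, using two crude structural features (full-height vertical strokes and full-width horizontal bars). The rules here are invented and far simpler than anything a real engine would use:

```python
def features(token):
    """Extract crude structural features from a binary token:
    the number of full-height vertical strokes and full-width bars."""
    h, w = len(token), len(token[0])
    strokes = sum(all(token[y][x] for y in range(h)) for x in range(w))
    bars = sum(all(token[y][x] for x in range(w)) for y in range(h))
    return strokes, bars

def classify(token):
    """Hand-written rules instead of pixel templates (a toy sketch)."""
    strokes, bars = features(token)
    if strokes == 2:
        return "H"   # two vertical strokes joined somewhere in between
    if strokes == 1 and bars == 0:
        return "l"   # a single lone stroke
    return "?"

H = [[1, 0, 1],
     [1, 1, 1],
     [1, 0, 1]]
print(classify(H))  # → H
```

Because the rules describe structure rather than exact pixels, the same classifier works no matter how large the token is, which is exactly the advantage over matrix matching.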
Step 3: Post-Processing the Image
Once all the token matching is finished, the OCR software could just call it a day and present the results to you. But usually a bit more fudging needs to be done to make sure you aren’t rolling your eyes at gibberish results.
Lexicon correction: All words are compared against a lexicon of approved words—a dictionary is one example of a lexicon—and any that don't match are replaced with the closest-fitting word. This can help fix words with misread characters, like turning “th0rn” back into “thorn”.
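Here's a minimal sketch of lexicon correction using edit distance; the tiny lexicon is a stand-in for a real dictionary:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete
                           cur[j - 1] + 1,               # insert
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

LEXICON = {"thorn", "threw", "north", "torn"}  # stand-in for a dictionary

def correct(word):
    """Snap an OCR'd word to the closest word in the lexicon."""
    return min(LEXICON, key=lambda w: edit_distance(word, w))

print(correct("th0rn"))  # → thorn
```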
Application-specific OCR: When OCR is used in niche settings, such as for medical or legal documents, a special kind of OCR may be used that's designed for that setting. In these cases, the OCR software may look for math equations, industry-specific terms, etc.
Language modeling: This advanced technique corrects sentences by using a language model that describes how likely certain words are to be followed by other words. It's similar to the technology that predicts the next word you want to type on a mobile keyboard.
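A toy sketch of the idea, using made-up bigram counts in place of a trained language model:

```python
from collections import Counter

# Bigram counts "learned" from some corpus (the numbers are invented):
BIGRAMS = Counter({
    ("the", "horse"): 50,
    ("the", "house"): 120,
    ("red", "horse"): 3,
})

def pick(prev_word, candidates):
    """Given the previous word, choose the candidate word the language
    model says is most likely to follow it."""
    return max(candidates, key=lambda w: BIGRAMS[(prev_word, w)])

# The matcher couldn't decide between "horse" and "house" after "the":
print(pick("the", ["horse", "house"]))  # → house
```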
When done well, this can result in text that’s remarkably readable.
Recommended Optical Character Recognition Tools
Now that you know how OCR works, it should be easy to see that not all OCR tools are made equal. The accuracy of your results will depend heavily on how well the software implements the various OCR techniques discussed in this article.
We highly recommend OneNote for this, which is just one reason why it beats Evernote for note-taking. If you’re willing to pay for a premium solution, consider OmniPage. See our comparison of OneNote vs. OmniPage for OCR. For mobile documents, you’ll want to check out these OCR apps for Android devices.
How do you use OCR? Have any favorite OCR tools we didn’t mention? Let us know in the comments below!