Review Roundup

Top-3 tools for recognizing text inside scans

Scanned hard copies are often no more than simple images with no recognized text.

Given the state of record keeping at some organizations, it can sometimes seem like a major coup when reporters receive documents from governments and other sources in electronic format — until they look a little closer.

Frequently, we end up with little more than an image: an unsearchable, unselectable scan of a hard copy record.

That’s where optical character recognition can help. OCR software uses pattern recognition to convert images of letters into text, embedding this copy right into the document. The result, at least ideally, is a fully searchable PDF with text that can be copied, pasted and analyzed by computer much faster than we could manage manually.

OCR isn’t new, but it’s still not perfect. There are also a lot of options. So far, we’ve tested five of the most popular among journalists.

If you’re looking for a good tool to use on a specific sort of document, click through from each review to our test results for a more detailed breakdown. We tested these products on a set of memos, government forms, printed transcripts and a report with partially recognized text.
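For readers who want to run a similar comparison on their own documents, here is a minimal sketch of one common way to score OCR accuracy: count the character-level edits needed to turn a tool's output into a hand-typed reference copy. This is an illustration, not the method used in our reviews. The function names and the sample strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-character insertions,
    deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of the reference text the OCR tool got right, from 0.0 to 1.0.

    Assumption: accuracy is defined as 1 - (edits / reference length),
    floored at zero. Other definitions exist.
    """
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Hypothetical sample: a hand-typed line and what an OCR tool returned.
    reference = "The quick brown fox"
    ocr_output = "The qu1ck brovvn fox"
    print(f"Accuracy: {character_accuracy(reference, ocr_output):.1%}")
```

Scoring the same page from each tool against the same reference copy gives a rough, like-for-like number. It will not tell you whether the errors landed on the names and figures that matter to your story.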

We’ve also boiled everything down to our top-three recommendations.

With DocumentCloud, free is hard to beat

Optical character recognition is only part of what DocumentCloud does, and it’s a mediocre solution. But as a free service that includes a slew of other document management and analysis features, it’s an excellent first stop for most projects.

“DocumentCloud is like a Leatherman for reporters,” Dave Gulliver, who tested and reviewed the product for the Reporters’ Lab, said. “It’s not the same as having a workshop of fancy power tools, but it can do a bunch of jobs. And it has a few widgets that look interesting, even if you’re not quite sure what to do with them.”

One of the best parts of DocumentCloud is that it’s a tailor-made solution for reporters (you have to be a journalist to get access). That means the team behind it is constantly on the lookout for new, relevant features and is far more responsive to questions and comments from its narrower user base.

Sign up for an account and you won’t regret it.

More accuracy, conversion options with OmniPage

If you’re looking for a more accurate OCR solution, OmniPage may be worth its $150 price tag. The cost is even easier to justify given its long list of features, which include its ability to turn tables locked away in PDFs into sortable spreadsheets.

When it came to OCR, OmniPage was one of our top performers. And although there is a bit of a learning curve, it’s not as steep when you’re just trying to recognize text. Details like a split-pane editor and a proofreader also allow for fixes, which can be saved and incorporated into the PDF file.

OmniPage’s weakness is with larger files. They can take a long time to process, and the program can even die mid-document.

ABBYY FineReader excels as a standalone solution

ABBYY FineReader, on the other hand, had no problems with these massive files. It tore through them quickly and with slightly more accurate results than OmniPage.

That’s not surprising, since OCR is all FineReader does.

It’s a little more expensive ($170), meaning these sorts of documents need to be common in your newsroom to justify the cost. But if it’s the way you end up going, it will definitely save you time.

“We all know the pain of those non-searchable pdfs. And it’s usually a 100-page report. Although not perfect, ABBYY made it easy to convert the document and find what I needed,” reviewer Jennifer Wig said. “I’d definitely recommend this to journalists trying to sift through a lot of data for the gold nuggets.”

While these three options will certainly reduce your workload, they can’t solve all the problems with transforming hard copies into electronic text.

Some existing products can recognize handwriting, for example, but only after extensive manual training. Noisy scans, which obscure text with markings and make it harder to process, are as tricky as small type and low-resolution images.

Problems like that will certainly mean more work, but these three products are your best bet for taking care of everything else.

About Tyler Dukes

Tyler Dukes is the managing editor for Reporters' Lab, a project through Duke University's DeWitt Wallace Center for Media and Democracy. Follow him on Twitter as @mtdukes.
