
Wow - this is amazing.

Right this very moment (well, a few moments ago when I wasn't procrastinating on HN) I was in the midst of extracting data from a client's old website in preparation for creating a new one.

A lot of that data is contained within images.

From a few preliminary tests, I'm hugely impressed. This seems on par with any other OCR software I've used, and the fact that it happens in real time in the browser is amazing.

I tried it on a piece of content, originally in an image, that I'd just had to type out by hand. Typing out the content took about 10 minutes. Copying and pasting with Naptha, then making some minor edits/corrections, did the same thing in about 2 minutes.



There's actually been a bit of research on the error rates you need to beat for OCR to be cost-effective vs. having people re-type. I don't have the references handy, but I believe it's generally cost-effective to OCR with error rates up to nearly 2%, and most current "consumer grade" OCR is well below 1% error rates for anything but atrociously poor-quality scans.

My MSc thesis was on reducing OCR error rates through various forms of pre-processing. While I managed to get some reduction in error rates, one of the things I found was that, given how low the error rates generally were to begin with, you have a very tiny budget of extra processing time before further error reduction just isn't worth it - if a human needs to check the document for errors anyway, a "quick and dirty" scan+OCR is often far better than spending the time to get "as good as possible" results. Spending even a few extra seconds per page to place the page perfectly in the scanner, or waiting a few extra seconds for more complicated processing, can be a net loss.

It's a perfect example of "worse is better": OCR, at least for typed text, is good enough today that the best available solutions aren't really worth spending resources on (for users) unless/until they give results so perfect they don't need to be checked by a person afterwards.


A friend suggested that to get good OCR results, you run the document through the scanner/OCR twice, then diff the results. Usually one or the other will get it right, and if you run the two results through a difference editor like 'meld', it's quick to fix.
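
Roughly, the workflow looks like this (just a sketch, assuming the tesseract CLI is doing the OCR; the scan file names are made up):

    import subprocess

    # OCR each scan; `tesseract input outputbase` writes outputbase.txt
    for scan in ("scan1.png", "scan2.png"):
        subprocess.run(["tesseract", scan, scan.rsplit(".", 1)[0]], check=True)

    # Reconcile the two transcriptions by hand in meld
    subprocess.run(["meld", "scan1.txt", "scan2.txt"])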


That may work in some cases, especially with horrible OCR engines and low-quality scanners, but frankly, when I did my research into this, the results varied extremely little from run to run, and you could usually identify specific artefacts in the source that tripped the engine up (rather than problems with the quality of the scan) - e.g. letters that were damaged or had run together, creases in the paper, etc.

With really low-res scanners I can imagine it could make a big difference.


Back in the late '90s I worked for a company that did a lot of OCRing, and they ran the same image through multiple engines and then manually corrected the results. I think they had 3 engines, all from different companies, which processed all images and put the results into a custom format. Human beings were then employed to manually merge and correct the final text. It worked fairly well, especially considering the hardware/software available at the time.

The biggest problem was stuffing too many files into an NTFS directory. Apparently, NTFS didn't like tens of thousands of files in one directory. :)


What about running it through two+ different OCR engines?


If this is done all in software (i.e., it isn't analyzing a slightly different image), why wouldn't the OCR just do this itself?


Somebody's got to decide which way to go with the diffs.


Majority vote out of an odd number of runs?
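
Something like a line-level vote, say (a sketch; the run files are made up, and it assumes the engines segment lines roughly the same way):

    from collections import Counter

    def majority_vote(paths):
        # Read each OCR pass and vote line by line; zip() stops at the
        # shortest file, so badly mismatched line segmentation loses text.
        runs = [open(p, encoding="utf-8").read().splitlines() for p in paths]
        merged = []
        for lines in zip(*runs):
            winner, _count = Counter(lines).most_common(1)[0]
            merged.append(winner)
        return "\n".join(merged)

    print(majority_vote(["run1.txt", "run2.txt", "run3.txt"]))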


"There's actually been a bit of research on the error rates you need to beat for OCR to be cost-effective vs. having people re-type."

Doesn't that depend entirely on what you're using the text for and how accurate it needs to be?


To a certain extent, of course. The 2% was based on the assumption that if you are benchmarking against re-typing, you expect the same kind of quality you'd get from having a good typist re-typing the documents.

From my own experiments, I tend to find that you can read through and correct errors only marginally faster than you can type, because you either follow along with the cursor or need to be able to position the cursor quickly when you find an error. As the error rate increases, jumping the cursor to each error soon becomes too slow.
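
As a back-of-the-envelope calculation (all numbers below are purely illustrative assumptions, not figures from the research mentioned above):

    RETYPE_WPM   = 50    # assumed speed of a good typist re-typing from the page
    PROOF_WPM    = 100   # assumed reading speed while following along for errors
    SECS_PER_FIX = 30    # assumed time to locate, click into and correct one error

    def minutes_to_correct(words, error_rate):
        return words / PROOF_WPM + (words * error_rate * SECS_PER_FIX) / 60.0

    def minutes_to_retype(words):
        return words / RETYPE_WPM

    words = 1000
    for rate in (0.005, 0.01, 0.02, 0.05):
        print(f"{rate:.1%}: correct {minutes_to_correct(words, rate):.1f} min"
              f" vs retype {minutes_to_retype(words):.1f} min")

With those assumed speeds the break-even lands right around 2%, but shift the per-error fix time or the proofreading speed and it moves a lot.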

Dropping accuracy in your effort to correct the text doesn't really seem to speed things up much. You can likely speed it up if you're willing to assume that anything that passes the spellchecker is ok (but it won't be, especially as modern OCR engines often rely on data about letter sequences, or dictionaries, when they're uncertain about characters).
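
For illustration, the spellchecker shortcut amounts to something like this (a sketch; /usr/share/dict/words is just an assumed word list):

    import re

    # Any dictionary file with one word per line would do here.
    with open("/usr/share/dict/words", encoding="utf-8") as f:
        dictionary = {line.strip().lower() for line in f}

    def flag_suspect_words(ocr_text):
        # Yield words the dictionary doesn't recognise -- candidate OCR errors.
        # The weakness: misreads that happen to be real words ("cat" -> "eat")
        # sail straight through.
        for word in re.findall(r"[A-Za-z']+", ocr_text):
            if word.lower() not in dictionary:
                yield word

    print(list(flag_suspect_words("Tlie quick brown fox")))   # -> ['Tlie']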

If you're ok with lower accuracy, e.g. for search, and the alternative is not processing the document at all, then the calculation is drastically different.


Time is not as relevant as energy when we are talking about people whose jobs involve a lot of strain.



