This is really cool. Use of GPT-3 to augment OCR is an amazing (and, retrospectively, obvious) insight and a great immediate use case for these language models.
I wish Remarkable took this idea -- they really oversold their OCR capabilities[1]. It works great in their support and promo videos, but I found the actual performance to be absolutely terrible.
It's an old idea, using a language model on top of character-level OCR. It works well for general text, but it doesn't solve random sequences of digits and letters, so you can't use it to correct your invoices, which are full of out-of-dictionary tokens.
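A minimal sketch of why that is (helper and dictionary are hypothetical): snapping OCR output to the nearest in-vocabulary word fixes ordinary text, but an invoice number gives the dictionary nothing to match against.

```python
# Dictionary-based correction fixes in-vocabulary words but has no
# signal for arbitrary ID strings like invoice numbers.
from difflib import get_close_matches

DICTIONARY = {"invoice", "total", "amount", "payment", "hello"}

def correct_token(token: str) -> str:
    """Snap an OCR'd token to the closest dictionary word, if any."""
    matches = get_close_matches(token.lower(), DICTIONARY, n=1, cutoff=0.8)
    return matches[0] if matches else token  # out-of-dictionary: unchanged

print(correct_token("inv0ice"))   # close to "invoice", so it gets fixed
print(correct_token("INV-8O12"))  # no prior distinguishes 8O12 from 8012
```

The same limitation applies to real language models: they can only correct toward strings they assign high probability, and a random serial number has no such prior.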
I've always found it somewhat ironic that a human can correctly recognise printed characters even if parts of them are missing and the word is misspelt, or in a language the human does not know at all, yet computers have to resort to language models. Why is an exact comparison of one part of the image with other parts of the image (where the same letter is printed in the same font) for some reason not feasible?
Humans must be using a language model for image recognition when reading, though. Otherwise things like failing to spot that the the is a duplicate word wouldn't happen so often.
Another interesting quirk of human reading is that it is pretty agnostic to the actual order of letters in a word, as long as the first and last ones are placed correctly.
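A quick demo of the transformation being described (a toy script, not a cognition claim): shuffle the interior letters of each word while keeping the first and last letters fixed, and the result usually stays readable.

```python
# Scramble interior letters; words of length <= 3 have no interior to shuffle.
import random

def scramble_word(word: str, rng: random.Random) -> str:
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

rng = random.Random(0)
text = "reading scrambled words is surprisingly easy"
print(" ".join(scramble_word(w, rng) for w in text.split()))
```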
I think people use font models rather than just language models. The post office is perhaps the best example: despite a known list of addresses and an intended format, they still back up OCR with people.
> The post office is perhaps the best example where despite a known list of addresses and intended format they still backup OCR with people.
To be fair, the number of people who write their own address incorrectly is staggering.
I'm in eCommerce and easily 70%+ of addresses have some sort of minor error in them. Around 5-10% are just plain bizarre, with things like two suburbs being included or the street name not including the Rd/Av/Dr etc.
I'd suspect this is one of those problems that seems easy in the lab but quickly degenerates when you consider the human aspect of it.
People do things like include two suburbs because they know that (eg) in order for it to reach their address correctly, it must first reach the human at point A that will correctly pass it to the human at point B, where it otherwise cannot arrive at point B because point B does not receive normal postal service, possibly because of a jurisdiction dispute in 1974 that placed their address in a zip code that’s different from all of their neighbors.
Can you tell that I know a bunch of people with these kinds of issues? You may already know that programmers mess up names all the time, telling people that their last name must be their surname, or that you can’t be an O’Reilly or a Robertson-Peele, or that “Mary Anne” is not a single name, or that your middle name cannot be your “primary” name, or that you must have more than one name, or that your legal name is invalid because it’s not the name you were born with, or that all 3yos have names, etc.
Well, take all of those issues, and add the vagaries of geography, and you’ve got mail delivery.
I have a weird address, and easily 30% of websites insist on fucking it up by applying validation rules that might make sense from 20,000 feet, but don't actually work in practice for our address.
The most straightforward of them is that some validation services insist that our ZIP code is for the next town over instead of the one we live in, which has its own post office. Nothing correct happens if our mail goes to the wrong post office because they (rightly) have no idea how to deliver mail to us.
I wouldn't be so confident that 100% of that 70% don't know their own address. For at least some of those cases, I'm willing to bet they know something you don't about the vagaries of mail and package delivery to their address.
> using a language model on top of character level OCR
But if you know you're going to use a language model after the OCR, then you don't commit to a single character; instead you output a distribution of candidate characters with their similarity scores (e.g. discarding the least similar candidates, or clipping at a certain similarity threshold). Then the language model has more to work with (although, TBH, its job becomes more complicated).
If a dictionary satisfies your definition of a language model, yes, with predictably poor results[1]. If I understand correctly, Google Books approach[2] represented a major improvement in accuracy of automated OCR (and this is for printed text!), but I would venture to say that implementing a language model like this would be far beyond the scope of a 'tiny project'.
“The error occurs because image segments that are considered identical by the pattern matching engine of the Xerox scan copiers are only saved once and get reused across the page. If the pattern matching engine does not work accurately, image segments get replaced by other segments that are not identical at all, e.g. a 6 gets replaced by an 8.”
It would also be great if you could delete a page by taking a picture of the notebook entry completely scribbled out, or a video of tossing it into the fire. We need a product roadmap for this!
[1] https://support.remarkable.com/hc/en-us/articles/36000266143...