This is really cool. Use of GPT-3 to augment OCR is an amazing (and, retrospectively, obvious) insight and a great immediate use case for these language models.
I wish Remarkable took this idea -- they really oversold their OCR capabilities[1]. It works great in their support and promo videos, but I found the actual performance to be absolutely terrible.
It's an old idea, using a language model on top of character-level OCR. It works well for general text, but it doesn't solve random sequences of digits and letters, so you can't use it to correct your invoices, which are full of out-of-dictionary tokens.
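A minimal sketch of why that is (helper and dictionary are hypothetical): snapping OCR output to the nearest in-vocabulary word fixes ordinary text, but an invoice number gives the dictionary nothing to match against.

```python
# Dictionary-based correction fixes in-vocabulary words but has no
# signal for arbitrary ID strings like invoice numbers.
from difflib import get_close_matches

DICTIONARY = {"invoice", "total", "amount", "payment", "hello"}

def correct_token(token: str) -> str:
    """Snap an OCR'd token to the closest dictionary word, if any."""
    matches = get_close_matches(token.lower(), DICTIONARY, n=1, cutoff=0.8)
    return matches[0] if matches else token  # out-of-dictionary: unchanged

print(correct_token("inv0ice"))   # close to "invoice", so it gets fixed
print(correct_token("INV-8O12"))  # no prior distinguishes 8O12 from 8012
```

The same limitation applies to real language models: they can only correct toward strings they assign high probability, and a random serial number has no such prior.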
I've always found it somewhat ironic that a human can correctly recognise printed characters even if parts of them are missing and the word is misspelt, or in a language the human does not know at all, yet computers have to resort to language models. Why is an exact comparison of one part of the image with other parts of the image (where the same letter is printed in the same font) for some reason not feasible?
Humans must be using a language model for image recognition when reading, though. Otherwise things like failing to spot that the the is a duplicate word wouldn't happen so often.
Another interesting quirk of human reading is that it is pretty agnostic to the actual order of letters in a word, as long as the first and last ones are placed correctly.
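A quick demo of the transformation being described (a toy script, not a cognition claim): shuffle the interior letters of each word while keeping the first and last letters fixed, and the result usually stays readable.

```python
# Scramble interior letters; words of length <= 3 have no interior to shuffle.
import random

def scramble_word(word: str, rng: random.Random) -> str:
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

rng = random.Random(0)
text = "reading scrambled words is surprisingly easy"
print(" ".join(scramble_word(w, rng) for w in text.split()))
```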
I think people use font models rather than just language models. The post office is perhaps the best example: despite a known list of addresses and an intended format, they still back up OCR with people.
> The post office is perhaps the best example where despite a known list of addresses and intended format they still backup OCR with people.
To be fair, the number of people who write their own address incorrectly is staggering.
I'm in eCommerce and easily 70%+ of addresses have some sort of minor error in them. Around 5-10% are just plain bizarre, with things like two suburbs being included or the street name not including the Rd/Av/Dr etc.
I'd suspect this is one of those problems that seems easy in the lab but quickly degenerates when you consider the human aspect of it.
People do things like include two suburbs because they know that (eg) in order for it to reach their address correctly, it must first reach the human at point A that will correctly pass it to the human at point B, where it otherwise cannot arrive at point B because point B does not receive normal postal service, possibly because of a jurisdiction dispute in 1974 that placed their address in a zip code that’s different from all of their neighbors.
Can you tell that I know a bunch of people with these kinds of issues? You may already know that programmers mess up names all the time, telling people that their last name must be their surname, or that you can’t be an O’Reilly or a Robertson-Peele, or that “Mary Anne” is not a single name, or that your middle name cannot be your “primary” name, or that you must have more than one name, or that your legal name is invalid because it’s not the name you were born with, or that all 3yos have names, etc.
Well, take all of those issues, and add the vagaries of geography, and you’ve got mail delivery.
I have a weird address, and easily 30% of websites insist on fucking it up by applying validation rules that might make sense from 20,000 feet, but don't actually work in practice for our address.
The most straightforward of them is that some validation services insist that our ZIP code is for the next town over instead of the one we live in, which has its own post office. Nothing correct happens if our mail goes to the wrong post office because they (rightly) have no idea how to deliver mail to us.
I wouldn't be so confident that 100% of that 70% don't know their own address. For at least some of those cases, I'm willing to bet they know something you don't about the vagaries of mail and package delivery to their address.
> using a language model on top of character level OCR
But if you know you're going to use a language model after the OCR, then you don't commit to a single character; instead you output a distribution of candidate characters with their similarity scores (e.g. discarding the least similar candidates, or clipping at a certain similarity threshold). Then the language model has more to work with (although, TBH, its job becomes more complicated).
If a dictionary satisfies your definition of a language model, yes, with predictably poor results[1]. If I understand correctly, Google Books approach[2] represented a major improvement in accuracy of automated OCR (and this is for printed text!), but I would venture to say that implementing a language model like this would be far beyond the scope of a 'tiny project'.
“The error occurs because image segments that are considered identical by the pattern matching engine of the Xerox scan copiers are only saved once and get reused across the page. If the pattern matching engine does not work accurately, image segments get replaced by other segments that are not identical at all, e.g. a 6 gets replaced by an 8.”
It would also be great if you could delete a page by taking a picture of the notebook entry completely scribbled out, or a video of tossing it into the fire. We need a product roadmap for this!
[1] https://support.remarkable.com/hc/en-us/articles/36000266143...