
I tested it out with a bunch of personal documents. Results were disappointing. Did not match up with the promised scores, not even slightly.

I think the traditional approach to scanning and classifying without AI/ML is the way to go, for the next 5 years at the very least.




https://cloud.google.com/use-cases/ocr

For my use cases, this has beaten all "traditional approaches" for at least a few months now. That's just going by when I first stumbled across it; no clue how long it's been a thing.
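
If you want to kick the tires, the Python client usage is roughly this (a minimal sketch; the filename is a placeholder and you need GCP credentials configured):

```python
# pip install google-cloud-vision; assumes GCP credentials are set up
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("scan.png", "rb") as f:
    image = vision.Image(content=f.read())

# document_text_detection is the variant tuned for dense scanned documents
response = client.document_text_detection(image=image)
print(response.full_text_annotation.text)
```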


Have you tried Azure's OCR? https://learn.microsoft.com/en-us/azure/cognitive-services/c.... Is it comparable to Google's?


I did some OCR tests on some 1960s-era documents (all in English), a mix of typed and handwritten. My results:

Google Vision: 95.62% HW - 99.4% Typed

Amazon Textract: 95.63% HW - 99.3% Typed

Azure: 95.9% HW - 98.1% Typed

If curious, TrOCR was the best FOSS solution at 79.5% HW and 97.4% Typed. (However, it took roughly 200x longer than Tesseract, which scored 43% HW and 97.0% Typed.)
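
For anyone wanting to poke at TrOCR, the standard Hugging Face usage is roughly this (a minimal sketch, not my exact test harness; the checkpoint is the public handwritten one and the filename is a placeholder). Note TrOCR works on single text-line crops, not whole pages:

```python
# pip install transformers pillow torch
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("line.png").convert("RGB")  # one cropped line of text
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```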


When did you do this test? I don't have any numbers handy, but a couple years ago I compared google's OCR vs AWS's on "text in the wild" pictures. AWS' wasn't bad, but it was definitely outperformed by the google one. The open-source solutions I tried (tesseract and some academic deep-learning code) were far behind.


This was a couple months ago now, so not that long ago. For OCR I have found that it depends highly on the type of image you are looking at. In my case these were all scanned documents of good but not great scan quality, all in English. I expect if you were talking about random photos with text in them, you'd see the FOSS solutions do much worse, and much more variance between Google, Amazon, and Azure. I would be curious about the academic deep learning one you tried.


The main one was https://github.com/JaidedAI/EasyOCR, mostly because, as promised, it was pretty easy to use, and it uses pytorch (which I preferred in case I wanted to tweak it). It has been updated since, but at the time it was using CRNN, which is a solid model, especially for the time - it wasn't (academic) SOTA, but not far behind. I'm sure I could've coaxed better performance out of it with some retraining and hyperparameter tuning.
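
For reference, usage really is about as minimal as it gets (a sketch; the filename is a placeholder):

```python
# pip install easyocr
import easyocr

reader = easyocr.Reader(["en"])  # downloads detection + recognition models on first run
for bbox, text, confidence in reader.readtext("photo.png"):
    print(f"{confidence:.2f}  {text}")
```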


Interesting. I tried EasyOCR and found it was about 35% on handwriting and 95.7% on typed text, so not bad at all for typed, but pretty bad for handwriting. I focused on Tesseract and TrOCR since it wasn't working out that well, though that could easily have just been my particular use case.

I also tested PaddleOCR and keras-ocr to round them all out.

At some point I really need to finish my project enough to write up some blog articles and post a bunch of code repos for others to use.


Do they feed your data to reCAPTCHA?


I did not check. I also never checked whether they share my emails on Google Search with you -- but I trust their ambition not to be sued into the ground for doing something immensely stupid.

Leaking sensitive data of enterprise customers as training material for public recaptchas falls in that category.


Google has scanned 40 million+ physical books and magazines, which it ran through OCR to produce digital text.

So that's at least one source of training data.


What would you recommend for classifying documents? Most of the companies I've evaluated market their product as using fancy AI/ML, but instead they have hundreds of people, usually in India, manually classifying the documents.


I strongly believe everything just has to go through OpenAI or Anthropic, for now. These models are significantly better than any NLP models I try swapping in.

But this isn’t much help if you must classify images.
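
Concretely, the whole "classifier" can be one prompt. A hedged sketch with the OpenAI Python client; the labels, prompt wording, and model name are all made-up placeholders:

```python
# pip install openai; reads OPENAI_API_KEY from the environment
from openai import OpenAI

client = OpenAI()
LABELS = ["invoice", "contract", "resume", "other"]  # hypothetical label set

def classify(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever model you have access to
        messages=[
            {"role": "system",
             "content": "Classify the document into exactly one of these labels: "
                        + ", ".join(LABELS) + ". Reply with the label only."},
            {"role": "user", "content": text[:8000]},  # crude truncation for long docs
        ],
        temperature=0,  # keep the output deterministic-ish
    )
    return response.choices[0].message.content.strip()
```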


For documents that are mostly pretty clean, you are probably right. The ceiling for AI/ML is definitely higher, though, and it's very useful right now if you know specifically what type of document to expect but expect it to be messy.


Do you (or anyone else) know of a good open-source OCR for PDFs and images?


OCRmyPDF is the typical answer: https://github.com/ocrmypdf/OCRmyPDF

It uses Tesseract under the hood. Results tend to just be OK in my experience.
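
If you'd rather script it than shell out, it also has a Python API mirroring the CLI flags. A minimal sketch (filenames are placeholders):

```python
# pip install ocrmypdf (Tesseract must be installed on the system)
import ocrmypdf

ocrmypdf.ocr(
    "input.pdf",
    "output.pdf",
    language="eng",
    deskew=True,      # straighten crooked scans before OCR
    force_ocr=True,   # rasterize and re-OCR even if a text layer already exists
)
```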


Tesseract is amazing. It is simple and generally gives good results.


I was just playing with tesseract last week (I'd used it years ago) and wasn't too happy. I had a pretty simple pdf that was in what you could think of as an old typewritten font, but easily legible, and I got all kinds of word fragments and nonsense characters in the output. I know that high quality ocr systems include a language model to coerce the read text into the most probable words. Is tesseract just supposed to be the first stage of such a system?

I'll note that when I put the tesseract output into chatgpt and prompted it saying it was ocr'd text and asking to clean it up, it worked very well.
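
That two-stage pipeline is easy to wire up, roughly like this (a sketch using pytesseract plus the OpenAI client; the filename and model name are placeholders):

```python
# pip install pytesseract pillow openai (Tesseract must be installed on the system)
import pytesseract
from PIL import Image
from openai import OpenAI

raw = pytesseract.image_to_string(Image.open("page.png"))  # stage 1: raw OCR

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "The user message is OCR output containing errors. Fix obvious "
                    "misreads and word fragments; do not add new content."},
        {"role": "user", "content": raw},
    ],
    temperature=0,
)
print(response.choices[0].message.content)  # stage 2: cleaned-up text
```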


I was just processing a document with tesseract & ocrmypdf, and noticed two things:

My first time processing it, I used `ocrmypdf --redo-ocr` because it looked like there was some existing OCR. The result was crap because ocrmypdf didn't realize the existing text layer was OCR; it treated it as real text that should be kept. This was fixable using `ocrmypdf --force-ocr`.

Before realizing this, I discovered that Tesseract 4 & 5 use a neural network-based recognition. I then came across this step-by-step guide on fine-tuning Tesseract for a specific document set: https://www.statworx.com/en/content-hub/blog/fine-tuning-tes...

I didn't end up following the fine-tuning process because at this point `ocrmypdf --force-ocr` worked excellently, but I thought the draw_box_file_data.py script from their example was particularly useful: https://gist.github.com/flaviut/d901be509425098645e4ae527a9e...


FWIW, I'm using Google's ML Kit which runs completely on-device and doesn't send the documents to Google. It works better than tesseract for my use case.

I did a presentation on the topic recently: https://clis-everywhere.k8s.best/16

I'll soon make the stack open source, but it shouldn't be hard to recreate given the inputs I've already provided.


I could never get tesseract to give good output. It baffles me when people say it's good. Do I need to tune it somehow or something?


Tesseract is generally the best overall for typed documents, though it struggles with handwriting. TrOCR is better than Tesseract, especially with handwriting, but requires a GPU to have any speed. Tesseract in my tests was roughly 200x faster than TrOCR (not an exaggeration).


When I was evaluating options a few months ago I found https://github.com/PaddlePaddle/PaddleOCR to be a very strong contender for my use case (reading product labels), but you'll definitely want to put together some representative docs/images and test a bunch of solutions to see what works for you.
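
For anyone evaluating it, basic usage looks roughly like this (a sketch; the image path is a placeholder):

```python
# pip install paddleocr paddlepaddle
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")  # angle classifier helps with rotated text
result = ocr.ocr("label.jpg", cls=True)  # returns one list of recognized lines per page
for bbox, (text, confidence) in result[0]:
    print(f"{confidence:.2f}  {text}")
```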


Did you end up selecting it? Our team is using PaddleOCR with good results! We wrote an article (https://medium.com/adevinta-tech-blog/text-in-image-2-0-impr...) about it.


Not open but free - Apple’s


Which Apple product?


If you want to OCR specific text you can use TextSniper on Mac and draw a box around whatever part of the screen you want to capture. I'm guessing under the hood it's just using Apple's OCR tech, which does work very well (at least if you're on Apple Silicon; it's not quite so fast on my 2015 Intel MacBook Pro).


Two places I use it: Preview on my Mac and Photos on my phone. Haven't seen an API yet.



VNRecognizeTextRequest and DataScannerViewController


EasyOCR was the best of the bunch; however, it still leaves quite a bit to be desired.


Developments in this space are coming really fast, and reading words is squarely within the capabilities of neural engines. 5 years is a very long time in AI years.


How does it compare to something like https://docalysis.com?


I think the online demos have been fine-tuned to work on receipts.


Five years? Where’s that number coming from?

This seems like exactly the kind of problem that will see rapid improvements as people point more LLMs at multimodal input.

Right now making predictions for ML capabilities on a five year timeframe seems foolhardy.



