
Do you (or anyone else) know which would be a good open source OCR for PDFs and images?


OCRmyPDF is the typical answer: https://github.com/ocrmypdf/OCRmyPDF

It uses Tesseract under the hood. Results tend to just be OK in my experience.


Tesseract is amazing. It is simple and generally gives good results.


I was just playing with Tesseract last week (I'd used it years ago) and wasn't too happy. I had a pretty simple PDF set in what you could think of as an old typewriter font, easily legible, and I got all kinds of word fragments and nonsense characters in the output. I know that high-quality OCR systems include a language model to coerce the recognized text into the most probable words. Is Tesseract just supposed to be the first stage of such a system?

I'll note that when I pasted the Tesseract output into ChatGPT, told it the text was OCR'd, and asked it to clean it up, it worked very well.
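For anyone wondering what "coerce the read text into the most probable words" can look like in its simplest form, here is a minimal sketch. It is not how Tesseract or ChatGPT does it: it swaps in plain fuzzy matching from Python's standard library `difflib` against a word list you supply. The function name `clean_ocr_text`, the `vocabulary` argument, and the `cutoff` value are all assumptions made for illustration.

```python
import difflib
import re


def clean_ocr_text(text, vocabulary, cutoff=0.8):
    """Replace unknown tokens with the closest vocabulary word, if one is close enough.

    This is a crude stand-in for a language model: it looks at each token in
    isolation and ignores context entirely.
    """
    vocab = sorted({w.lower() for w in vocabulary})
    vocab_set = set(vocab)

    def fix(match):
        word = match.group(0)
        lower = word.lower()
        # Leave known words and plain numbers untouched.
        if lower in vocab_set or word.isdigit():
            return word
        candidates = difflib.get_close_matches(lower, vocab, n=1, cutoff=cutoff)
        if not candidates:
            return word  # nothing similar enough; keep the OCR output as-is
        best = candidates[0]
        # Preserve a leading capital from the original token.
        return best.capitalize() if word[0].isupper() else best

    # Tokens are runs of letters and digits, so digit-for-letter
    # confusions such as "0" for "o" stay inside one token.
    return re.sub(r"[A-Za-z0-9]+", fix, text)


if __name__ == "__main__":
    words = ["optical", "character", "recognition"]
    print(clean_ocr_text("Optical character recogniti0n", words))
```

A lower `cutoff` fixes more errors but also starts "correcting" names and rare words into the wrong thing, which is the main reason real systems use context rather than per-word matching.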


I was just processing a document with Tesseract & ocrmypdf, and noticed two things:

The first time I processed it, I used `ocrmypdf --redo-ocr` because it looked like there was some existing OCR. After processing, the OCR was still crap: ocrmypdf didn't recognize the existing text layer as OCR output and treated it as real text in the document that should be kept. This was fixable using `ocrmypdf --force-ocr`.

Before realizing this, I discovered that Tesseract 4 & 5 use neural-network-based recognition. I then came across this step-by-step guide on fine-tuning Tesseract for a specific document set: https://www.statworx.com/en/content-hub/blog/fine-tuning-tes...

I didn't end up following the fine-tuning process because at this point `ocrmypdf --force-ocr` worked excellently, but I thought the draw_box_file_data.py script from their example was particularly useful: https://gist.github.com/flaviut/d901be509425098645e4ae527a9e...
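If you want to script the choice between those modes, a small wrapper along these lines might help. It is a sketch that assumes `ocrmypdf` is installed and on your PATH; the helper names `build_ocrmypdf_command` and `run_ocrmypdf` and the `mode` keys are made up for this example, while `--skip-text`, `--redo-ocr`, `--force-ocr`, and `-l` are real ocrmypdf options.

```python
import subprocess

# Hypothetical short names mapped to real ocrmypdf flags.
OCR_MODES = {
    "skip": "--skip-text",   # leave pages that already contain text alone
    "redo": "--redo-ocr",    # replace text ocrmypdf detects as earlier OCR
    "force": "--force-ocr",  # rasterize every page and OCR it from scratch
}


def build_ocrmypdf_command(input_pdf, output_pdf, mode="force", language="eng"):
    """Return the argv list for an ocrmypdf run without executing it."""
    if mode not in OCR_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(OCR_MODES)}")
    return ["ocrmypdf", OCR_MODES[mode], "-l", language, input_pdf, output_pdf]


def run_ocrmypdf(input_pdf, output_pdf, mode="force", language="eng"):
    """Run ocrmypdf and raise CalledProcessError if it exits non-zero."""
    command = build_ocrmypdf_command(input_pdf, output_pdf, mode, language)
    subprocess.run(command, check=True)
```

Calling `run_ocrmypdf("scan.pdf", "scan-ocr.pdf", mode="force")` corresponds to the `--force-ocr` run described above. Keep in mind that forcing OCR rasterizes the pages, so it is best reserved for files where the existing text layer is known to be bad.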


FWIW, I'm using Google's ML Kit which runs completely on-device and doesn't send the documents to Google. It works better than tesseract for my use case.

I did a presentation on the topic recently: https://clis-everywhere.k8s.best/16

I'll soon make the stack open source, but it shouldn't be hard to recreate given the inputs I've already provided.


I could never get tesseract to give good output. It baffles me when people say it's good. Do I need to tune it somehow or something?


Tesseract is generally the best overall choice for typed documents, though it struggles with handwriting. TrOCR is better than Tesseract, especially with handwriting, but it needs a GPU to run at any reasonable speed. In my tests, Tesseract was roughly 200x faster than TrOCR (not an exaggeration).


When I was evaluating options a few months ago I found https://github.com/PaddlePaddle/PaddleOCR to be a very strong contender for my use case (reading product labels), but you'll definitely want to put together some representative docs/images and test a bunch of solutions to see what works for you.
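One way to make that comparison less subjective is to hand-transcribe a few representative documents and score each engine's output by character error rate (CER). The sketch below assumes you have already run each engine yourself and collected its text; `rank_engines`, the dictionary layout, and the engine names in the usage note are illustrative assumptions, not part of any OCR library.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by reference length; 0.0 means a perfect match."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def rank_engines(ground_truth, outputs):
    """Rank engines by mean CER, best first.

    ground_truth: {doc_id: reference_text}
    outputs: {engine_name: {doc_id: ocr_text}}; a missing doc counts as empty output.
    """
    scores = {}
    for engine, docs in outputs.items():
        rates = [
            character_error_rate(reference, docs.get(doc_id, ""))
            for doc_id, reference in ground_truth.items()
        ]
        scores[engine] = sum(rates) / len(rates)
    return sorted(scores.items(), key=lambda item: item[1])
```

For example, `rank_engines({"label1": "hello"}, {"engine_a": {"label1": "hello"}, "engine_b": {"label1": "hallo"}})` puts `engine_a` first with a mean CER of 0.0. CER treats every character equally, so for something like product labels you may also want to check separately whether the fields you actually care about came through intact.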


Did you finally select it? Our team is using PaddleOCR with good results! We wrote an article (https://medium.com/adevinta-tech-blog/text-in-image-2-0-impr...) about it.


Not open source, but free: Apple’s.


Which Apple product?


If you want to OCR specific text, you can use TextSniper on Mac and draw a box around whatever part of the screen you want to capture. I'm guessing that under the hood it's just using Apple's OCR tech, which does work very well (at least on Apple Silicon; it's not quite so fast on my 2015 Intel MacBook Pro).


Two places I use it: Preview on my Mac, Photos on my phone. Haven’t seen an API yet.



The APIs are VNRecognizeTextRequest (Vision framework) and DataScannerViewController (VisionKit).


EasyOCR was the best of the bunch; however, it still leaves quite a bit to be desired.



