Do you (or anyone else) know which would be a good open source OCR for PDFs and ...

lelandfe · on May 29, 2023

OCRmyPDF is the typical answer: https://github.com/ocrmypdf/OCRmyPDF

It uses Tesseract under the hood. Results tend to just be OK in my experience.

xrd · on May 29, 2023

Tesseract is amazing. It is simple and generally gives good results.

version_five · on May 29, 2023

I was just playing with tesseract last week (I'd used it years ago) and wasn't too happy. I had a pretty simple pdf that was in what you could think of as an old typewritten font, but easily legible, and I got all kinds of word fragments and nonsense characters in the output. I know that high quality ocr systems include a language model to coerce the read text into the most probable words. Is tesseract just supposed to be the first stage of such a system?

I'll note that when I put the tesseract output into chatgpt and prompted it saying it was ocr'd text and asking to clean it up, it worked very well.
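For anyone curious what "coerce the read text into the most probable words" can look like in its simplest form, here is a toy sketch using only Python's standard library. It is not what Tesseract or ChatGPT does internally; it just snaps each token to the nearest entry in a word list. The vocabulary, the `cutoff` value, and the function name `snap_to_vocab` are all assumptions made up for illustration.

```python
import difflib

# Hypothetical vocabulary; a real system would use a large lexicon or a language model.
VOCAB = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]


def snap_to_vocab(text, vocab=VOCAB, cutoff=0.7):
    """Replace each token with its closest vocabulary word, if one is close enough.

    Tokens with no sufficiently similar vocabulary word are left unchanged.
    Matched tokens come back in the vocabulary's (lowercase) spelling.
    """
    fixed = []
    for token in text.split():
        matches = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=cutoff)
        fixed.append(matches[0] if matches else token)
    return " ".join(fixed)


if __name__ == "__main__":
    print(snap_to_vocab("the qu1ck br0wn fox"))
```

Word-by-word matching like this ignores context, which is presumably a large part of why a full language model cleans up OCR output so much better.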

alright2565 · on May 29, 2023

I was just processing a document with tesseract & ocrmypdf, and two things:

My first time processing it, I used `ocrmypdf --redo-ocr` because it looked like there was some existing OCR. After processing, the OCR was crap because ocrmypdf didn't realize it was OCR but thought it was real text in the document that should be kept. This was fixable using `ocrmypdf --force-ocr`.
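If you are scripting this, a minimal sketch of switching between the two flags might look like the following. The `--redo-ocr` and `--force-ocr` flags are the ones mentioned above; the helper names, the `mode` parameter, and the file names are hypothetical, and it assumes `ocrmypdf` is installed and on your PATH.

```python
import subprocess


def build_ocrmypdf_cmd(input_pdf, output_pdf, mode="force"):
    """Build an ocrmypdf command line as an argument list.

    mode="redo"  -> --redo-ocr  (tries to replace only an existing OCR layer)
    mode="force" -> --force-ocr (runs OCR on every page regardless of existing text)
    """
    flags = {"redo": "--redo-ocr", "force": "--force-ocr"}
    if mode not in flags:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["ocrmypdf", flags[mode], input_pdf, output_pdf]


def run_ocr(input_pdf, output_pdf, mode="force"):
    # Raises CalledProcessError if ocrmypdf exits with a non-zero status.
    subprocess.run(build_ocrmypdf_cmd(input_pdf, output_pdf, mode), check=True)


if __name__ == "__main__":
    # Hypothetical file names; this only prints the command without running it.
    print(build_ocrmypdf_cmd("scan.pdf", "scan_ocr.pdf"))
```

Keeping the command construction separate from `subprocess.run` makes it easy to check what will be executed before pointing it at a real document.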

Before realizing this, I discovered that Tesseract 4 & 5 use a neural network-based recognition. I then came across this step-by-step guide on fine-tuning Tesseract for a specific document set: https://www.statworx.com/en/content-hub/blog/fine-tuning-tes...

I didn't end up following the fine-tuning process because at this point `ocrmypdf --force-ocr` worked excellently, but I thought the draw_box_file_data.py script from their example was particularly useful: https://gist.github.com/flaviut/d901be509425098645e4ae527a9e...

denysvitali · on May 29, 2023

FWIW, I'm using Google's ML Kit which runs completely on-device and doesn't send the documents to Google. It works better than tesseract for my use case.

I did a presentation on the topic recently: https://clis-everywhere.k8s.best/16

I'll soon make the stack open source, but it shouldn't be hard to recreate given the inputs I've already provided.

dataflow · on May 29, 2023

I could never get tesseract to give good output. It baffles me when people say it's good. Do I need to tune it somehow or something?

driscoll42 · on May 29, 2023

Tesseract is generally the best overall for typed documents, though it struggles with handwriting. TrOCR is better than Tesseract, especially with handwriting, but requires a GPU to run at any reasonable speed. In my tests, Tesseract was roughly 200x faster than TrOCR (not an exaggeration).

kcorbitt · on May 29, 2023

When I was evaluating options a few months ago I found https://github.com/PaddlePaddle/PaddleOCR to be a very strong contender for my use case (reading product labels), but you'll definitely want to put together some representative docs/images and test a bunch of solutions to see what works for you.
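One way to make that kind of comparison concrete is to hand-transcribe a few representative samples and score each engine's output by character error rate (CER). A rough sketch, assuming you already have each engine's text output as a string; the engine names and sample strings below are made up, and nothing here calls a real OCR library.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance divided by reference length (0.0 is a perfect match)."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def rank_engines(reference, outputs):
    """Return (name, CER) pairs sorted best first. `outputs` maps name -> OCR text."""
    scores = {name: char_error_rate(reference, text) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    truth = "Net weight 500 g"
    # Made-up outputs standing in for real engine results.
    outputs = {"engine_a": "Net weight 500 g", "engine_b": "Nct weigbt 5OO g"}
    for name, cer in rank_engines(truth, outputs):
        print(f"{name}: CER={cer:.3f}")
```

Averaging CER over a few dozen samples from your own documents will likely tell you more than any general benchmark.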

joaquincabezas · on May 29, 2023

Did you finally select it? Our team is using PaddleOCR with good results! We wrote an article (https://medium.com/adevinta-tech-blog/text-in-image-2-0-impr...) about it.

wahnfrieden · on May 29, 2023

Not open but free - Apple’s

occamrazor · on May 29, 2023

Which Apple product?

vosper · on May 29, 2023

If you want to OCR specific text you can use TextSniper on Mac and draw a box around whatever part of the screen you want to capture. I'm guessing under the hood it's just using Apple's OCR tech, which does work very well (at least if you're on Apple Silicon; it's not quite so fast on my 2015 Intel MacBook Pro).

imaurer · on May 29, 2023

Two places I use it: Preview on my Mac, photos on my phone. Haven’t seen an api yet.

momo93 · on May 29, 2023

https://developer.apple.com/documentation/vision/recognizing...

wahnfrieden · on May 29, 2023

VNRecognizeTextRequest and DataScannerViewController

bobosha · on May 29, 2023

EasyOCR was the best of the bunch; however, it still leaves quite a bit to be desired.