Pdfsandwich (tobias-elze.de)
354 points by blacksqr on Nov 5, 2021 | 63 comments



Using Pdfsandwich in college was like having a superpower. We would often be given PDFs with only image data. While my peers were still scrolling through and copying quotes by hand, I was there in seconds with Ctrl-F to find and copy/paste.

Once you have text in the PDF, you can use any sort of text analysis tools. You can use tools to convert it to plain text and grep through, or anything else you want.
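As a rough sketch of the plain-text route: Poppler's `pdftotext` can dump a PDF's text layer to stdout, with pages separated by form-feed characters, so a few lines of Python can report which pages mention a term. The helper names `pages_matching` and `search_pdf` are made up for illustration, and the sketch assumes `pdftotext` is installed.

```python
import subprocess
from typing import List


def pages_matching(text: str, needle: str) -> List[int]:
    """Return 1-based page numbers whose text contains `needle` (case-insensitive).

    Assumes pages are separated by form feeds, as in pdftotext output.
    """
    needle = needle.lower()
    pages = text.split("\f")
    return [i for i, page in enumerate(pages, start=1) if needle in page.lower()]


def search_pdf(pdf_path: str, needle: str) -> List[int]:
    """Extract text with Poppler's pdftotext (assumed installed) and search it."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        check=True,
        capture_output=True,
        text=True,
    )
    return pages_matching(result.stdout, needle)
```

Calling `search_pdf("lecture.pdf", "photosynthesis")` on a hypothetical file would return the list of pages to jump to.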

That being said, it's not perfect, but still pretty awesome. Sometimes the spacing was off or it would confuse symbols like 1, I, or l. But these are minor and usually only on poorly scanned PDFs.


Here's a tangentially related fun fact! Before you take the uniform bar examination in New York, you first have to take an at-home section called the New York Law Examination. There is a book[1] that covers all the New York specific law that could be on the examination. It used to be provided as a simple PDF, where you could potentially search it, but people seemed to feel it made the test too easy - since everything was in that book. So they made it an image PDF.

Then, they put a warning on their site saying that they explicitly consider it to be misconduct to search the book (e.g., by making the book searchable with OCR):

> The NYLC/NYLE Course Materials are locked in a non-searchable format in accordance with the Board’s misconduct rule prohibiting candidates from electronically searching the Course Materials when taking the NYLE. If a candidate, because of a disability, uses a screen reader to access written material, please contact the Board office by phone (518-453-5990), mail, or fax.

Kind of silly, sure! They don't want you searching the book during the exam, but they're fine with you going through it. And, following passing the bar exam, there is a character and fitness process, so students are fairly terrified of doing anything unethical, particularly things the bar explicitly says are unethical. So, it's basically an honor system, but with a big stick (although I haven't heard of any actual enforcement). If you OCR it, brag about it to your friends, and your friends really hate you, I guess they could report it.

[1]https://www.nybarexam.org/Content/NewYorkCourseMaterials.pdf


As mentioned in your quote, there’s a special version intended for disabled candidates, with larger text and searchable content so as to be compatible with screen readers. The version is freely available (not publicly but indexed by search engines) on their website.


On an even more macro level I've had a great experience with ripgrep-all[0], which uses Tesseract internally.

I have e.g. a directory with all weekly lecture slides for one lecture, and can directly find where (both file and page) we learned something related to photosynthesis via `rga photosynthesis`.

[0]: https://github.com/phiresky/ripgrep-all


rga is one of my favorite tools ever.


I've read books that were scanned with OCR. You learn to deal with typical errors that occur, but overall the texts were mostly accurate.


Interesting. I've used ocrmypdf for this a lot - https://ocrmypdf.readthedocs.io/en/latest/


I tried both recently on a scanned color document. pdfsandwich gave me really unpleasant, monochrome, and blown out results; ocrmypdf did what I expected, giving me a searchable pdf.


Looks great.. I'm just trying to find the Windows binary?



This Docker app looks useful, thank you.


And people say Linux doesn't have any programs :P


I've been using ocrmypdf on Windows through WSL very successfully. Works perfectly.


Looks like there isn't any


It's great and I use it almost every day: paperless-ng uses it in the background.


I have to keep hundreds of documents I receive each year for five or ten years, some indefinitely: invoices, statements, contracts, etc. Quite a few of these documents are 1 to 2 cm thick. My accountant organizes them in folders sorted by the source of the document and collected into separate file cabinets and drawers.

During covid, I stopped having my accountant come to my home office to do this work. After a great deal of experimentation, I resorted to purchasing an expensive Epson (approximately $600) document scanner that reliably does full duplex scans very quickly. The documents are stamped with a serial number (e.g. 2021-207) and stored in a single directory on my system with a file name that corresponds to the serial number (i.e. 2021-207.pdf).

The Epson hardware is great, the software is just barely adequate. It processes one document at a time, and the Epson application's OCR running on an M1 Mac mini can't keep up with the scanner and slows the whole process down somewhat. I would like to batch process the scanned documents in the background to convert the pdfs generated by the scanner into "searchable" pdfs. (Pdf files produced by scanners have an image layer but no text layer underneath it. Optional OCR done post-scan then adds the text layer.) I've tried a number of other OCR applications. One of the best is Adobe Acrobat Pro; it has the Adobe Pro kind of price, unfortunately, and does a million things I don't need.
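A background batch step along these lines could be sketched as follows, using OCRmyPDF (recommended elsewhere in this thread) as the OCR engine. This is only a sketch under assumptions: the directory layout and the helper names `plan_batch` and `run_batch` are hypothetical, and it presumes the `ocrmypdf` command is on the PATH and that the scanner output has no text layer yet.

```python
import subprocess
from pathlib import Path
from typing import List


def plan_batch(scan_dir: Path, out_dir: Path) -> List[List[str]]:
    """Build one ocrmypdf command per scanned PDF that has no output yet."""
    commands = []
    for src in sorted(scan_dir.glob("*.pdf")):
        dst = out_dir / src.name
        if dst.exists():
            continue  # already processed in an earlier run
        # Basic OCRmyPDF invocation: ocrmypdf INPUT OUTPUT
        commands.append(["ocrmypdf", str(src), str(dst)])
    return commands


def run_batch(scan_dir: Path, out_dir: Path) -> None:
    """Run the planned commands one at a time, stopping on the first failure."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for cmd in plan_batch(scan_dir, out_dir):
        subprocess.run(cmd, check=True)
```

Because finished files are skipped, `run_batch` can be re-run from a scheduled job without redoing earlier scans.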

Back to my filing system. I keep each year's physical documents in one file drawer sorted by serial number. Because the documents are stamped with a serial number before being scanned, I can always find the physical document easily if I am looking at the pdf. Furthermore, because my pdfs are searchable I can quickly locate a bill or a tax document by a relevant name or even a particular amount (like, where did this $1,808.17 discrepancy come from).

Is this perfect? No, far from it. Many little irritations afflict the actual process. Many statements have large amounts of small barely readable disclosures and footnotes, sometimes in faint small fonts. This is largely useless, slows down the OCR, and increases the file sizes. Barcodes, QRCodes, and DataMatrix codes often appear on the first page of these documents or even every page of the documents. It would be great if these were somehow scanned and used to tag the documents. The Epson software insists on embedding spaces in the generated file names and doesn't allow me to use auto generated ISO dates in file names, which makes working with the files from the command line less than ideal. (File names like "2021-Aug-07 122.pdf" are user friendly but not so friendly for scripts or sort commands.) I use several different configurations for the scanner, the software supports it, but I have to pay attention to pick landscape and double-sided when needed.
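The file-name irritation could be worked around after the fact with a small renaming script. The sketch below assumes every name follows the single example given ("2021-Aug-07 122.pdf": year, English month abbreviation, day, a space, then a number); that pattern, the underscore separator, and the function names are guesses for illustration, not something the Epson software documents.

```python
import re
from pathlib import Path

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Matches names like "2021-Aug-07 122.pdf" (assumed pattern, see above).
PATTERN = re.compile(r"^(\d{4})-([A-Z][a-z]{2})-(\d{2}) (\d+)\.pdf$")


def script_friendly(name: str) -> str:
    """Return an ISO-dated, space-free name, or the input unchanged if it doesn't match."""
    match = PATTERN.match(name)
    if not match or match.group(2) not in MONTHS:
        return name
    year, month, day, counter = match.groups()
    return f"{year}-{MONTHS[month]:02d}-{day}_{counter}.pdf"


def rename_all(directory: Path) -> None:
    """Rename every matching PDF in `directory` in place."""
    for path in list(directory.glob("*.pdf")):
        new_name = script_friendly(path.name)
        if new_name != path.name:
            path.rename(path.with_name(new_name))
```

Names produced this way sort chronologically with a plain `sort` and need no quoting on the command line.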

Thank you HN for the many suggested solutions to the OCR issues. It gives me hope that I'll be able to wire together something better than I've got now.


I've been a fan of k2pdfopt [1] for years. Single binary, command line + optional gui. Lets you slice and optimize pdfs in any imaginable way, e.g. for e-readers' screens. I think it also does what pdfsandwich does, if I understand things correctly [2].

That said, pdfsandwich's 'one thing well' approach does have an appeal. I will definitely try it out, thanks for posting. Something in its "logo" reminded me of the OpenBSD fish. :)

1: https://www.willus.com/k2pdfopt/

2: https://www.willus.com/k2pdfopt/help/ocr.shtml


Could someone explain what this does? The explanation paragraph does not make sense to me grammatically.


As far as I understand, it takes a PDF that contains only image data (e.g. a scan of a piece of paper) and uses OCR to recognize the text, then overlays the text on top of the image in the output PDF.

It would allow you to take a physically scanned document and create a PDF with selectable text you could copy+paste, search over, etc.


Text is behind the image, as the first 'graph of TFA notes:

> the text will be added to each page invisibly "behind" the images.


It calls a series of open source tools that together produce a pdf with text embedded behind an image overlay, where the image overlay is the original pdf. It was a while ago that I really looked into this, but to name a few:

ImageMagick to convert the pages to images

Tesseract-ocr by Google to transcribe the text in the images, which puts its output into singular pdf files

Pdfunite to stitch together the pdfs back into a whole file

I’m sure I’m missing a few, iirc it can call a tool that straightens the pages as well.

EDIT: Messed around and remembered the stuff:

where a.pdf is a 2 page PDF:

>convert a.pdf a.png

makes a-0.png and a-1.png

OCR's each image:

>for x in {0..1} ; do tesseract a-$x.png a_ocr-$x pdf ; done ;

combines them into 1 PDF:

>pdfunite a_ocr-{0..1}.pdf a_ocr_combined.pdf
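The same three steps can be sketched for an arbitrary page count in Python. This mirrors the commands above rather than pdfsandwich's actual internals; the function names are made up, and it assumes ImageMagick's `convert`, `tesseract`, and `pdfunite` are all installed and that the input has at least two pages (ImageMagick only adds the `-0`, `-1`, ... suffixes for multi-page input).

```python
import subprocess
from typing import List


def sandwich_commands(stem: str, n_pages: int) -> List[List[str]]:
    """Build the convert / tesseract / pdfunite commands for `<stem>.pdf`."""
    # Step 1: rasterize every page to <stem>-0.png, <stem>-1.png, ...
    cmds = [["convert", f"{stem}.pdf", f"{stem}.png"]]
    # Step 2: OCR each page image; the "pdf" config makes tesseract
    # write a searchable one-page PDF named <stem>_ocr-<i>.pdf
    for i in range(n_pages):
        cmds.append(["tesseract", f"{stem}-{i}.png", f"{stem}_ocr-{i}", "pdf"])
    # Step 3: stitch the one-page PDFs back together in page order
    page_pdfs = [f"{stem}_ocr-{i}.pdf" for i in range(n_pages)]
    cmds.append(["pdfunite", *page_pdfs, f"{stem}_ocr_combined.pdf"])
    return cmds


def run(stem: str, n_pages: int) -> None:
    """Execute the commands in order, stopping at the first failure."""
    for cmd in sandwich_commands(stem, n_pages):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to print the plan first and check it before touching any files.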


It will make the “text” on a pdf with only images searchable and selectable.


It converts images to text.

The input is a scanned PDF. The output is the same PDF with the recognized text on top, in a transparent font.

Copy and paste now works because when you click the PDF you are selecting the transparent text.


I came here to say this. I read it 4 times before giving up.


I wonder if we (like in "humanity") can live without PDF at all? The format has so many flaws and problems it is practically unusable for any task except producing a paper copy, which is progressively obsolete with all the displays and e-paper readers and mobile phones.


PDF is quite good at producing static output that will render predictably, and there aren't really any common alternatives for this. PDF becomes very problematic when you attempt to do anything else with it though, like edit it or parse it as input.


Or like reading it on a screen that isn't the size of printer paper.


yeah, consistency is very important in design. imagine sending your boss PSD or HTML files but their machines don't have fonts you use in your design.


Just like with PDF, you can embed fonts in HTML so everyone has the same font.

But yeah, agree with your overall point, PDF is currently the easiest, most well-supported way of sending stuff that doesn't fudge around with the design. What you send someone is almost guaranteed to be what they see, unless they use some weird PDF reader.


> you can embed fonts in HTML so everyone has the same font.

How often do people embed fonts in the html via a font-face + base64 ttf or woff data-uri?
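For anyone unfamiliar with the technique in question, here is a minimal sketch of generating such a rule. The function name and the small format table are illustrative assumptions; the point is only that the font file ends up inlined in the CSS instead of being fetched separately.

```python
import base64

# Maps a file type to (MIME type, CSS format() hint).
FORMATS = {
    "woff2": ("font/woff2", "woff2"),
    "woff": ("font/woff", "woff"),
    "ttf": ("font/ttf", "truetype"),
}


def font_face_css(family: str, font_bytes: bytes, kind: str = "woff2") -> str:
    """Return an @font-face rule with the font inlined as a base64 data URI."""
    mime, hint = FORMATS[kind]
    payload = base64.b64encode(font_bytes).decode("ascii")
    return (
        "@font-face {\n"
        f'  font-family: "{family}";\n'
        f'  src: url("data:{mime};base64,{payload}") format("{hint}");\n'
        "}\n"
    )
```

The resulting rule can be pasted into a `<style>` element so the HTML file carries its own font, at the cost of a file roughly a third larger than the font itself because of base64 overhead.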


An epic thing you can do with PDF though is show an image of the font you want to show, and have a super common font like Helvetica hanging out behind it, so the end-user doesn't even need to render the font, it's just an image.


The idea of "getting away from paper" seems to be so challenging for many in its implementation that at that point they stopped thinking any further.

I work in a somewhat big corp. and PDF is viewed as the basis for the digitization of existing documents, be it for archiving or for the exchange of data with business partners. Also, we still have a fax machine in the company.

But, this is not just the opinion in my company. The public service also envisions sending and receiving digital documents using PDF as the future. E.g.: at the end of my digital tax return everything is summarized in a pdf (great), which then has to be signed and sent by post (lol).


Including security flaws, which is a big deal. I still have not figured out how to sanitize pdfs in any reasonable way. (If someone has any suggestions here, please share.)

Not to mention the need to save the file in the original Adobe Reader ("do you really want to overwrite this file?") every time you add a comment.


Maybe this is good enough for your use case:

https://dangerzone.rocks/

It converts documents to just images, then converts those back to PDF, all in a sandbox.


Thanks a lot! (Though it does not work for me.)


And don't even get me started on sustainability. I wonder how many trees we lose yearly because printing is the most convenient way to interact with a PDF document.


A paper copy is still one of the most fun ways to read


Not for me, since I got my first e-ink reader. Convenience and portability and access to content blows all the tactile and olfactory joys out of the water.


Years ago I tried tesseract and similar tools to do this but switched to using ABBYY FineReader because the OCR accuracy was way higher and into "usable" territory. Can anyone offer a recent comparison?


i have a question about OCR. is there a project that lets you define fields for doing ocr, i am thinking scanning invoices and defining that this line means the invoice number, this here means the item name, item rate, etc.

many OCR software can scan this but not "understand" it to use it. ABBYY has something like this for scanning invoices so is there something for the foss folks?


While trying to find a specific project I recalled, I encountered this list of projects which might be of interest: https://github.com/tstanislawek/awesome-document-understandi...

The project I had in mind was similar to this one but I can't remember the name currently: https://github.com/tabulapdf/tabula

However, if you're looking for a ML-based, invoice-specific project looks like the other comment to your reply might be more useful.



Windows only, but power automate lets you do this.


Not to get all technical, but the "Cube Rule of Food" would classify this project as "pdftoast", not "pdfsandwich":

https://cuberule.com/


I'd classify it more as a salad. https://saladtheory.github.io/ (some very interesting content there, and includes a section refuting the cube rule)


Pdfcalzone I think. Pop tarts appear to be explicitly called out as calzones towards the bottom of the page (enveloping the filling on all sides, though maybe there is an alternative interpretation if you consider frosting to be the point of reference).


> Salad

> Popular examples

> Steak

I love it


It’s kind of funny to open this site on an iPad and have iOS just let you select text from images (it does its own on-device OCR now), but I wish the layering described was standard on any PDF generation software (for indexing purposes).


I made a "cheap" frontend for testing out tesseract in multiple languages. It runs in the frontend via wasm and doesn't send any of the text anywhere.

Currently pdf is not supported, but I've considered adding that in.

https://testeract.netlify.app/


I am using scantools:

https://kebekus.gitlab.io/scantools/image2pdf/

Also worth a look, works great and also supports PDF/A and jbig for a nice compression of b/w scans


Really interesting project.

The source code seems to be on sourceforge.net. A site once important, but now when I see it I either think "the project is most-likely dead" or "can this project be legit? Am I getting malware here?"


> The source code seems to be on sourceforge.net. A site once important, but now when I see it I either think "the project is most-likely dead" or "can this project be legit? Am I getting malware here?"

This is such a sad thing for me. When I was a kid, Sourceforge.net is where I would always go first to look for software, because I knew that it was a reputable host and that open-source projects participated in a culture of greater respect for users than most freeware projects demonstrate.


Sourceforge changed hands and as far as I can tell, the new owners put some effort into getting rid of all the crap and even into protecting users from crap added by the original authors.

Filezilla, for example, initially participated in SourceForge's adware program under the old owner, but after the owner change, actually provided a clean version on Sourceforge (while the version on the website was and still is adware-bundled).


> Latest version is 0.1.7 (August 10, 2018).

Not saying anything from 2018 isn't valuable... I'm currently working on PDF scraping tools, and my lord, the stuff we're left to work with is abysmal... this could be state of the art.


SourceForge literally looks like one of those your-free-filez.ru style movie sites with all the fake download buttons. Same aesthetic. Just awful lol


A few years ago the company was sold and started bundling their own crap with the downloads, but apparently it was sold again and stopped doing that.


I remember this is one of those sites where you're not sure which one is the actual download button to click.


Adobe Acrobat X Pro is excellent in this regard, but proprietary and all that


A similar, perhaps more feature-rich tool is OCRmyPDF: https://github.com/jbarlow83/OCRmyPDF


I'm wondering, is it possible to make the invisible text layer visible and editable somehow? Would be great for proof reading and editing incorrect characters.


i have used this extensively and i love this software. you just supply a file and it crunches the numbers and you get an output.

there is something called scantailor if you are scanning books yourself. that gives you more control over the orientation and margins and ocr and contrast controls. that said, pdfsandwich gives you a minuscule file size which i could not achieve otherwise.


PSA: Latest macOS on Apple Silicon just lets you copy anything from images, so I assume it works on PDFs too.


ocrmypdf is a really good alternative and has been updated throughout the years. I highly recommend it.



