
It's very much possible to extract legible (as you put it) data from PDF. It just doesn't work 100% accurately 100% of the time, and different libraries have different levels of accuracy. That is due to issues inherent in the PDF format.

I know this because I worked on a project to extract text from PDF for a startup client. They first had me research various PDF text extraction libraries hands-on, evaluate them, and recommend one. Based on that recommendation and the report I gave them, they then had me use that library (from C) to extract text from a bunch of PDFs of financial data (think SEC data, EDGAR and the like; it was a fintech startup), and do further processing of the extracted text in Python.

During that work I found that the text extraction was sometimes inaccurate. I reported some bugs in the library (xpdf) to the company that makes it (Glyph and Cog), and they fixed them. From discussions with one of the technical founders of G&C, I learned about these inherent issues, such as characters sometimes getting transposed and junk characters creeping into the output.
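To give a rough idea of the kind of Python post-processing I mean, here is a minimal sketch of one cleanup pass for junk characters. It is not the code from that project; the function name `strip_junk` and the choice of which Unicode categories count as "junk" are my assumptions for illustration, and real extracted text usually needs rules tuned to the documents at hand.

```python
import unicodedata

# Unicode general categories treated as extraction debris in this sketch
# (an assumption, not a rule from any library):
# Cc = control, Cf = format, Co = private use, Cn = unassigned.
JUNK_CATEGORIES = {"Cc", "Cf", "Co", "Cn"}

# Whitespace control characters that are kept as-is.
KEEP = {"\n", "\t"}


def strip_junk(text: str) -> str:
    """Return text with likely junk characters removed (hypothetical helper)."""
    kept = []
    for ch in text:
        if ch in KEEP:
            kept.append(ch)
        elif ch == "\ufffd":
            # U+FFFD is the replacement character left by failed decoding.
            continue
        elif unicodedata.category(ch) in JUNK_CATEGORIES:
            continue
        else:
            kept.append(ch)
    return "".join(kept)


if __name__ == "__main__":
    sample = "Net in\x00come\ufffd 1,234\n"
    print(repr(strip_junk(sample)))
```

A pass like this only deals with stray characters. It does nothing about transposed characters, which need checks that know something about the expected content.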

Also, plug: I'm available for consulting work on PDF generation (using other libraries or xtopdf, my own toolkit for PDF generation from other formats), and for PDF text extraction work. Contact info is in my HN profile.



