
Building an app that extracts key information from PDFs + highlights citations.

You provide a PDF and a JSON schema defining what to extract, and it returns the extracted values, the citations and their precise locations in the document.
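To make the input/output shape concrete, here is a minimal sketch of what a schema and a cited result might look like. Everything in it is an assumption for illustration and not the app's actual API: the `Citation` and `ExtractedField` types, the field names, the bounding-box convention, and the `uncited_fields` helper are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Citation:
    page: int                                # 1-based page number (assumed convention)
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed to be in PDF points
    text: str                                # source text the value was taken from


@dataclass
class ExtractedField:
    value: object
    citations: list[Citation]


# Hypothetical JSON schema describing what to extract.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number", "total"],
}

# Hypothetical result: each value carries the citation(s) that back it.
response = {
    "invoice_number": ExtractedField(
        "INV-1042",
        [Citation(1, (72.0, 90.5, 180.0, 104.0), "Invoice No. INV-1042")],
    ),
    "total": ExtractedField(
        1250.0,
        [Citation(2, (400.0, 610.0, 470.0, 624.0), "Total: $1,250.00")],
    ),
}


def uncited_fields(schema: dict, response: dict) -> list[str]:
    """Return required fields that are missing or have no citation."""
    return [
        name
        for name in schema.get("required", [])
        if name not in response or not response[name].citations
    ]


print(uncited_fields(schema, response))
```

A check like `uncited_fields` is the kind of gate a verification workflow could run before accepting a result: any required value without a citation gets flagged for manual review.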

This is especially valuable in workflows where verifying LLM-extracted information is critical (e.g. legal and finance). It can handle complex layouts such as multiple columns and tables, as well as scanned documents.

Planning to offer this both as an API and a self-hosted option for organizations with strict data privacy requirements.

Screenshot: https://superdocs.io/highlight.png

Feel free to get in touch for a demo.



I've been working on some structured OCR tools recently (reading resume PDFs and allowing much more useful search filters over them than our ATS allows), and I've found Gemini with structured outputs capable of doing a fantastic job. I'm curious, do you have any rough pointers for how to do this self-hosted?



