
I'm part of the team that built LlamaParse. It's a net improvement compared to other PDF -> structured text extractors (I built several in the past, including https://github.com/axa-group/Parsr).

For character extraction, LlamaParse uses a mixture of OCR and character extraction from the PDF (it's the only parser I'm aware of that addresses some of the buggy PDF font issues; check the 'text' mode to see the raw document before reconstruction), and it uses a mixture of heuristics and machine learning models to reconstruct the document.

Once plugged into a recursive retrieval strategy, it lets you get SOTA results on question answering over complex documents (see notebook: https://github.com/run-llama/llama_parse/blob/main/examples/...).
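
If it helps to make the flow concrete, here is a minimal sketch of the parse-then-query path (it skips the recursive retrieval setup from the notebook; the file name, question, and API key are placeholders, and the imports assume recent llama-parse / llama-index releases):

    # Parse the PDF into markdown so tables and layout survive extraction,
    # then index the parsed documents and ask a question over them.
    from llama_parse import LlamaParse
    from llama_index.core import VectorStoreIndex

    parser = LlamaParse(api_key="llx-...", result_type="markdown")
    documents = parser.load_data("complex_report.pdf")  # placeholder file

    index = VectorStoreIndex.from_documents(documents)
    print(index.as_query_engine().query("What was the Q3 operating margin?"))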

AMA



For context: I’m an engineering manager for the production systems of one of the biggest mortgage companies. Think millions upon millions of PDFs of all kinds.

1. The comparison with the open source PDF libraries is rather strange. They are self-contained libraries, not ML-augmented services. Does this mean you plan to open source the underlying technology and offer it under the same licenses as pypdf or pymupdf?

2. How does this compare to AWS Kendra? I have one of the bigger deployments out there and am looking for alternatives.

3. How fast is the PDF extraction? In terms of microseconds, given that we pay for execution time.


Making something appear as if it might be an open source library when it is actually just a wrapper around a paid hosted service is endemic in the Python world in general and the LLM space in particular.


Feel free to try https://github.com/nlmatics/llmsherpa. It is fully open source - both client and server - and it is not ML-augmented, so it's very fast and cheap to run.


Thank you for sharing.


Out of interest - why are you looking to move away from Kendra? I haven't heard much good about it, but I'm keen to understand what your issues with it are.


It’s too expensive for the features it provides. The person who estimated the cost got it wrong and we’re paying 3x. But I can’t just rip it out because it powers a very important piece of software.


If you're serious about this, I'm working on a new startup that is 100% focused on using different techniques to help with this. I'd love to talk to you about your needs; it would give me more data points as I take this to market. My email is in my profile if you want to shoot me a message.


Sorry, but I’m in the process of building my own and can’t provide any info due to regulations. Best of luck.


We have had great success with Apache PDFBox.


I am confused by the benchmarks you provided.

(1) The "baseline" comparison was to PyPDF + Naive RAG. For the LlamaParse evaluation, you appear to have used a different RAG pipeline, called "recursive retrieval." Why not use the same pipeline to demonstrate the improvement from LlamaParse? Can you share the code to your evaluation for LlamaParse?

(2) I ran the benchmark for the PyPDF + Naive RAG solution, directly copying the code on the linked LlamaIndex repo [1]

I got very different numbers: mean_correctness_score: 3.941, mean_relevancy_score: 0.826, mean_faithfulness_score: 0.980

You reported: mean_correctness_score: 3.874, mean_relevancy_score: 0.844, mean_faithfulness_score: 0.667

Notably, the faithfulness score I measured for the baseline solution was actually higher than that reported for your proprietary LlamaParse based solution.

[1] https://github.com/run-llama/llama-hub/tree/main/llama_hub/l...
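
For anyone reproducing this, here is a rough sketch of what the PyPDF + naive RAG baseline amounts to (not the exact repo code; the file name and question are placeholders, and the imports assume recent pypdf / llama-index releases):

    # Naive extraction: concatenate plain text from every page (table layout is lost),
    # then naive RAG: default chunking + top-k retrieval over the raw text.
    from pypdf import PdfReader
    from llama_index.core import Document, VectorStoreIndex

    reader = PdfReader("report.pdf")  # placeholder file
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    index = VectorStoreIndex.from_documents([Document(text=text)])
    print(index.as_query_engine().query("What was net revenue in 2021?"))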


(jerry here)

Thanks for running through the benchmark! Just to clarify a few things: (1) The idea is that LlamaParse's markdown representation lends itself to the rest of LlamaIndex's advanced indexing/retrieval abstractions. Recursive retrieval is a fancy retrieval method designed to model documents with embedded objects, but it depends on good PDF parsing; naive PyPDF parsing can't be used with recursive retrieval. Our goal is to demonstrate the e2e RAG capabilities of LlamaParse + advanced retrieval vs. what you can build with a naive PDF parser.

(2) Since we use LLM-based evals, your correctness and relevancy metrics look to be consistent and within margin of error (and lower than our LlamaParse metrics). The faithfulness score seems way off though, and quite high on your side, so I'm not sure what's going on there. Maybe hop into our Discord and share the results in our channel?


One of the things I've been helping a team with is dealing with mountains of PPT decks, converted to PDF and then parsed/chunked/embedded into vector storage. It doesn't work that well, because a PPT is not a document. What are your thoughts on dealing with other formats that are first converted to PDF?


For PPT, chunking 'per page' often works quite well. With LlamaParse this means splitting on the "\n---\n" page separator token.
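
A minimal sketch of that per-page chunking, assuming a recent llama-index where TextNode lives in llama_index.core.schema (the helper name is made up for illustration):

    # Split LlamaParse markdown output on the page separator and build one node per slide/page.
    from llama_index.core.schema import TextNode

    def split_by_page(markdown_text: str) -> list[TextNode]:
        pages = markdown_text.split("\n---\n")
        return [
            TextNode(text=page.strip(), metadata={"page": i + 1})
            for i, page in enumerate(pages)
            if page.strip()
        ]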


I tried LlamaParse and was impressed by the result on a document with a complex layout. None of the open-source parsers gave me results that were even close. Can you please share how much time it took your team to build this parser?


Does it work with other filetypes converted into PDF? For example docx, ppt, png, etc.


Yes. However, we will soon support other filetypes natively, and this will lead to better results (when converting from one format to another, there is often some information loss).


Cool! Which OCR engine/model do you use?


EasyOCR; we may switch to PaddleOCR in the future.
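
For readers unfamiliar with it, EasyOCR's basic usage is roughly this (the image path is a placeholder):

    import easyocr

    reader = easyocr.Reader(["en"])  # downloads detection/recognition models on first run
    for bbox, text, confidence in reader.readtext("scanned_page.png"):
        print(f"{confidence:.2f}  {text}")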


You may want to try https://github.com/VikParuchuri/surya (I'm the author). I've only benchmarked against tesseract, but it outperforms it by a lot (benchmarks in repo). Happy to discuss.

You could also try https://github.com/VikParuchuri/marker for general PDF parsing (I'm also the author) - it seems like you're more focused on tables.


How does surya compare to AWS Textract? A previous employer went through a bunch of different OCRs and ended up using Textract because they found it to be the most accurate overall.


I unfortunately haven't had time to benchmark against more than tesseract.


That’s my experience as well. I am still looking for alternatives, but Textract is now the baseline.


Thanks for sharing.


PaddleOCR works pretty well. How are you planning to integrate it into your workflow? I found huge differences in throughput between Python serving and inference frameworks (e.g. NVIDIA Triton Inference Server).


Grateful for your insight! Could you explain the reason for the switch? Is there any benchmark data available for sharing?


Performance depends on the language and the type of docs. The main reason for contemplating the switch is that EasyOCR seems to not be maintained anymore (no commits in the repo in the last 5 months).


Can it detect and strip out advertisements?



