
I'm part of the team that built LlamaParse. It's a net improvement compared to other PDF -> structured text extractors (I built several in the past, including https://github.com/axa-group/Parsr).

For character extraction, LlamaParse uses a mixture of OCR and character extraction from the PDF (it's the only parser I'm aware of that addresses some of the buggy PDF font issues; check the 'text' mode to see the raw document before reconstruction), and it uses a mixture of heuristics and machine learning models to reconstruct the document.

Once plugged into a recursive retrieval strategy, it lets you get SOTA results on question answering over complex documents (see notebook: https://github.com/run-llama/llama_parse/blob/main/examples/...).
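
If it helps to make the flow concrete, here is a minimal sketch of the parse-then-query path (it skips the recursive retrieval setup from the notebook; the file name, question, and API key are placeholders, and the imports assume recent llama-parse / llama-index releases):

    # Parse the PDF into markdown so tables and layout survive extraction,
    # then index the parsed documents and ask a question over them.
    from llama_parse import LlamaParse
    from llama_index.core import VectorStoreIndex

    parser = LlamaParse(api_key="llx-...", result_type="markdown")
    documents = parser.load_data("complex_report.pdf")  # placeholder file

    index = VectorStoreIndex.from_documents(documents)
    print(index.as_query_engine().query("What was the Q3 operating margin?"))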

AMA



For context: I’m an engineering manager for the production systems of one of the biggest mortgage companies. Think millions upon millions of PDFs of all kinds.

1. The comparison with the open source PDF libraries is rather strange. They are self-contained libraries, not ML-augmented services. Does this mean you plan to open source the underlying technology and offer it under the same licenses as pypdf or pymupdf?

2. How does this compare to AWS Kendra? I have one of the bigger deployments out there and am looking for alternatives.

3. How fast is the PDF extraction? In terms of microseconds, given that we pay for execution time.


Making something appear as if it might be an open source library when it is actually just a wrapper around a paid hosted service is endemic in the Python world in general and the LLM space in particular.


Feel free to try https://github.com/nlmatics/llmsherpa. It is fully open source - both client and server - and it is not ML-augmented, so it's very fast and cheap to run.


Thank you for sharing.


Out of interest - why are you looking to move away from Kendra? I haven't heard much good about it, but I'm keen to understand what your issues with it are.


It’s too expensive for the features it provides. The person who estimated the cost got it wrong and we’re paying 3x. But I can’t just rip it out because it powers a very important piece of software.


If you're serious about this, I'm working on a new startup that is 100% focused on using different techniques to help with this. I'd love to talk to you about your needs; it would give me more data points as I take this to market. My email is in my profile if you want to shoot me a message.


Sorry, but I’m in the process of building my own and can’t provide any info due to regulations. Best of luck.


We have had great success with Apache PDFBox.


I am confused by the benchmarks you provided.

(1) The "baseline" comparison was to PyPDF + Naive RAG. For the LlamaParse evaluation, you appear to have used a different RAG pipeline, called "recursive retrieval." Why not use the same pipeline to demonstrate the improvement from LlamaParse? Can you share the code to your evaluation for LlamaParse?

(2) I ran the benchmark for the PyPDF + Naive RAG solution, directly copying the code on the linked LlamaIndex repo [1]

I got very different numbers: mean_correctness_score: 3.941, mean_relevancy_score: 0.826, mean_faithfulness_score: 0.980

You reported: mean_correctness_score: 3.874, mean_relevancy_score: 0.844, mean_faithfulness_score: 0.667

Notably, the faithfulness score I measured for the baseline solution was actually higher than that reported for your proprietary LlamaParse based solution.

[1] https://github.com/run-llama/llama-hub/tree/main/llama_hub/l...
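
For anyone reproducing this, here is a rough sketch of what the PyPDF + naive RAG baseline amounts to (not the exact repo code; the file name and question are placeholders, and the imports assume recent pypdf / llama-index releases):

    # Naive extraction: concatenate plain text from every page (table layout is lost),
    # then naive RAG: default chunking + top-k retrieval over the raw text.
    from pypdf import PdfReader
    from llama_index.core import Document, VectorStoreIndex

    reader = PdfReader("report.pdf")  # placeholder file
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    index = VectorStoreIndex.from_documents([Document(text=text)])
    print(index.as_query_engine().query("What was net revenue in 2021?"))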


(jerry here)

Thanks for running through the benchmark! Just to clarify a few things: (1) The idea is that LlamaParse's markdown representation lends itself to the rest of LlamaIndex's advanced indexing/retrieval abstractions. Recursive retrieval is a fancy retrieval method designed to model documents with embedded objects, but it depends on good PDF parsing; naive PyPDF parsing can't be used with recursive retrieval. Our goal is to demonstrate the e2e RAG capabilities of LlamaParse + advanced retrieval vs. what you can build with a naive PDF parser.

(2) Since we use LLM-based evals, your correctness and relevancy metrics look to be consistent and within margin of error (and lower than our LlamaParse metrics). The faithfulness score seems way off though, and quite high on your side, so I'm not sure what's going on there. Maybe hop into our Discord and share the results in our channel?


One of the things I've been helping a team with is dealing with mountains of PPT decks, converted to PDF and then parsed/chunked/embedded into vector storage. It doesn't work that well, because a PPT is not a document. What are your thoughts on dealing with other formats that are first converted to PDF?


For PPT, chunking 'per page' often works quite well. With LlamaParse this means splitting on the "\n---\n" page separator token.
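
A minimal sketch of that per-page chunking, assuming a recent llama-index where TextNode lives in llama_index.core.schema (the helper name is made up for illustration):

    # Split LlamaParse markdown output on the page separator and build one node per slide/page.
    from llama_index.core.schema import TextNode

    def split_by_page(markdown_text: str) -> list[TextNode]:
        pages = markdown_text.split("\n---\n")
        return [
            TextNode(text=page.strip(), metadata={"page": i + 1})
            for i, page in enumerate(pages)
            if page.strip()
        ]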


I tried LlamaParse and was impressed by the result on a document with a complex layout. None of the open-source parsers gave me results that were even close. Can you please share how much time it took your team to build this parser?


Does it work with other filetypes converted into PDF? For example docx, ppt, png, etc.


Yes. However, we will soon support other filetypes natively, and this will lead to better results (when converting from one format to another, there is often some information loss).


Cool! Which OCR engine/model do you use?


EasyOCR; we may switch to PaddleOCR in the future.
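
For readers unfamiliar with it, EasyOCR's basic usage is roughly this (the image path is a placeholder):

    import easyocr

    reader = easyocr.Reader(["en"])  # downloads detection/recognition models on first run
    for bbox, text, confidence in reader.readtext("scanned_page.png"):
        print(f"{confidence:.2f}  {text}")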


You may want to try https://github.com/VikParuchuri/surya (I'm the author). I've only benchmarked against tesseract, but it outperforms it by a lot (benchmarks in repo). Happy to discuss.

You could also try https://github.com/VikParuchuri/marker for general PDF parsing (I'm also the author) - it seems like you're more focused on tables.


How does surya compare to AWS Textract? A previous employer went through a bunch of different OCRs and ended up using Textract because they found it to be the most accurate overall.


I unfortunately haven't had time to benchmark against more than tesseract.


That’s my experience as well. I am still looking for alternatives, but Textract is now the baseline.


Thanks for sharing.


PaddleOCR works pretty well. How are you planning to integrate it into your workflow? I found huge differences in throughput between Python serving and inference frameworks (e.g. NVIDIA Triton Inference Server).


Grateful for your insight! Could you explain the reason for the switch? Is there any benchmark data available for sharing?


Performance depends on the language and the type of docs. The main reason for contemplating the switch is that EasyOCR seems to not be maintained anymore (no commits in the repo in the last 5 months).


Can it detect and strip out advertisements?



