Yeah, I am going to add an experiment that runs every day, and the cost of that will be a column on the table. It will be something like "summarize this article in 200 words", and every model gets the same prompt + article.
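Roughly the shape of the daily run I have in mind. This is only a sketch; the model names and per-token prices below are placeholders, not the real figures the table will use:

    # Sketch: same prompt + article to every model, record cost per model.
    # PRICING values are placeholders -- substitute the rates you actually track.
    from openai import OpenAI

    client = OpenAI()

    PRICING = {  # (input, output) in USD per 1M tokens -- hypothetical numbers
        "model-a": (0.15, 0.60),
        "model-b": (2.50, 10.00),
    }

    def summarize_and_cost(model: str, article: str) -> dict:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": f"Summarize this article in 200 words:\n\n{article}"}],
        )
        in_price, out_price = PRICING[model]
        cost = (resp.usage.prompt_tokens * in_price
                + resp.usage.completion_tokens * out_price) / 1_000_000
        return {"model": model, "cost_usd": cost,
                "summary": resp.choices[0].message.content}

    article = open("article.txt").read()
    rows = [summarize_and_cost(m, article) for m in PRICING]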
For me, and I suspect a lot of other HN readers, a comparison/benchmark on a coding task would be more useful. Something small enough that you can affordably run it every day across a reasonable range of coding focused models, but non trivial enough to be representative of day to day AI assisted coding.
One other idea - for people spending $20 or $200/month for AI coding tools, a monitoring service that tracks and alerts on detected pricing changes could be something worth paying for. I'd definitely subscribe at $5/month for something like that, and I'd consider paying more, possibly even talking work into paying $20 or $30 per month.
LlamaIndex is building a platform for AI agents that can find information, synthesize insights, generate reports, and take actions over the most complex enterprise data.
We are seeking an exceptional engineer to join our growing LlamaParse team. You will work at the intersection of document processing, machine learning, and software engineering to push the boundaries of what's possible in document understanding. As a key member of a focused team, you will have a significant impact on our product's direction and technical architecture.
We are also hiring for a range of other roles, see our career page:
Hi Pierre, I see that the Platform Engineer position (which probably matches me most) says it's Hybrid. I'm very interested, but I live in Ohio. I understand sometimes things get clicked by accident, and just wanted to know if there might be an issue with this listing, or if it's truly hybrid and the one you posted is remote, etc. Don't want to gum up the works :)
If you want to try agentic parsing we added support for sonnet-3.7 agentic parse and gemini 2.0 in llamaParse. cloud.llamaindex.ai/parse (select advanced options / parse with agent then a model)
However, this comes at a high cost in tokens and latency, but results in much better parse quality. Hopefully with newer models this can be improved.
This is a nice UI for end users; however, it seems to be a wrapper on top of mutool, which is distributed under the AGPL. If you want to process PDFs locally, legally, and safely, you should use their CLI instead.
We’ve been doing exactly this by doubling-down on VLMs (https://vlm.run)
- VLMs are way better at handling layout and context where OCR systems fail miserably
- VLMs read documents like humans do, which makes dealing with special layouts like bullets, tables, charts, footnotes much more tractable with a singular approach rather than have to special case a whole bunch of OCR + post-processing
- VLMs are definitely more expensive, but can be specialized and distilled for accurate and cost effective inference
In general, I think vision + LLMs can be trained explicitly to “extract” information and avoid reasoning/hallucinating about the text. The reasoning can be another module altogether.
I did a ton of Googling before writing this code, but I couldn't find you guys anywhere. If I had, I'd have definitely used your stuff. You might want to think about running some small-scale Google Ads campaigns. They could be especially effective if you target people searching for both LLM and OCR together. Great product, congrats!
What about combining old school OCR with GPT visual OCR?
If your old-school OCR output contains text that is not present in the visual one, but is coherent (e.g. English sentences), you could take it and slot it into the missing place in the visual output.
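Something like this merge step is what I have in mind (a rough sketch; the sentence splitter and similarity threshold are just illustrative heuristics):

    # Sketch: find sentences the classical OCR recovered but the VLM output
    # missed (or mangled), so they can be slotted back in.
    import re
    from difflib import SequenceMatcher

    def sentences(text):
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def missing_from_vlm(classical_text, vlm_text, threshold=0.8):
        vlm_sents = sentences(vlm_text)
        missing = []
        for s in sentences(classical_text):
            best = max((SequenceMatcher(None, s, v).ratio() for v in vlm_sents),
                       default=0.0)
            if best < threshold:
                missing.append(s)
        return missing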
You're absolutely right. I use PDFTron (through CloudConvert) for full document OCR, but for pages with fewer than 100 characters, I switch to this API. It's a great combo – I get the solid OCR performance of SolidDocument for most content, but I can also handle tricky stuff like stats, old-fashioned text, or handwriting that regular OCR struggles with. That's why I added page numbers upfront.
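The routing itself is simple. A sketch, assuming pypdf for the per-page character count; the visual OCR call is a stand-in for whatever API you use:

    # Sketch: pages with little extractable text go to the visual OCR API,
    # the rest keep the classical OCR output.
    from pypdf import PdfReader

    def route_pages(path, min_chars=100):
        reader = PdfReader(path)
        for page_num, page in enumerate(reader.pages, start=1):
            text = (page.extract_text() or "").strip()
            engine = "visual" if len(text) < min_chars else "classical"
            yield page_num, engine, text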
Good article, but what is the alternative? What can you build today as a software engineer that can have impact? Nothing seems to come close to AI / AI infra, even if it's hard / risky / a moving landscape.
I would almost invert that statement. Sorry if this comes off ranty, but what exactly are people doing in the "AI space" currently that isn't "undifferentiated spam/chatbot" being sold to non-techies who heard about AI on NPR? What are real people using "AI" for that is so insanely valuable today? How much "company Y: same product with a chat window, sparks emoji" do we all need before this thing levels out and we all take a breather on the hype?
- writing and refactoring code. probably 50 times a day now
- improving documentation across the company
- summarizing meetings automatically with follow ups
- drafting most legal work before a lawyer edits (saved 70% on legal bills)
- entity extraction and data cleanup for my users
Put a number on it. How much value of this will they capture from you personally (we'll assume, very very charitably by the sound of it, that you represent an "average" user of AI products) when this market matures? Exactly how much will your employer pay for a meeting summarizer? $10/mo a seat, $20/mo a seat, $50/mo a seat? Could the product sustain a 5x, 10x, 50x price hike that is going to have to happen to recoup the investment being made today?
Agreed. Even if right now this seems like stuff companies want to throw money at for novelty/FOMO related reasons, I think eventually reality ought to catch up.
Probably an unpopular opinion, but I think the most efficient companies of the future will tackle the ironies of automation effectively: Carefully designing semi automation that keeps humans in the loop in a way that maximises their value - as opposed to just being bored rubber stamping the automation without really paying attention.
I'd say if you're not using a meeting summarizer, you're wasting someone's time by having them write up notes. If you're not writing up notes, you're wasting someone else's time recapping the meeting for them. Meeting notes are a 1 (meeting):many relationship for conveying what was discussed. How else do you go back and see what the one person on the storage team talked about with the person on your team who left last week, so you can go into the next meeting with them prepared?
If your meeting produces "notes", and those are relevant for people that were not in it, you are doing it wrong.
If your meeting is aimed at producing "general understanding", it's already a dangerous one, and the understanding should go to the correct documentation (what is best done during the meeting). Otherwise, it should produce "focused understanding" between a few people and with immediate application.
If all you take from it is notes, well, I'm really sure that your team won't go digging through meeting notes every time they need to learn about some new context. Meeting notes are useful for CYA only, and if people feel safe they'll be filed directly to /dev/null.
Going to be vague, but I'm using it to scale out human processes in ways I couldn't using humans (because they cost too much) or regular code (because it's unstructured). Early results are promising: we've found a bunch of stuff that had been buried... and is potentially worth millions. Not a chat wrapper, just breathing new life into our regular old business.
What do you consider "AI"? Because machine learning models have been deployed in enterprise systems for years. Video processing, security, data labeling, sentiment analysis. The sexiest one I can think of in recent memory is nVidia DLSS.
Broadly, what marketing is saying is “AI.” There is huge value being created with deep learning today on internal systems. Recommenders, machine translation, computational photography… it is huge, improves people's lives, drives revenue.
None of that is marketed as "AI." It's just a thing the computer does. The single most valuable application of deep learning so far (content recommenders) is a cultural phenomenon, but it’s not referred to as “AI” but rather “the algorithm.”
Not sure why this is downvoted; that is the key question. Impact means different things to people. Could be:
1. Building a sustainable business and making decent money
2. Building a market leader and making ludicrous amounts of money
3. Advancing the state of the art in technology
4. Helping people with their little daily struggles
5. Solving pressing problems humanity is facing
Or many other things I suppose. Now if you believe that AI is eventually going to make anything humans can build now redundant, that'd be a reason to believe nothing else matters in the end I suppose. But even if we get there, there's a lot of road leading to that destination. Any step provides value. Software built today can provide value even if nobody is going to need it ten years from now. And it's not like you could even predict that.
The motive is to get acquired in most cases. It's obvious, and starts to make sense, when you see a startup that has no feasible monetisation strategy on the horizon, yet it exists and gets funding. They're betting on building infra that will hopefully be used in a large corp, and this is their demo/PoC.
Anything SaaS that solves a pain point for established industries: those that have had billions in turnover for decades already, are not good at building tech themselves, and buy solutions/services to run their business. Bonus points for low barriers to entry. Agriculture, logistics, real estate, energy, etc.
I have a theory that the days of established businesses that don't know tech are dwindling. A lot of companies that have adopted tech have started building a small foundation of talent internally. I think you're seeing this trend accelerate with the large tech companies laying people off. I have heard about top-grade data science talent landing at some small-sized health plan.
My company's fastest-growing competitor is "internally sourced departments" for the services we provide.
Yes, computer savviness is on the rise, has been for decades, and this will continue. But there are many levels: the ability to be a competent user, a competent buyer, and the ability to build it themselves. Then there are big differences in buy-vs-build culture, and in preferences for the types of solutions that are default buy vs. default build. And finally, small and medium-sized organizations are less likely to have internal teams. All this should be analyzed for the specific market, product, and customer segment one targets.
Slightly different take than some of the siblings: you can still just build this stuff. If your goal is impact, maybe the best place to do it will be at a cloud vendor or other big corp. If your goal is actually just a big VC exit, then maybe not.
If your product is something that can be ripped off in 3 months, then it probably wasn’t going to have a long term impact anyway.
Define ‘impact’. Does ‘impact’ here mean ‘tickles the fancy of a 2024-era VC’? If so, you may be right. If used in its common meaning, absolutely not; most of this stuff is ~useless.
All the same stuff, to be honest. If AI is set to replace human work, well we have had a cheap human labour market for decades and yet we still need software. An LLM can't replace a business itself, which is made up of niche processes, direction and purpose, which we sometimes codify into a SaaS. We'll still need to do all that even if AI replaces some of the human parts of the business.
> What can you build today as a software engineer that can have impact?
Quite a bit, if you don’t follow the standard tech hype. Find an industry that isn’t tech-first and you’ll notice that there’s a lot of room for improvement.
I've made quite good conversions from PDF to markdown with https://github.com/VikParuchuri/marker . It's slow but worth a shot. Markdown should be easily parseable by a RAG pipeline.
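For the RAG side, a minimal sketch of heading-based chunking of the markdown output (pure stdlib, nothing marker-specific):

    # Sketch: split markdown output into heading-scoped chunks that can be
    # embedded individually for retrieval.
    import re

    def markdown_chunks(md):
        chunks, current = [], []
        for line in md.splitlines():
            if re.match(r"^#{1,6}\s", line) and current:
                chunks.append("\n".join(current).strip())
                current = []
            current.append(line)
        if current:
            chunks.append("\n".join(current).strip())
        return [c for c in chunks if c]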
I'm trying to get a similar system set up on my computer.
I'm part of the team that built LlamaParse. It's a net improvement compared to other PDF -> structured text extractors (I built several in the past, including https://github.com/axa-group/Parsr).
For character extraction, LlamaParse uses a mixture of OCR and character extraction from the PDF (it's the only parser I'm aware of that addresses some of the buggy PDF font issues; check the 'text' mode to see the raw document before reconstruction), and then uses a mixture of heuristics and machine learning models to reconstruct the document.
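If you want to try it, the client usage is roughly this (from the llama_parse Python package; check the docs for current parameter names):

    # Rough client usage sketch, not the full feature set.
    from llama_parse import LlamaParse

    parser = LlamaParse(
        api_key="llx-...",        # from cloud.llamaindex.ai
        result_type="markdown",   # or "text" to see the raw pre-reconstruction output
    )
    documents = parser.load_data("my_report.pdf")
    print(documents[0].text[:500])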
For context: I’m an engineering manager for the production systems of one of the biggest mortgage companies. Think millions upon millions of PDFs of all kinds.
1. The comparison with the open source pdf libraries is rather strange. They are self contained libraries. Not ML augmented services. Does this mean you plan to open source the underlying technology and offer it under the same licenses as pypdf or pymupdf?
2. How does this compare to AWS Kendra? I have one of the bigger deployments out there and am looking for alternatives.
3. How fast is the pdf extraction? In terms of microseconds given we pay for execution time.
Making something appear as if it might be an open source library, but is actually just a wrapper to a paid hosted service is endemic in the Python world in general and the LLM space in particular.
Feel free to try - https://github.com/nlmatics/llmsherpa. It is fully open source - both client and server - and it is not ML-augmented, so it is very fast and cheap to run.
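Minimal usage sketch, following the README; the hosted endpoint URL is the one documented at the time of writing, and you can point it at a self-hosted server instead:

    # Sketch of llmsherpa client usage per the README.
    from llmsherpa.readers import LayoutPDFReader

    api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
    reader = LayoutPDFReader(api_url)
    doc = reader.read_pdf("my_report.pdf")
    for chunk in doc.chunks():
        print(chunk.to_context_text())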
It’s too expensive for the features it provides. The person who estimated the cost got it wrong and we're paying 3x. But I can’t just rip it out due to it powering a very important piece of software.
If you're serious about this, I'm working on a new startup that is 100% focused on using different techniques to solve it. I'd love to talk to you about what your needs are, as it would give me more data points as I take this to market. My email is in my profile if you want to shoot me a message.
(1) The "baseline" comparison was to PyPDF + Naive RAG. For the LlamaParse evaluation, you appear to have used a different RAG pipeline, called "recursive retrieval." Why not use the same pipeline to demonstrate the improvement from LlamaParse? Can you share the code to your evaluation for LlamaParse?
(2) I ran the benchmark for the PyPDF + Naive RAG solution, directly copying the code on the linked LlamaIndex repo [1]
I got very different numbers:
mean_correctness_score 3.941
mean_relevancy_score 0.826
mean_faithfulness_score 0.980
You reported:
mean_correctness_score 3.874
mean_relevancy_score 0.844
mean_faithfulness_score 0.667
Notably, the faithfulness score I measured for the baseline solution was actually higher than that reported for your proprietary LlamaParse based solution.
Thanks for running through the benchmark! Just to clarify some things:
(1) The idea is that LlamaParse's markdown representation lends itself to the rest of LlamaIndex's advanced indexing/retrieval abstractions. Recursive retrieval is a fancy retrieval method designed to model documents with embedded objects, but it depends on good PDF parsing. Naive PyPDF parsing can't be used with recursive retrieval. Our goal is to demonstrate the e2e RAG capabilities of LlamaParse + advanced retrieval vs. what you can build with a naive PDF parser (rough sketch of the pipeline below, after point 2).
(2) Since we use LLM-based evals, your correctness and relevancy metrics look to be consistent and within the margin of error (and lower than our LlamaParse metrics). The faithfulness score seems way off though, and quite high on your side, so I'm not sure what's going on there. Maybe hop into our Discord and share the results in our channel?
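For reference, the recursive retrieval setup from (1) looks roughly like this. This is a sketch rather than the exact benchmark code, and module paths may differ between llama_index versions:

    # Sketch of LlamaParse markdown -> recursive retrieval; imports follow the
    # llama_index 0.10-style layout and may differ in other versions.
    from llama_parse import LlamaParse
    from llama_index.core import VectorStoreIndex
    from llama_index.core.node_parser import MarkdownElementNodeParser
    from llama_index.llms.openai import OpenAI

    documents = LlamaParse(result_type="markdown").load_data("report.pdf")

    node_parser = MarkdownElementNodeParser(llm=OpenAI(model="gpt-3.5-turbo"))
    nodes = node_parser.get_nodes_from_documents(documents)
    base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

    # index over text nodes plus the extracted table/object nodes
    index = VectorStoreIndex(nodes=base_nodes + objects)
    query_engine = index.as_query_engine(similarity_top_k=15)
    print(query_engine.query("What was revenue in Q3?"))  # example question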
One of the things I've been helping a team with is dealing with mountains of PPT decks, converted to PDF and then parsed/chunked/embedded into vector storage. It doesn't work that well because a PPT is not a document. What are your thoughts on dealing with other formats that are first converted to PDF?
I tried LlamaParse and was impressed by the result on a document with a complex layout. None of the open-source parsers gave me results that came even close. Can you please share how much time it took your team to build this parser?
Yes, however we will soon support other filetypes natively, and this will lead to better results (when converting from one format to another, there is often some information loss)
You may want to try https://github.com/VikParuchuri/surya (I'm the author). I've only benchmarked against tesseract, but it outperforms it by a lot (benchmarks in repo). Happy to discuss.
How does surya compare to AWS Textract? A previous employer went through a bunch of different OCRs and ended up using Textract because they found it to be the most accurate overall.
PaddleOCR works pretty well, how are you planning to integrate it in your workflow? I found huge differences in throughput between python serving and frameworks (i.e. NVIDIA Triton Inference Server).
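For context, this is the basic PaddleOCR call you end up wrapping behind Triton or a Python server (standard paddleocr usage; exact arguments vary between versions):

    # Standard paddleocr usage sketch.
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True, lang="en")  # loads det/rec/cls models once
    result = ocr.ocr("page.png", cls=True)
    for box, (text, confidence) in result[0]:
        print(confidence, text)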
Performance depends on the language / type of docs. The main reason for contemplating a switch is that easyOCR seems to not be maintained anymore (no commits in the repo in the last 5 months).