Yeah, I am going to add an experiment that runs every day, and the cost of that will be a column on the table. It will be something like "summarize this article in 200 words", and every model gets the same prompt + article.
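Roughly the shape of the daily run I have in mind. This is only a sketch; the model names and per-token prices below are placeholders, not the real figures the table will use:

    # Sketch: same prompt + article to every model, record cost per model.
    # PRICING values are placeholders -- substitute the rates you actually track.
    from openai import OpenAI

    client = OpenAI()

    PRICING = {  # (input, output) in USD per 1M tokens -- hypothetical numbers
        "model-a": (0.15, 0.60),
        "model-b": (2.50, 10.00),
    }

    def summarize_and_cost(model: str, article: str) -> dict:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": f"Summarize this article in 200 words:\n\n{article}"}],
        )
        in_price, out_price = PRICING[model]
        cost = (resp.usage.prompt_tokens * in_price
                + resp.usage.completion_tokens * out_price) / 1_000_000
        return {"model": model, "cost_usd": cost,
                "summary": resp.choices[0].message.content}

    article = open("article.txt").read()
    rows = [summarize_and_cost(m, article) for m in PRICING]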
For me, and I suspect a lot of other HN readers, a comparison/benchmark on a coding task would be more useful. Something small enough that you can affordably run it every day across a reasonable range of coding focused models, but non trivial enough to be representative of day to day AI assisted coding.
One other idea - for people spending $20 or $200/month for AI coding tools, a monitoring service that tracks and alerts on detected pricing changes could be something worth paying for. I'd definitely subscribe at $5/month for something like that, and I'd consider paying more, possibly even talking work into paying $20 or $30 per month.
LlamaIndex is building a platform for AI agents that can find information, synthesize insights, generate reports, and take actions over the most complex enterprise data.
We are seeking an exceptional engineer to join our growing LlamaParse team. You will work at the intersection of document processing, machine learning, and software engineering to push the boundaries of what's possible in document understanding. As a key member of a focused team, you will have a significant impact on our product's direction and technical architecture.
We are also hiring for a range of other roles, see our career page:
Hi Pierre, I see that the Platform Engineer position (which probably matches me most) says it's Hybrid. I'm very interested, but I live in Ohio. I understand sometimes things get clicked by accident, and just wanted to know if there might be an issue with this listing, or if it's truly hybrid and the one you posted is remote, etc. Don't want to gum up the works :)
If you want to try agentic parsing we added support for sonnet-3.7 agentic parse and gemini 2.0 in llamaParse. cloud.llamaindex.ai/parse (select advanced options / parse with agent then a model)
However, this comes at a high cost in tokens and latency, but results in much better parse quality. Hopefully with newer models this can be improved.
This is a nice UI for end users; however, it seems to be a wrapper on top of mutool, which is distributed under the AGPL. If you want to process PDFs locally, legally, and safely, you should use their CLI instead.
We’ve been doing exactly this by doubling-down on VLMs (https://vlm.run)
- VLMs are way better at handling layout and context where OCR systems fail miserably
- VLMs read documents like humans do, which makes dealing with special layouts like bullets, tables, charts, footnotes much more tractable with a singular approach rather than have to special case a whole bunch of OCR + post-processing
- VLMs are definitely more expensive, but can be specialized and distilled for accurate and cost effective inference
In general, I think vision + LLMs can be trained explicitly to “extract” information and avoid reasoning/hallucinating about the text. The reasoning can be another module altogether.
I did a ton of Googling before writing this code, but I couldn't find you guys anywhere. If I had, I'd have definitely used your stuff. You might want to think about running some small-scale Google Ads campaigns. They could be especially effective if you target people searching for both LLM and OCR together. Great product, congrats!
What about combining old school OCR with GPT visual OCR?
If your old-school OCR output contains text that is not present in the visual one, but is coherent (e.g. English sentences), you could take it and slot it into the missing place in the visual output.
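Something like this merge step is what I have in mind (a rough sketch; the sentence splitter and similarity threshold are just illustrative heuristics):

    # Sketch: find sentences the classical OCR recovered but the VLM output
    # missed (or mangled), so they can be slotted back in.
    import re
    from difflib import SequenceMatcher

    def sentences(text):
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def missing_from_vlm(classical_text, vlm_text, threshold=0.8):
        vlm_sents = sentences(vlm_text)
        missing = []
        for s in sentences(classical_text):
            best = max((SequenceMatcher(None, s, v).ratio() for v in vlm_sents),
                       default=0.0)
            if best < threshold:
                missing.append(s)
        return missing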
You're absolutely right. I use PDFTron (through CloudConvert) for full document OCR, but for pages with fewer than 100 characters, I switch to this API. It's a great combo – I get the solid OCR performance of SolidDocument for most content, but I can also handle tricky stuff like stats, old-fashioned text, or handwriting that regular OCR struggles with. That's why I added page numbers upfront.
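The routing itself is simple. A sketch, assuming pypdf for the per-page character count; the visual OCR call is a stand-in for whatever API you use:

    # Sketch: pages with little extractable text go to the visual OCR API,
    # the rest keep the classical OCR output.
    from pypdf import PdfReader

    def route_pages(path, min_chars=100):
        reader = PdfReader(path)
        for page_num, page in enumerate(reader.pages, start=1):
            text = (page.extract_text() or "").strip()
            engine = "visual" if len(text) < min_chars else "classical"
            yield page_num, engine, text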
Good article, but what is the alternative? What can you build today as a software engineer that can have impact? Nothing seems to come close to AI / AI infra, even if it's hard / risky / a moving landscape.
I would almost invert that statement. Sorry if this comes off ranty, but what exactly are people doing in the "AI space" currently that isn't "undifferentiated spam/chatbot" being sold to non-techies who heard about AI on NPR? What are real people using "AI" for that is so insanely valuable today? How much "company Y: same product with a chat window, sparks emoji" do we all need before this thing levels out and we all take a breather on the hype?
- writing and refactoring code. probably 50 times a day now
- improving documentation across the company
- summarizing meetings automatically with follow ups
- drafting most legal work before a lawyer edits (saved 70% on legal bills)
- entity extraction and data cleanup for my users
Put a number on it. How much value of this will they capture from you personally (we'll assume, very very charitably by the sound of it, that you represent an "average" user of AI products) when this market matures? Exactly how much will your employer pay for a meeting summarizer? $10/mo a seat, $20/mo a seat, $50/mo a seat? Could the product sustain a 5x, 10x, 50x price hike that is going to have to happen to recoup the investment being made today?
Agreed. Even if right now this seems like stuff companies want to throw money at for novelty/FOMO related reasons, I think eventually reality ought to catch up.
Probably an unpopular opinion, but I think the most efficient companies of the future will tackle the ironies of automation effectively: Carefully designing semi automation that keeps humans in the loop in a way that maximises their value - as opposed to just being bored rubber stamping the automation without really paying attention.
I'd say if you're not using a meeting summarizer, you're wasting someone's time by having them write up notes. If you're not writing up notes, you're wasting someone else's time recapping the meeting for them. Meeting notes are a 1 (meeting):many relationship for conveying what was discussed. How else do you go back and see what the one person on the storage team talked about with the person on your team who left last week, so you can go into the next meeting with them prepared?
If your meeting produces "notes", and those are relevant for people that were not in it, you are doing it wrong.
If your meeting is aimed at producing "general understanding", it's already a dangerous one, and the understanding should go to the correct documentation (what is best done during the meeting). Otherwise, it should produce "focused understanding" between a few people and with immediate application.
If all you take from it is notes, well, I'm really sure that your team won't go digging through meeting notes every time they need to learn about some new context. Meeting notes are useful for CYA only, and if people feel safe they'll be filed directly to /dev/null.
Going to be vague, but I'm using it to scale out human processes in ways I couldn't using humans (because they cost too much) or regular code (because it's unstructured). Early results are promising: we've found a bunch of stuff that had been buried... and is potentially worth millions. Not a chat wrapper, just breathing new life into our regular old business.
What do you consider "AI"? Because machine learning models have been deployed in enterprise systems for years. Video processing, security, data labeling, sentiment analysis. The sexiest one I can think of in recent memory is nVidia DLSS.
Broadly, what marketing is saying is “AI.” There is huge value being created with deep learning today on internal systems. Recommenders, machine translation, computational photography… it is huge, improves people's lives, drives revenue.
None of that is marketed as "AI." It's just a thing the computer does. The single most valuable application of deep learning so far (content recommenders) is a cultural phenomenon, but it’s not referred to as “AI” but rather “the algorithm.”
Not sure why this is downvoted; that is the key question. Impact means different things to people. Could be:
1. Building a sustainable business and making decent money
2. Building a market leader and making ludicrous amounts of money
3. Advancing the state of the art in technology
4. Helping people with their little daily struggles
5. Solving pressing problems humanity is facing
Or many other things I suppose. Now if you believe that AI is eventually going to make anything humans can build now redundant, that'd be a reason to believe nothing else matters in the end I suppose. But even if we get there, there's a lot of road leading to that destination. Any step provides value. Software built today can provide value even if nobody is going to need it ten years from now. And it's not like you could even predict that.
The motive is to get acquired in most cases. It's obvious, and starts to make sense, when you see a startup that has no feasible monetisation strategy on the horizon, yet it exists and gets funding. They're betting on building infra that will hopefully be used in a large corp, and this is their demo/PoC.
Anything SaaS that solves a pain point for established industries: those that have had billions in turnover for decades already, are not good at building tech themselves, and buy solutions/services to run their business. Bonus points for low barriers to entry. Agriculture, logistics, real estate, energy, etc.
I have a theory that the days of established businesses that don't know tech are dwindling. A lot of companies that have adopted tech have started building a small foundation of talent internally. I think you're seeing this trend accelerate with the large tech companies laying people off. I have heard about top-grade data science talent landing at some small-sized health plan.
My company's fastest-growing competitor is "internally sourced departments" for the services we provide.
Yes, computer savviness is on the rise, has been for decades, and this will continue. But there are many levels: the ability to be a competent user, a competent buyer, and the ability to build it themselves. Then there are big differences in buy-vs-build culture, and in preferences for the types of solutions that are default buy vs. default build. And finally, small and medium-sized organizations are less likely to have internal teams. All this should be analyzed for the specific market, product, and customer segment one targets.
Slightly different take than some of the siblings: you can still just build this stuff. If your goal is impact, maybe the best place to do it will be at a cloud vendor or other big corp. If your goal is actually just a big VC exit, then maybe not.
If your product is something that can be ripped off in 3 months, then it probably wasn’t going to have a long term impact anyway.
Define ‘impact’. Does ‘impact’ here mean ‘tickles the fancy of a 2024-era VC’? If so, you may be right. If used in its common meaning, absolutely not; most of this stuff is ~useless.
All the same stuff, to be honest. If AI is set to replace human work, well we have had a cheap human labour market for decades and yet we still need software. An LLM can't replace a business itself, which is made up of niche processes, direction and purpose, which we sometimes codify into a SaaS. We'll still need to do all that even if AI replaces some of the human parts of the business.
> What can you build today as a software engineer that can have impact?
Quite a bit, if you don’t follow the standard tech hype. Find an industry that isn’t tech-first and you’ll notice that there’s a lot of room for improvement.
I've made quite good conversions from PDF to markdown with https://github.com/VikParuchuri/marker . It's slow but worth a shot. Markdown should be easily parseable by a RAG pipeline.
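For the RAG side, a minimal sketch of heading-based chunking of the markdown output (pure stdlib, nothing marker-specific):

    # Sketch: split markdown output into heading-scoped chunks that can be
    # embedded individually for retrieval.
    import re

    def markdown_chunks(md):
        chunks, current = [], []
        for line in md.splitlines():
            if re.match(r"^#{1,6}\s", line) and current:
                chunks.append("\n".join(current).strip())
                current = []
            current.append(line)
        if current:
            chunks.append("\n".join(current).strip())
        return [c for c in chunks if c]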
I'm trying to get a similar system set up on my computer.
I'm part of the team that built LlamaParse. It's a net improvement compared to other PDF -> structured text extractors (I built several in the past, including https://github.com/axa-group/Parsr).
For character extraction, LlamaParse uses a mixture of OCR and character extraction from the PDF (it's the only parser I'm aware of that addresses some of the buggy PDF font issues; check the 'text' mode to see the raw document before reconstruction), and then uses a mixture of heuristics and machine learning models to reconstruct the document.
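If you want to try it, the client usage is roughly this (from the llama_parse Python package; check the docs for current parameter names):

    # Rough client usage sketch, not the full feature set.
    from llama_parse import LlamaParse

    parser = LlamaParse(
        api_key="llx-...",        # from cloud.llamaindex.ai
        result_type="markdown",   # or "text" to see the raw pre-reconstruction output
    )
    documents = parser.load_data("my_report.pdf")
    print(documents[0].text[:500])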
For context: I’m an engineering manager for the production systems of one of the biggest mortgage companies. Think millions upon millions of PDFs of all kinds.
1. The comparison with the open source pdf libraries is rather strange. They are self contained libraries. Not ML augmented services. Does this mean you plan to open source the underlying technology and offer it under the same licenses as pypdf or pymupdf?
2. How does this compare to AWS Kendra? I have one of the bigger deployments out there and am looking for alternatives.
3. How fast is the pdf extraction? In terms of microseconds given we pay for execution time.
Making something appear as if it might be an open source library, but is actually just a wrapper to a paid hosted service is endemic in the Python world in general and the LLM space in particular.
Feel free to try - https://github.com/nlmatics/llmsherpa. It is fully open source - both client and server - and it is not ML-augmented, so it is very fast and cheap to run.
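Minimal usage sketch, following the README; the hosted endpoint URL is the one documented at the time of writing, and you can point it at a self-hosted server instead:

    # Sketch of llmsherpa client usage per the README.
    from llmsherpa.readers import LayoutPDFReader

    api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
    reader = LayoutPDFReader(api_url)
    doc = reader.read_pdf("my_report.pdf")
    for chunk in doc.chunks():
        print(chunk.to_context_text())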
It’s too expensive for the features it provides. The person who estimated the cost got it wrong and we're paying 3x. But I can’t just rip it out due to it powering a very important piece of software.
If you're serious about this, I'm working on a new startup that is 100% focused on using different techniques to solve it. I'd love to talk to you about what your needs are, as it would give me more data points as I take this to market. My email is in my profile if you want to shoot me a message.
(1) The "baseline" comparison was to PyPDF + Naive RAG. For the LlamaParse evaluation, you appear to have used a different RAG pipeline, called "recursive retrieval." Why not use the same pipeline to demonstrate the improvement from LlamaParse? Can you share the code to your evaluation for LlamaParse?
(2) I ran the benchmark for the PyPDF + Naive RAG solution, directly copying the code on the linked LlamaIndex repo [1]
I got very different numbers:
mean_correctness_score 3.941
mean_relevancy_score 0.826
mean_faithfulness_score 0.980
You reported:
mean_correctness_score 3.874
mean_relevancy_score 0.844
mean_faithfulness_score 0.667
Notably, the faithfulness score I measured for the baseline solution was actually higher than that reported for your proprietary LlamaParse based solution.
Thanks for running through the benchmark! Just to clarify some things:
(1) The idea is that LlamaParse's markdown representation lends itself to the rest of LlamaIndex's advanced indexing/retrieval abstractions. Recursive retrieval is a fancy retrieval method designed to model documents with embedded objects, but it depends on good PDF parsing. Naive PyPDF parsing can't be used with recursive retrieval. Our goal is to demonstrate the e2e RAG capabilities of LlamaParse + advanced retrieval vs. what you can build with a naive PDF parser (rough sketch of the pipeline below, after point 2).
(2) Since we use LLM-based evals, your correctness and relevancy metrics look to be consistent and within the margin of error (and lower than our LlamaParse metrics). The faithfulness score seems way off though, and quite high on your side, so I'm not sure what's going on there. Maybe hop into our Discord and share the results in our channel?
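For reference, the recursive retrieval setup from (1) looks roughly like this. This is a sketch rather than the exact benchmark code, and module paths may differ between llama_index versions:

    # Sketch of LlamaParse markdown -> recursive retrieval; imports follow the
    # llama_index 0.10-style layout and may differ in other versions.
    from llama_parse import LlamaParse
    from llama_index.core import VectorStoreIndex
    from llama_index.core.node_parser import MarkdownElementNodeParser
    from llama_index.llms.openai import OpenAI

    documents = LlamaParse(result_type="markdown").load_data("report.pdf")

    node_parser = MarkdownElementNodeParser(llm=OpenAI(model="gpt-3.5-turbo"))
    nodes = node_parser.get_nodes_from_documents(documents)
    base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

    # index over text nodes plus the extracted table/object nodes
    index = VectorStoreIndex(nodes=base_nodes + objects)
    query_engine = index.as_query_engine(similarity_top_k=15)
    print(query_engine.query("What was revenue in Q3?"))  # example question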
One of the things I've been helping a team with is dealing with mountains of PPT decks, converted to PDF and then parsed/chunked/embedded into vector storage. It doesn't work that well because a PPT is not a document. What are your thoughts on dealing with other formats that are first converted to PDF?
I tried LlamaParse and was impressed by the result on a document with a complex layout. None of the open-source parsers gave me results that came even close. Can you please share how much time it took your team to build this parser?
Yes, however we will soon support other filetypes natively, and this will lead to better results (when converting from one format to another, there is often some information loss)
You may want to try https://github.com/VikParuchuri/surya (I'm the author). I've only benchmarked against tesseract, but it outperforms it by a lot (benchmarks in repo). Happy to discuss.
How does surya compare to AWS Textract? A previous employer went through a bunch of different OCRs and ended up using Textract because they found it to be the most accurate overall.
PaddleOCR works pretty well, how are you planning to integrate it in your workflow? I found huge differences in throughput between python serving and frameworks (i.e. NVIDIA Triton Inference Server).
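For context, this is the basic PaddleOCR call you end up wrapping behind Triton or a Python server (standard paddleocr usage; exact arguments vary between versions):

    # Standard paddleocr usage sketch.
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True, lang="en")  # loads det/rec/cls models once
    result = ocr.ocr("page.png", cls=True)
    for box, (text, confidence) in result[0]:
        print(confidence, text)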
Performance depends on the language / type of docs. The main reason for contemplating a switch is that easyOCR seems to not be maintained anymore (no commits in the repo in the last 5 months).