The big LLM-based rerankers (e.g. Qwen3-reranker) are what you always wanted your cross-encoder to be, and I highly recommend giving them a try. Unfortunately they're also quite computationally expensive.
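For anyone who wants to try it, here's a minimal Python sketch of the reranking step using the sentence-transformers CrossEncoder API. The model name is just a placeholder (the Qwen3 rerankers have their own usage recipe on the model card), but the shape of the step is the same: score each (query, chunk) pair jointly and re-sort.

    from sentence_transformers import CrossEncoder

    # Placeholder model; swap in whichever reranker you're evaluating.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query, chunks, top_k=5):
        # Unlike a bi-encoder, every (query, chunk) pair is scored jointly.
        scores = reranker.predict([(query, chunk) for chunk in chunks])
        ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]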
Your metadata/tabular data often contains basic facts that a human takes for granted, but which aren't repeated in every text chunk - injecting it can help a lot in making the end model seem less clueless.
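To illustrate what I mean by injecting it (the field names here are made up for the example): render the document-level metadata into a short header on each chunk before it goes into the prompt or the embedding, so the "obvious" facts travel with the text.

    def render_chunk(chunk_text, meta):
        # Prepend a compact metadata header to the chunk text.
        header = " | ".join(f"{k}: {v}" for k, v in meta.items())
        return f"[{header}]\n{chunk_text}"

    print(render_chunk(
        "The warranty covers parts and labour for manufacturing defects...",
        {"doc_title": "Warranty Policy v3", "customer": "Acme Corp",
         "effective_date": "2023-07-01", "status": "active"},
    ))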
The point about queries that don't work with simple RAG (like "summarize the most recent twenty documents") is very important to keep in mind. We made our UI very search-oriented and deemphasized the chat, to try to communicate to users that search is what's happening under the hood - the model only sees what you see.
I agree completely, especially about the difficulty of developing the user's mental model of what's going on with context, and the need to move away from chat UX. It's interesting that there are still few public examples of non-chat UIs that make context management explicit. It's possible the big names tried this and decided it wasn't worth it - but from the comments here it seems like everyone who has built a production RAG system has come to the opposite conclusion. My guess is the real reason is something else: for the consumer apps, controlling context (especially for free users) and inference time is one of the main levers for managing cost at scale. Private RAGs, on the other hand, care more about maximizing result quality and minimizing the time employees spend on a given problem, with cost per query much less of a concern - that's been my experience, at least.
I wish there was more info in the article about actual customer usage - particularly whether it improved process efficiency. It's great to focus on the technical aspects of system optimization, but unless that translates into tangible business value it's all just hype.
It's described in the remainder of the point - they use an LLM to generate additional search queries, either rephrasings of the user's query or bringing additional context from the chat history.
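Something like this, as I read it (my own sketch, not their code - the model name and prompt are placeholders): generate a few alternative queries, run all of them against the index, and merge the hits.

    from openai import OpenAI

    client = OpenAI()

    def expand_query(user_query, chat_history=""):
        # Ask the LLM for alternative phrasings, grounded in the chat history.
        prompt = (
            "Rewrite the user's question as 3 alternative search queries, one per line. "
            "Use the chat history to resolve pronouns and add missing context.\n\n"
            f"History:\n{chat_history}\n\nQuestion: {user_query}"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        rewrites = [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]
        # Search with the original query plus the rewrites, then merge results.
        return [user_query] + rewrites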
The article raises several interesting points, but I find its claim that Claude Code relies primarily on grep for code search unconvincing. It's clear that Claude Code can parse and reason about code structure, employing techniques far beyond simple regex matching. Since this assumption underpins much of the article's argument, it makes me question the overall reliability of its conclusions a bit.
Or am I completely misunderstanding how Claude Code works?
There are commercial OCR offerings from the big cloud providers (plus, like, Adobe). In my experience they generally outperform anything open-weights, although there's been a lot of improvement in VLMs in the past year or two.
Now if someone mails or faxes you that spreadsheet? You're screwed.
Spreadsheets aren't the biggest problem, though, since they have a reliable two-dimensional grid - at worst some cells are merged. The form layouts and n-dimensional table structures you find on medical and insurance documents are truly unhinged. I've seen documents that I struggled to interpret myself.
To be fair, this is problematic for humans too. My old insurer outright rejected things like that, stating they weren't legible.
(I imagine it also had the benefit of reducing fraud/errors).
In this day and age it's probably easier/better to change the process around that, as there's little excuse for such shit-quality input. I understand this isn't always possible, though.
My base expectation is that the proprietary OCR models will continue to win on real-world documents, and my guess is that this is because they have access to a lot of good private training data. These public models are trained on arxiv and e-books and stuff, which doesn't necessarily translate to typical business documents.
As mentioned though, the LLMs are usually better at avoiding character substitutions, but worse at consistency across the entire page. (Just like a non-OCR LLM, they can and will go completely off the rails.)
I will agree that the first sentence is ill-considered (if nothing else it could have been broken up with one extra phrase), but it's really the worst of it: in my opinion the prose in the rest of the article is perfectly reasonable. For anyone interested in the subject matter but discouraged by the awkward beginning, I'd urge you to press on.
Labels are so essential - even if you're not training anything, being able to quickly and objectively test your system is hugely beneficial - but they're a constant struggle to get. In the unlikely event you can get budget and priority for an SME to do the work, communicating your requirements (the need to apply very consistent rules and make few errors) is difficult, and the resulting labels tend to be messy.
More than once I've just done labeling "on my own time" - I don't know the subject as well but I have some idea what makes the neurons happy, and it saves a lot of waiting around.
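Even a small labeled set pays off quickly. A toy evaluation harness, assuming labels maps each test query to the doc ids someone marked relevant, and retrieve() stands in for whatever search pipeline you have:

    def hit_rate_at_k(labels, retrieve, k=5):
        # Fraction of test queries where at least one labeled-relevant doc
        # shows up in the retriever's top k. Assumes retrieve() yields
        # (doc_id, text) pairs.
        hits = 0
        for query, relevant_ids in labels.items():
            retrieved_ids = {doc_id for doc_id, _text in retrieve(query, k=k)}
            if retrieved_ids & set(relevant_ids):
                hits += 1
        return hits / len(labels)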
I've found tuning large models to be consistently difficult to justify. The last few years it seems like you're better off waiting six months for a better foundation model. However, we have a lot of cases where big models are just too expensive and there it can definitely be worthwhile to purpose-train something small.
It seems the camera manufacturer didn't even know (at least the part of the org communicating with the NTSB didn't) that their storage was encrypted. In any case, the media recovered were from testing/non-dive environments; during an actual dive, footage would presumably have been recorded directly to the onboard computers (which were irrecoverably destroyed).
Oceangate should take the blame for a lot of things but probably not this.