Open-source RAG infrastructure.Every team I talk to has the same experience: RAG works in the demo, breaks in production.
We handle ingestion through retrieval with optimizations baked in. 97.9% on HotpotQA vs 88.8% for standard RAG. Model-agnostic, 22+ file types, built-in citations, MCP server. MIT licensed.
4.1 is such an amazing model in so many ways. It's still my nr. 1 choice for many automation tasks. Even the mini version works quite well and it has the same massive context window (nearly 8x GPT-5). Definitely the best non-reasoning model out there for real world tasks.
Can you elaborate on that? In which part of the RAG pipeline did GPT-4.1 perform better? I would expect GPT-5 to perform better on longer context tasks, especially when it comes to understanding the pre-filtered results and reasoning about them
For large context (up to 100K tokens in some cases). We found that GPT-5:
a) has worse instruction following; doesn't follow the system prompt b) produces very long answers which resulted in a bad ux c) has 125K context window so extreme cases resulted in an error
Interesting. https://www.robert-glaser.de/prompts-as-programs-in-gpt-5/ claims GPT-5 has amazing!1!! instruction following. Is your use-case very different, or is this yet another case of "developer A got lucky, developer B tested more things"?
ChatGPT when using 5 or 5-Thinking doesn’t even follow my “custom instructions” on the web version. It’s a serious downgrade compared to the prior generation of models.
Not the original commenter but I work in the space and we have large annotated datasets with "gold" evidence that we want to retrieve, the evaluation of new models is actually very quantitative.
Ya, the original commenter likely does not work in the space - hence the ask.
> the evaluation of new models is actually very quantitative.
While you may be able to derive a % correct (and hence quantitative), they are by their nature very much not quantitative. Q&As on written subjects are very much subjective. Example benchmark: https://llm-stats.com/benchmarks/gpqa Even though there are techniques to reduce overfitting, it still isn't eliminated. So it's very much subjective.
Yeah, I've done it with industry-specific acronyms and this works well. Generate a list of company names and other terms it gets wrong, and give it definitions and any other useful context. For industry jargon, example sentences are good, but that's probably not relevant for company names.
Feed it that list and the transcript along with a simple prompt along the lines of "Attached is a transcript of a conversation created from an audio file. The model doing the transcription has trouble with company names/industry terms/acronyms/whatever else and will have made errors with those. I have also attached a list of company names/etc. that may have been spoken in the transcribed audio. Please review the transcription, and output a corrected version, along with a list of all corrections that you made. The list of corrections should include the original version of the word that you fixed, what you updated it to, and where it is in the document." If it's getting things wrong, you can also ask it to give an explanation of why it made each change that it did and use that to iterate on your prompt and the context you're giving it with your list of words.
I've had some luck with this in other contexts. Get the initial transcript from STT (e.g. whisper), then feed that in to gemini with a prompt giving it as much extra context as possible. For example "This is a transcript from a youtube video. It's a conversation between x people, where they talk about y and z. Please clean up the transcript, paying particular attention to company names and acronyms."
You typically add a lot of metadata with each chunk text to be able to filter it, and do to include in the citations. Injecting metadata means that you see what metadata adds helpful context to the LLM, and when you pass the results to the LLM you pass them in a format like this:
Quite a decent hit. Local models don't perform very well in long contexts. We're planning to support a local-only offline set-up for people to host w/o additional dependencies
reply