> There is a lot of excitement around retrieval augmented generation or “RAG.” Roughly the idea is: some of the deficiencies in current generative AI or large language models (LLMs) can be papered over by augmenting their hallucinations with links, references, and extracts from definitive source documents. I.e.: knocking the LLM back into the lane.
This seems like a misunderstanding of what RAG is. RAG is not used to try to anchor a general LLM to reality by somehow making it come up with sources and links. RAG is a technology to augment search engines with vector search and, yes, a natural language interface. This typically concerns "small" search engines indexing a specific corpus. It lets them retrieve documents or document fragments that do not contain the terms in the search query, but that are conceptually similar (according to the encoder used).
RAG isn't a cure for ChatGPT's hallucinations, at all. It's a tool to improve on and go beyond inverted indexes.
RAG really is not the right tool for that. It does not even prevent hallucinations. It is useful because it can retrieve information from documents the model never saw during its training phase, and it does not require costly re-training. It is not fine-tuning either, and it cannot fundamentally change the model’s behaviour.
In my experience, it really does reduce the risk of hallucination, especially if paired with a prompt which explicitly instructs the model to use only facts from the context window.
Another strategy is to provide unique identifiers for the RAG chunks dropped into the context and ask the model to cross-reference them in its response. You can then check that the response is citing evidence from the context with a simple pattern match.
Not OP, but for the "unique identifiers", you can think of it like the footnote style of markdown links. Most of these models are fine-tuned to do markdown well, and a short identifier is less likely to be hallucinated (my philosophy anyway), so it usually works pretty well. For example, something like this can work:
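A rough sketch of the shape of it (not anyone's production code; `retrieve_chunks` and `call_llm` are stand-ins for your own retrieval and model-call functions):

```python
import re

def answer_with_footnotes(question, retrieve_chunks, call_llm):
    # retrieve_chunks(question) -> list of text chunks, call_llm(prompt) -> str.
    # Both are placeholders for whatever retrieval / model client you already use.
    chunks = retrieve_chunks(question)

    # Give each chunk a short footnote-style identifier: [1], [2], ...
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    prompt = (
        "Answer using ONLY the numbered sources below, and cite them "
        "as markdown footnotes like [1].\n\n"
        f"{context}\n\nQUESTION: {question}"
    )
    response = call_llm(prompt)

    # Simple pattern match: every cited identifier must exist in the context we supplied.
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", response)}
    valid = set(range(1, len(chunks) + 1))
    if not cited or not cited <= valid:
        raise ValueError("response cites identifiers that were never provided")
    return response
```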
Instead of having the LLM generate the links, couldn’t you use a combination of keyword matching and similarity on the model output and the results to automatically add citations? You could use a smaller NLP model or even a rule-based system to extract entities or phrases to compare. I’m sure this is already being done by Bing, for example.
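Something along those lines could be as simple as the sketch below; the keyword-overlap scoring is just an illustrative stand-in for a proper NLP model or rule-based entity extractor:

```python
import re

def add_citations(answer, sources, min_overlap=3):
    # sources maps an id (e.g. a URL) to the retrieved document text.
    def keywords(text):
        return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}

    source_kw = {sid: keywords(text) for sid, text in sources.items()}
    cited = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        sent_kw = keywords(sentence)
        # Cite the source sharing the most keywords with this sentence, if any clears the bar.
        best_id, best_score = None, 0
        for sid, kw in source_kw.items():
            score = len(sent_kw & kw)
            if score > best_score:
                best_id, best_score = sid, score
        if best_id is not None and best_score >= min_overlap:
            sentence += f" [{best_id}]"
        cited.append(sentence)
    return " ".join(cited)
```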
You definitely can do that. It’s just sometimes simpler to dump lots of stuff in context and then check it wasn’t made up.
I like the idea of using markdown footnotes. I think that would work well - ChatGPT does handle markdown really well.
> but that definitely is one particular use of RAG
One use of RAG doesn't imply that all uses of RAG are for grounding LLM hallucinations.
Additionally, I dislike it when fallacies get called out in conversation when it doesn't appear the other person is arguing in bad faith. We all use logical fallacies by accident from time to time. A good time to call them out is when they are used in poor taste; I find that more on Reddit than on Hacker News.
I have now also engaged in the Tu Quoque fallacy in this reply. See how annoying they are?
Not really. RAG works very differently from a general (or generalist?) LLM.
RAG is vector search first. It encodes the query, finds the nearest vectors in the vector database, retrieves the fragments attached to those vectors, and then sends those fragments to the LLM for it to summarize them.
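In code, that flow is roughly this (a toy sketch; `embed`, `vector_db` and `llm` stand in for whichever encoder, vector store and model you actually use):

```python
def rag_answer(question, embed, vector_db, llm, k=5):
    query_vec = embed(question)               # encode the query
    hits = vector_db.nearest(query_vec, k=k)  # nearest-neighbour search in the vector DB
    fragments = [hit.text for hit in hits]    # the fragments attached to those vectors
    prompt = (
        "Summarize an answer to the question using only these fragments:\n\n"
        + "\n---\n".join(fragments)
        + f"\n\nQUESTION: {question}"
    )
    return llm(prompt)                        # the LLM sees fragments, never raw vectors
```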
A general LLM like Gemini or Claude or ChatGPT first produces an answer to a question, based on its training. This doesn't involve searching any external source at that point. Then after that answer is produced, the LLM can try to find sources that match what it has come up with.
You don't have to use vector search to implement RAG. You can use other search mechanisms instead or as well - whatever it takes to populate the context with the material most likely to help answer the user's question.
Two common alternatives to vector search:
1. Ask the LLM to identify key terms in the user's question and use those terms with a regular full-text search engine
2. If the content you are answering questions about is short enough - an employee handbook for example - just jam the whole thing in the context. Claude 3 supports 200,000 tokens and Gemini Pro 1.5 supports a million so this can actually be pretty effective.
I'd definitely call 1 (the FTS version) RAG. It's how Bing and Google Gemini and ChatGPT Browse work - they don't have a full vector index of the Web to work with (at least as far as I know), they use the model's best guess at an appropriate FTS query instead.
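A sketch of option 1, with `llm` and `fts_search` as placeholders for your model call and whatever full-text engine you have (SQLite FTS, Elasticsearch, etc.):

```python
def fts_rag(question, llm, fts_search, k=5):
    # Let the model pick search terms, then use an ordinary keyword index - no vectors needed.
    terms = llm(
        "Extract 3-6 good full-text search keywords for this question, "
        "comma separated, nothing else:\n" + question
    )
    docs = fts_search(terms, limit=k)
    context = "\n\n".join(docs)
    return llm(f"CONTEXT:\n{context}\n\nUSER: {question}")
```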
HyDE (Hypothetical Document Embeddings) is a related technique. Ask the model to generate a response with no context, then use this for semantic search against actual data, and respond by summarising these documents.
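Roughly (same placeholder `embed` / `vector_db` / `llm` pieces as above; not a reference implementation):

```python
def hyde_answer(question, embed, vector_db, llm, k=5):
    # HyDE: embed a *hypothetical* answer instead of the raw question,
    # then retrieve real documents near it and summarise those.
    hypothetical = llm(f"Write a short, plausible answer to: {question}")
    hits = vector_db.nearest(embed(hypothetical), k=k)
    context = "\n---\n".join(hit.text for hit in hits)
    return llm(
        "Using only the documents below, answer the question.\n\n"
        f"{context}\n\nQUESTION: {question}"
    )
```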
This is a generalization. These proprietary systems do different things at different times. With GPT-4o you can see little icons when RAG is in use or when code and tests are being used.
People, we have to stop talking about what we know as though it's all there is. Don't confuse our knowledge for understanding. Understanding only comes from repeatedly trying to prove our understanding wrong and learning how things truly are.
This. There is no point in arguing amongst ourselves when we're all wearing blindfolds and it is the zookeeper who decides what parts of the proprietary elephant we're allowed to touch.
It’s right there in the name - first you Retrieve relevant information (often a vector lookup) then you use it to Augment the prompt, then you Generate an answer.
It’s bloody useful if you can’t cram your entire proprietary code base into a prompt.
> RAG is a technology to augment search engines with vector search
Based on the words in the acronym, that seems exactly backwards. The words suggest that it is a generation technology which is merely augmented by retrieval (search), not a retrieval technology that is augmented by a generative technology.
The parent's description isn't quite correct. It's kinda sorta describing the implementation; RAG is often implemented via embeddings. In practice, you generally get better results with a mix of vector search and, e.g., TF-IDF.
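One way that mix can look, using scikit-learn's TfidfVectorizer for the lexical half and a placeholder `embed` function for the vector half; the 50/50 weighting is arbitrary:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def hybrid_search(query, docs, embed, alpha=0.5, k=5):
    # Lexical half: TF-IDF cosine similarity between the query and each doc.
    tfidf = TfidfVectorizer().fit(docs)
    lexical = cosine_similarity(tfidf.transform([query]), tfidf.transform(docs))[0]

    # Semantic half: embedding cosine similarity (embed is a placeholder
    # returning one vector per input text).
    doc_vecs = np.array(embed(docs))
    q_vec = np.array(embed([query]))[0]
    semantic = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9
    )

    # Blend the two scores and return the top-k documents.
    scores = alpha * semantic + (1 - alpha) * lexical
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]
```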
An example of RAG could be: you have a great LLM that was trained at the end of 2023. You want to ask it about something that happened in 2024. You're out of luck.
If you were using RAG, then that LLM would still be useful. You could ask it
> "When does the tiktok ban take effect?"
Your question would be converted to an embedding, and then compared against a database of other embeddings, generated from a corpus of up-to-date information and useful resources (wikipedia, news, etc).
Hopefully it finds a detailed article on the tiktok ban. The input to the LLM could then be something like:
> CONTEXT: <the text of the article>
> USER: When does the tiktok ban take effect?
The data retrieved by the search process allows for relevant in-context learning.
You have augmented the generation of an LLM by retrieving a relevant document.
RAG does a fact search and dumps content relevant to the query into the LLM’s context window.
It’s like referring to your notes before answering a question. If your notes are good you’re going to answer well (barring a weird brain fart.)
Hallucinating is still possible but extremely unlikely, and a post-generation step can check for that and drop responses containing hallucinations.
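One crude form of that post-generation check - it only verifies that numbers and capitalized names in the answer also appear in the retrieved context, so it's a heuristic, not a guarantee:

```python
import re

def looks_grounded(answer, context):
    # Every number or capitalized name in the answer must appear somewhere in the context.
    claims = re.findall(r"\b(?:[A-Z][a-z]+|\d[\d,.%]*)\b", answer)
    haystack = context.lower()
    return all(claim.lower() in haystack for claim in claims)

# Drop (or regenerate) any response where looks_grounded(response, context) is False.
```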
Yup, I have this as the default for a Discord bot I run in a stocks chat. If the bot is unsure, the LLM says so and the bot keks you instead of making something up.
RAG, or Retrieval Augmented Generation, has been a buzzword in the AI community for some time now. While the term has gained significant traction, its interpretation varies widely among practitioners. Some argue that it should be "Reference Augmented Generation," while others insist on "Retrieval Augmented Generation." However, the real question is, does the terminology really matter, or should we focus on the underlying concepts and their applications?
The idea behind RAG is to enhance AI-powered search by leveraging the vast amount of information available in documents. It's a noble goal, but the acronym itself falls short in capturing the essence of what we're trying to achieve. It's like trying to describe the entire field of computer science with a single term - it's just not feasible.
But here's the thing: AI-powered search is not just a concept anymore; it's a reality. I've been working on this problem since 2019, and I can tell you from experience that it works. By integrating OpenAI's GPT-2 with Solr, I was able to create a search engine that could understand natural language queries and provide highly relevant results. And this is just the beginning.
The potential applications of AI-powered search are vast. From Playwright to FFmpeg, I've been applying LLMs to various services, and the results have been nothing short of impressive. But to truly unlock the potential of this technology, we need to think beyond the confines of a single acronym.
That's why I propose a new term: RAISE - Retrieval Augmented Intelligent Search Engine. This term captures the essence of what we're trying to achieve: a search engine that can understand the intent behind a query, retrieve relevant information from a vast corpus of documents, and provide intelligent, contextual responses.
But more importantly, RAISE is not just a term; it's a call to action. It's a reminder that we need to raise the bar in AI-powered search, to push the boundaries of what's possible, and to create tools that can truly revolutionize the way we access and interact with information.
So let's not get bogged down in terminology debates. Instead, let's focus on the real challenge at hand: building intelligent search engines that can understand, retrieve, and respond to our queries in ways that were once thought impossible. And who knows, maybe one day we'll look back at this moment and realize that RAISE was just the beginning of something much bigger.
RAG isn’t just search, though. I could be using retrieval for generating content, but I think RAISE kind of restricts the term to question answering, which is not the only application.
There's no particular requirement for a RAG application to use vector search or embeddings, and there's no requirement that semantic similarity is in play (i.e. "retriev[ing] documents...that do not contain the terms in the search query"). Fundamentally RAG is just doing some retrieval, and then doing some generation. While the traditional implementation definitely involves vector search and LLMs, there are plenty of other approaches; ultimately at anything beyond toy scale it sort of begins to converge with the long history of traditional search problems rather than being its own distinct thing.
> It lets them retrieve documents or document fragments that do not contain the terms in the search query, but that are conceptually similar (according to the encoder used).
I played with Latent Semantic Indexing[1] as a teen back in the early 2000s, which also does kinda that. I haven't read much on RAG, and I'm assuming it's some next-level stuff, but are there any relations or similarities?
It is similar to the retrieval stage, yes. The concept of RAG is to then feed the retrieved fragments to a model so it can generate an answer that takes them into account.
> It lets them retrieve documents or document fragments that do not contain the terms in the search query, but that are conceptually similar (according to the encoder used).
That’s what GPT does. Or rather, someone hearing about RAG for the first time would have trouble distinguishing what you said from their understanding of how GPTs are already trained.
Plus RAG isn't just referenced result augmentation.
It isn't just adding a vector-based DB.

It isn't about less hallucination. In fact, it doesn't even mandate that results should be referenced or come from a vector DB.

RAG's point is to remove a limit LLMs have on their own: they are restricted to their training data as their source of information.

A RAG system can be queried for information an LLM doesn't have any knowledge of. The LLM part of a RAG system can be instructed to use all sorts of information retrieval, such as making a web search or checking the current stock market value of particular tickers.
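A bare-bones version of that kind of dispatch; `web_search` and `get_quote` are placeholders for whatever retrieval tools you actually wire up:

```python
def answer_with_tools(question, llm, web_search, get_quote):
    # The retrieval step doesn't have to be a vector DB: let the model pick a tool,
    # run it, and feed the result back in as context.
    choice = llm(
        "Pick ONE tool for this question and reply with just its name.\n"
        "Tools: web_search, stock_quote, none\n"
        f"Question: {question}"
    ).strip()

    if choice == "web_search":
        context = web_search(question)
    elif choice == "stock_quote":
        ticker = llm(f"Reply with just the ticker symbol mentioned in: {question}").strip()
        context = get_quote(ticker)
    else:
        context = ""

    return llm(f"CONTEXT: {context}\n\nUSER: {question}")
```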
The article is nonetheless interesting, as it touches on the beauty of LLMs being in their input interface rather than in their ability to output aggregated content that (usually) reads beautifully.
RAG is more than what the author says it is. In fact, it is closer to what the author says they actually want.

RAG covers what the author wants more comprehensively. It can be instructed to provide relevant references, to always answer a particular way, and, most importantly, it can be asked to perform (live) search engine queries to find the most up-to-date information. The example in the article is OK given that the recipe is pretty old and has very likely been mined during training.
What about:
> I missed the games Lakers played from the 1st to 21st of April. Could you give me the list of opponents and scores please.
An LLM's input interface alone won't help with that; RAG can make LLMs answer it.