FWIW, having written a simple RAG system from "scratch" (meaning not using frameworks or api calls), it's not more complicated than doing it this way with langchain etc.
This post is mostly about plumbing. It's probably the right way to do it if it needs to be scaled. But for learning, it obscures what is essentially simple stuff going on behind the scenes.
For serious implementations, frameworks are not very helpful, even LangChain. All the components provide good SDKs/APIs; having a bunch of "integrations" doesn't add any real value.
If you know what you want to build, building from scratch is easier than you think. If you're tinkering on the weekend, then maybe the frameworks are helpful.
Yeah, as soon as we write the word 'thread' or start thinking about LLM API concurrency control across many user requests, every framework we've tried becomes a wall instead of an accelerator. For a single-user demo video on Twitter or a low-traffic Streamlit POC that gets a repo lots of stargazers, they work quite well, and that's not far from what someone needs for an internal project with a small userbase. But once this is supposed to be infra for production-grade software, the tools we have tried so far still prioritize features over being a foundation.
My favorite example is the asana loader[0] for llama-index. It's literally just the most basic wrapper around the Asana SDK to concatenate some strings.
What caught my attention in this article is the section named "Cold Start", where questions are generated based on a provided context.
I think it is a good way to cheaply generate a Q&A dataset that can later be used to fine-tune a model.
But the problem is that some of the generated questions and answers are of poor quality. All of the generated examples have issues:
- "What is the context discussing about?" - which context?
- "The context does not provide information on what Ray Tune is." - Not an answer
- "The context does not provide information on what external library integrations are." - same as before
I could only think of manual review to remove these noisy questions. Any ideas on how to improve this QA generation? I've tried it before, but with paltry results.
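One cheap pre-filter before any manual review is a regex pass over the obvious refusal patterns (a sketch; the patterns and the shape of qa_pairs are assumptions based on the failure modes above):

    import re

    # Phrases that typically signal a useless synthetic question or answer
    # (assumed list; extend it as new failure modes show up).
    BAD_PATTERNS = [
        r"\bthe context\b",                        # "What is the context discussing about?"
        r"does not provide (information|enough)",  # non-answers
        r"\bnot mentioned\b",
    ]

    def is_noisy(question: str, answer: str) -> bool:
        """Return True if a generated Q&A pair looks like noise."""
        text = f"{question} {answer}".lower()
        return any(re.search(p, text) for p in BAD_PATTERNS)

    # Hypothetical output shaped like the cold-start synthetic data.
    qa_pairs = [
        {"question": "What is the context discussing about?", "answer": "..."},
        {"question": "What is Ray Tune?",
         "answer": "The context does not provide information on what Ray Tune is."},
    ]
    clean_pairs = [qa for qa in qa_pairs if not is_noisy(qa["question"], qa["answer"])]
    # Both sample pairs above get dropped; anything that survives can go to
    # an LLM-as-judge or a quick manual pass.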
I recently quit my job to build specialized tooling in this space. We’re broadly focusing on eval in general, but are starting with high quality question and answer generation for testing these kinds of RAG pipelines. It’s surprisingly hard!
Sounds very interesting. I am building an open-source LLM building platform (agenta.ai) and looking for eval approaches to integrate for our users. Do you already have a product/API that we could use?
No, you don't need a vector database. You can get OK results by prompting "give me ten search terms that are relevant to this question", then running those searches against a regular full-text search engine and pasting those results back into the LLM as context along with the original question.
You're likely to get better results from vector-based semantic search though, just because it takes you beyond needing exact matches on search terms.
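A minimal sketch of that no-vector-DB flow (the toy substring search stands in for whatever full-text engine you already run, and the openai>=1.0 client is just one way to call a model):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    DOCS = [
        "Ray Tune is a library for hyperparameter tuning at any scale.",
        "Ray Serve lets you deploy models behind an HTTP endpoint.",
    ]  # stand-in corpus; in practice this is your full-text search engine

    def full_text_search(term: str) -> list[str]:
        """Crude stand-in for Elasticsearch / SQLite FTS / Postgres tsvector."""
        return [d for d in DOCS if term.lower() in d.lower()]

    def answer(question: str) -> str:
        # 1. Ask the model for search terms relevant to the question.
        terms = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user",
                       "content": "Give me ten search terms, one per line, "
                                  f"relevant to this question:\n{question}"}],
        ).choices[0].message.content.splitlines()

        # 2. Run the searches and paste the hits back in as context.
        context = "\n".join(hit for t in terms for hit in full_text_search(t.strip()))

        # 3. Answer the original question with that context.
        return client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user",
                       "content": f"Context:\n{context}\n\nQuestion: {question}"}],
        ).choices[0].message.content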
Vector is better for some use cases (open-domain, more conversational data) and term-based search is better for others (closed-domain, more keyword-based).
I've found that internal enterprise projects tend to be very keyword based, and vector search often produces weird, head-scratcher results that users hate - whereas term-based search does a better job of capturing the right terms, if you do the proper synonym/abbreviation expansions.
That said, I use them both, usually with vector search as a fallback after the initial keyword-based RAG pass.
The context length is limited: for gpt-3.5 it's 4k tokens, and there are other offerings that go up to 100k (Claude). 100k tokens is roughly one book, but it's priced steeply per call. It's often wiser and cheaper to Retrieve the context from your text and Augment your query to the LLM to Generate more contextual answers. That's the reason for the name Retrieval Augmented Generation (RAG).
For retrieving, you'd need a vector database (for the similarity comparison you can use semantic search based on vector embeddings).
Minor note: you only need a vector database if you have so many possible inputs that linear retrieval is too slow.
Arguably, for many use cases (e.g. searching through a document with ~200 passages), loading embeddings in memory and running a simple linear search would be fast enough.
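For scale, the linear scan is a few lines of numpy over precomputed embeddings (random vectors below just to make the sketch self-contained; sizes and dimensions are arbitrary):

    import numpy as np

    # doc_embeddings: (n_docs, dim) matrix from your embedding model,
    # query_embedding: (dim,) vector for the user's question.
    # Random stand-ins here so the sketch runs on its own.
    rng = np.random.default_rng(0)
    doc_embeddings = rng.normal(size=(200, 384))
    query_embedding = rng.normal(size=384)

    def top_k(query, docs, k=5):
        """Cosine similarity via normalized dot products, then a linear top-k."""
        docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
        query_n = query / np.linalg.norm(query)
        scores = docs_n @ query_n
        return np.argsort(scores)[::-1][:k]  # indices of the best passages

    print(top_k(query_embedding, doc_embeddings))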
Yeah, what you mentioned might be true. Currently our understanding of how LLMs really work behind the scenes is limited. For example, there was recent research[1] showing that an LLM's accuracy is better when the context is placed at the beginning of the prompt rather than at the end. So it's mostly trial and error to figure out what works best for you. You can use FAISS or similar to keep the embeddings in memory instead of a full-fledged vector DB, but pgvector is a convenient plugin if you already have a Postgres instance running.
Zilliz just published an article comparing QPS (queries per second) for pgvector vs. Milvus. The results are clear: Milvus, a database designed from the ground up for handling vector indexes, outperformed in terms of speed and latency. Dive into the details here. https://zilliz.com/blog/getting-started-pgvector-guide-devel...
Full disclosure, I just joined Zilliz this week as a Dev Advocate.
What I mentioned doesn't depend on how LLMs work, the end result is the same (retrieving useful inputs to pass to your LLM).
Just meant that a lot of people can just do this in-memory or in ad-hoc ways if they're not too latency constrained.
I think unless you actually need a vector DB, you definitely shouldn't use one.
A vector store can help reduce the time it takes to retrieve the most similar hit. I used FAISS as a local vector store quite a bit to retrieve vectors fast, though I had 1.5 million vectors to work through.
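For reference, the local FAISS setup is only a handful of lines (a sketch with random float32 vectors standing in for real embeddings; an exact flat index is shown, while approximate indexes like IVF or HNSW trade a little accuracy for speed at larger scales):

    import faiss
    import numpy as np

    dim = 768
    rng = np.random.default_rng(0)
    vectors = rng.random((100_000, dim), dtype="float32")  # stand-in embeddings
    faiss.normalize_L2(vectors)                            # so inner product == cosine

    index = faiss.IndexFlatIP(dim)  # exact nearest-neighbor search
    index.add(vectors)

    query = rng.random((1, dim), dtype="float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 10)  # top-10 most similar vectors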
Interesting. I thought anything >1 million would need a vector DB to scale in production. What was your machine config for running FAISS? Also, did you plan for redundancy, or was it just FAISS as a service on a VM?
Others have chimed in as well, but I'll mention that we've been live with our product, for all users, for several months now doing RAG with OpenAI vector embeddings stored in Redis.
We then just fetch the vectors related to a customer's schema into memory (the largest set is ~200MB) and run cosine similarity in a few ms in Go (handwritten, ~25 lines of code), and then we've got our top N things to place in our prompt.
Primitive? You betcha. Works extremely well for our entire customer base? Yup. You definitely don't need a Vector DB unless you have an enormous amount of vectors. For us it means having to run our own Redis clusters, but we know how to do that, and so we don't need to involve another vendor.
RAG is a very useful flow but I agree the complexity is often overwhelming, esp as you move from a toy example to a real production deployment. It's not just choosing a vector DB (last time I checked there were about 50), managing it, deciding on how to chunk data, etc. You also need to ensure your retrieval pipeline is accurate and fast, ensuring data is secure and private, and manage the whole thing as it scales. That's one of the main benefits of using Vectara (https://vectara.com; FD: I work there) - it's a GenAI platform that abstracts all this complexity away, and you can focus on building your application.
You need a vector db because all the vector db companies need customers...
You definitely do need information retrieval. It just shouldn't be limited to vector dbs. Unfortunately vector db companies and the VCs that back them have flooded the internet with propaganda suggesting vector db is the only choice.
https://colinharman.substack.com/p/beware-tunnel-vision-in-a...
For most serious use cases, you'll have far too much data to fit into 1 (or several) inference contexts.
Petroni 2020 got pretty far with TF-IDF IIRC, for a related but slightly different task. Still, I've got to believe the semantic search element provided by vector DBs is going to add a lot.
Kudos to the team for a very detailed notebook going into things like pipeline evaluation wrt performance and costs etc. Even if we ignore the framework specific bits, it is a great guide to follow when building RAG systems in production.
We have been building RAG systems in production for a few months and have been tinkering with different strategies to get the most performance out of these pipelines. As others have pointed out, a vector database may not be the right strategy for every problem. Similarly, there are things like the "lost in the middle" problem (https://arxiv.org/abs/2307.03172) that one may have to deal with. We put together our learnings building and optimizing these pipelines in a post at https://llmstack.ai/blog/retrieval-augmented-generation.
https://github.com/trypromptly/LLMStack is a low-code platform we open-sourced recently that ships these RAG pipelines out of the box with some app templates if anyone wants to try them out.
While you don't strictly "need" a vector db to do RAG, as others have pointed out, vector databases excel when you're dealing with natural language - which is ambiguous.
This will be the case when you're exposing an interface to end users that they can submit arbitrary queries to - such as "how do I turn off reverse braking".
By converting the user's query to vectors before sending it to your vector store, you're getting at the user's actual intent behind their words - which can help you retrieve more accurate context to feed to your LLM when asking it to perform a chat completion, for example.
This is also important if you're dealing with proprietary or non-public data that a search engine can't see. Context-specific natural language queries are well suited to vector databases.
Ehhh, I don't think you're telling the whole story here. Vectors aren't really a complete solution here either. Consider a use case like ours where we need to support extremely vague inputs (since users give us extremely vague inputs): https://twitter.com/_cartermp/status/1700586154599559464/
Cosine similarity across vectors isn't enough here, but when combined with an LLM we get the right behavior. As you mention, without the vector store reducing the size of data we pass to the LLM, hallucinations happen more often. It's a balancing act.
The other nasty one to consider is when people write "how do I not turn off reverse braking". Again, a comparison will show that as very similar to your input, but it's really the opposite. And so if implementers aren't careful to account for that, they've now got a nasty subtle bug on their hands.
A neat way of dealing with sparse input is to take the entire chat history (if any) into account and ask the LLM to expand the query so that the semantic search has more to work with. Generally, using the LLM to add more data to the user query based on context, previous conversation, or just having it produce a fake document all together based on the sparse query can work well to improve the vectors you use in the similarity search. A concern with this strategy is latency, as you need to add another generation hop before you can query the vector db.
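A sketch of that extra expansion hop (the prompt wording and model are placeholders; the point is that the expanded text, not the raw query, is what gets embedded for the similarity search):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    def expand_query(user_query: str, chat_history: list[str]) -> str:
        """Ask the LLM to rewrite a sparse query into something richer
        before it gets embedded and sent to the vector DB."""
        history = "\n".join(chat_history)
        prompt = (
            "Rewrite the user's question as a standalone, detailed query, "
            "using the conversation so far to fill in missing context.\n\n"
            f"Conversation:\n{history}\n\nQuestion: {user_query}"
        )
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    # expand_query() adds one extra completion per request, which is the
    # latency cost mentioned above.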
My question is: if I want to use LLM to help me sift through a large amount of structured data, say for example all the logs for a bunch of different applications from a certain cloud environment, each with their own idiosyncrasies and specific formats (many GBs of data), can the RAG pattern be useful here?
Some of my concerns:
1) Is sentence embedding using an off-the-shelf embedding model going to capture the "meaning" of my logs? My answer is "probably not". For example, if a portion of my logs is in this format
Will I be able to get meaningful embeddings that satisfy a query such as "what components in my system exhibited an anomalously high latency lately?" (this is just an example among many different queries I’d have)
Based on the little I know, it seems to me off-the-shelf embeddings wouldn't be able to match the embedding of my query with the embeddings for the relevant log lines, given the complexity of this task.
2) Is it going to be even feasible (cost/performance-wise) to use embeddings when one has a firehose of data coming through, or is it better suited for a mostly-static corpus of data (e.g. your typical corporate documentation or product catalog)?
I know that I can achieve something similar with a Code Interpreter-like approach, so in theory I could build a multi-step reasoning agent that starting from my query and the data would try to (1) discover the schema and then (2) crunch the data to try to get to my answer, but I don't know how scalable this approach would effectively be.
Just to clarify - are you wanting the LLM itself to identify what an "anomalous latency" would be based on the data itself? If so, then I don't think this will help you at all until we can actually fit the logs into the context.
What RAG here is doing is using embeddings and a vector store to identify close pieces of information, for example "in this django project add a textfield" will be very close to documentation in the django docs that say "textfield", and it will then add that to the prompt so the LLM has the relevant docs in its context.
The problem is that you'll need a heuristic to identify at least "potentially anomalous" and even then you'll still have to make sure there's enough context for it to know "is this a normal daily fluctuation".
A multi-step agent is definitely what you want; you could have it build a SQL query itself. For example, given "were there any high-latency requests yesterday?" it may identify that it should filter on time, and possibly design the query to determine what counts as "high".
---
At the moment I don't think it's well suited to identifying when the "latency is abnormally high". However, if you have some other system/human identify heuristics to feed to the LLM, it may then be able to at least answer the query.
I was trying to understand if there is an opportunity to introduce some of this technology to solve "anomaly detection" on large amounts of structured data, where "anomaly" might be an incredibly overloaded term (it might imply a performance regression, a security issue, etc.). That is a business need I have today.
It seems that what is possible today is an assistant that can aid a user to get to these answers faster (by, for instance, suggesting a SQL query based on the schema, etc). Again, roughly the equivalent of what Code Interpreter does, just without the local environment limitations.
From your questions it looks like you are only interested in the R part. RAG implies the retrieval step is then used to augment a user prompt.
To answer 1, a good heuristic would be "can a human reasonably familiar with the terminology answer questions about the meaning?" If a human would need extra info to make sense of your data then so would an LLM.
This is where RAG typically comes in. For example, if you had documentation about ClassName and FunctionName, a retrieval model might be able to find the most likely candidates based on a file containing full definitions of these classes and functions, then pass that info into the LLM appended to your query.
For 2: it depends whether the firehose is the query or the data. If it's queries coming in very quickly, you might be able to keep up as long as the volume isn't too high, since you can batch requests and get responses fairly quickly.
If the firehose is the data going into the vector DB, then you might have some difficulty inserting and indexing the data fast enough.
For this kind of structured data and this kind of structured queries, it may be more useful to stick to a data query language (SQL, or some analytics engine).
Thanks. I wonder if a reasonable approach could then be to first insert the data into a data-warehouse-like database suitable for analytics, and then use an LLM application to (1) generate SQL queries that could answer my question, reasoning about the schema, and (2) potentially summarize the output result set. It could still result in a significant boost in productivity.
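A rough sketch of that two-step flow (the table schema, model, and prompts are invented, and generated SQL should obviously be validated and run through a sandboxed, read-only executor before it touches a real warehouse):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    SCHEMA = """
    CREATE TABLE request_logs (
        ts TIMESTAMP, component TEXT, latency_ms DOUBLE, status_code INT
    );
    """  # hypothetical warehouse table

    def ask(question: str, run_sql) -> str:
        # 1. Have the LLM draft a SQL query from the schema and the question.
        sql = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": f"Schema:\n{SCHEMA}\n"
                                  f"Write a single SQL query answering: {question}\n"
                                  "Return only SQL."}],
        ).choices[0].message.content

        rows = run_sql(sql)  # run_sql is your own sandboxed, read-only executor

        # 2. Summarize the result set in plain language.
        return client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": f"Question: {question}\nSQL: {sql}\n"
                                  f"Rows: {rows}\nSummarize the answer."}],
        ).choices[0].message.content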
Indeed, that is a promising path. Fundamentally you still want to rely on a human to figure out what analytics are interesting to consider, then have the LLM act as a helper that generates queries corresponding to those analytics.
Wow this was indeed super comprehensive. A few things I noticed:
- In the cold start section, a couple of the synthetic_data responses say 'context does not provide info..'
- It's strange that retrieval_score would decrease while quality_score increases at the higher chunk sizes. Could this just be that the retrieved chunk is starting to be larger than the reference?
- GPT-3.5 pricing looks out of date; it's currently $0.0015 per 1K input tokens for the 4k model
- Interesting that pricing needs to be shown on a log scale. GPT-4 is 46x more expensive than Llama 2 70B for a ~0.3 score increase. Training a simple classifier seems like a great way to handle this.
- I wonder how stable the quality_score assessment is given the exact same configuration. I guess the score differences between falcon-180b, llama-2-70b and gpt-3.5 are insignificant?
Is there a similarly comprehensive deep dive into chunking methods anywhere? Especially for queries that require multiple chunks to answer at all - producing more relevant chunks would have a massive impact on response quality I imagine.
Anyscale consistently posts great projects. Very cool to see the cost comparison and quality comparison. Not surprising to see that OSS is less expensive, but also rated as slightly lower quality than gpt-3.5-turbo.