FWIW, having written a simple RAG system from "scratch" (meaning not using frameworks or api calls), it's not more complicated than doing it this way with langchain etc.
This post is mostly about plumbing. It's probably the right way to do it if it needs to be scaled. But for learning, it obscures what is essentially simple stuff going on behind the scenes.
For serious implementations, frameworks are not very helpful, even LangChain. All the components provide good SDKs/APIs; having a bunch of "integrations" doesn't add any real value.
If you know what you want to build, building from scratch is easier than you think. If you're tinkering on the weekend, then maybe the frameworks are helpful.
Yeah, as soon as we write the word 'thread' or start thinking about LLM API concurrency control across many user requests, every framework we've tried becomes a wall instead of an accelerator. For a single-user demo video on Twitter or a low-traffic Streamlit POC that gets a repo lots of stargazers, they work quite well, and that's not far from what someone needs for an internal project with a small userbase. But once this is supposed to be infra for production-grade software, the tools we have tried so far still prioritize features over being a foundation.
My favorite example is the asana loader[0] for llama-index. It's literally just the most basic wrapper around the Asana SDK to concatenate some strings.
What caught my attention in this article is the section named "Cold Start", where questions are generated based on a provided context.
I think it is a good way to cheaply generate a Q&A dataset that can later be used to fine-tune a model.
But the problem is that some of the generated questions and answers are of poor quality. All of the generated examples have issues:
- "What is the context discussing about?" - which context?
- "The context does not provide information on what Ray Tune is." - Not an answer
- "The context does not provide information on what external library integrations are." - same as before
I could only think of manual review to remove these noisy questions. Any ideas on how to improve this QA generation? I've tried it before, but with paltry results.
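One cheap pre-filter before any manual review is a regex pass over the obvious refusal patterns (a sketch; the patterns and the shape of qa_pairs are assumptions based on the failure modes above):

    import re

    # Phrases that typically signal a useless synthetic question or answer
    # (assumed list; extend it as new failure modes show up).
    BAD_PATTERNS = [
        r"\bthe context\b",                        # "What is the context discussing about?"
        r"does not provide (information|enough)",  # non-answers
        r"\bnot mentioned\b",
    ]

    def is_noisy(question: str, answer: str) -> bool:
        """Return True if a generated Q&A pair looks like noise."""
        text = f"{question} {answer}".lower()
        return any(re.search(p, text) for p in BAD_PATTERNS)

    # Hypothetical output shaped like the cold-start synthetic data.
    qa_pairs = [
        {"question": "What is the context discussing about?", "answer": "..."},
        {"question": "What is Ray Tune?",
         "answer": "The context does not provide information on what Ray Tune is."},
    ]
    clean_pairs = [qa for qa in qa_pairs if not is_noisy(qa["question"], qa["answer"])]
    # Both sample pairs above get dropped; anything that survives can go to
    # an LLM-as-judge or a quick manual pass.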
I recently quit my job to build specialized tooling in this space. We’re broadly focusing on eval in general, but are starting with high quality question and answer generation for testing these kinds of RAG pipelines. It’s surprisingly hard!
Sounds very interesting. I am building an open-source LLM building platform (agenta.ai) and looking for eval approaches to integrate for our users. Do you already have a product/API that we could use?
No, you don't need a vector database. You can get OK results by prompting "give me ten search terms that are relevant to this question", then running those searches against a regular full-text search engine and pasting those results back into the LLM as context along with the original question.
You're likely to get better results from vector-based semantic search though, just because it takes you beyond needing exact matches on search terms.
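A minimal sketch of that no-vector-DB flow (the toy substring search stands in for whatever full-text engine you already run, and the openai>=1.0 client is just one way to call a model):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    DOCS = [
        "Ray Tune is a library for hyperparameter tuning at any scale.",
        "Ray Serve lets you deploy models behind an HTTP endpoint.",
    ]  # stand-in corpus; in practice this is your full-text search engine

    def full_text_search(term: str) -> list[str]:
        """Crude stand-in for Elasticsearch / SQLite FTS / Postgres tsvector."""
        return [d for d in DOCS if term.lower() in d.lower()]

    def answer(question: str) -> str:
        # 1. Ask the model for search terms relevant to the question.
        terms = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user",
                       "content": "Give me ten search terms, one per line, "
                                  f"relevant to this question:\n{question}"}],
        ).choices[0].message.content.splitlines()

        # 2. Run the searches and paste the hits back in as context.
        context = "\n".join(hit for t in terms for hit in full_text_search(t.strip()))

        # 3. Answer the original question with that context.
        return client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user",
                       "content": f"Context:\n{context}\n\nQuestion: {question}"}],
        ).choices[0].message.content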
Vector is better for some use cases (open-domain, more conversational data) and term-based search is better for others (closed-domain, more keyword-based).
I've found that internal enterprise projects tend to be very keyword based, and vector search often produces weird, head-scratcher results that users hate - whereas term-based search does a better job of capturing the right terms, if you do the proper synonym/abbreviation expansions.
That said, I use them both, usually with vector search as a fallback after the initial keyword-based RAG pass.
The context length is limited: for gpt-3.5 it's 4k tokens, and there are other offerings that go up to 100k (Claude). 100k tokens is roughly one book, but it's priced steeply per call. It's often wiser and cheaper to Retrieve the context from your text and Augment your query to the LLM to Generate more contextual answers. That's the reason for the name Retrieval Augmented Generation (RAG).
For retrieving, you'd need a vector database (for the similarity comparison you can use semantic search based on vector embeddings).
Minor note: you only need a vector database if you have so many possible inputs that linear retrieval is too slow.
Arguably, for many use cases (e.g. searching through a document with ~200 passages), loading embeddings in memory and running a simple linear search would be fast enough.
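For scale, the linear scan is a few lines of numpy over precomputed embeddings (random vectors below just to make the sketch self-contained; sizes and dimensions are arbitrary):

    import numpy as np

    # doc_embeddings: (n_docs, dim) matrix from your embedding model,
    # query_embedding: (dim,) vector for the user's question.
    # Random stand-ins here so the sketch runs on its own.
    rng = np.random.default_rng(0)
    doc_embeddings = rng.normal(size=(200, 384))
    query_embedding = rng.normal(size=384)

    def top_k(query, docs, k=5):
        """Cosine similarity via normalized dot products, then a linear top-k."""
        docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
        query_n = query / np.linalg.norm(query)
        scores = docs_n @ query_n
        return np.argsort(scores)[::-1][:k]  # indices of the best passages

    print(top_k(query_embedding, doc_embeddings))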
Yeah, what you mentioned might be true. Currently our understanding of how LLMs really work behind the scenes is limited. For example, there was recent research[1] showing that an LLM's accuracy is better when the context is placed at the beginning of the prompt rather than at the end. So it's mostly trial and error to figure out what works best for you. You can use FAISS or similar to keep the embeddings in memory instead of a full-fledged vector DB, but pgvector is a convenient plugin if you already have a Postgres instance running.
Zilliz just published an article comparing QPS (queries per second) for pgvector vs. Milvus. The results are clear: Milvus, a database designed from the ground up for handling vector indexes, outperformed in terms of speed and latency. Dive into the details here. https://zilliz.com/blog/getting-started-pgvector-guide-devel...
Full disclosure, I just joined Zilliz this week as a Dev Advocate.
What I mentioned doesn't depend on how LLMs work, the end result is the same (retrieving useful inputs to pass to your LLM).
Just meant that a lot of people can just do this in-memory or in ad-hoc ways if they're not too latency constrained.
I think unless you actually need a vector DB, you definitely shouldn't use one.
A vector store can help reduce the time it takes to retrieve the most similar hit. I used FAISS as a local vector store quite a bit to retrieve vectors fast, though I had 1.5 million vectors to work through.
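For reference, the local FAISS setup is only a handful of lines (a sketch with random float32 vectors standing in for real embeddings; an exact flat index is shown, while approximate indexes like IVF or HNSW trade a little accuracy for speed at larger scales):

    import faiss
    import numpy as np

    dim = 768
    rng = np.random.default_rng(0)
    vectors = rng.random((100_000, dim), dtype="float32")  # stand-in embeddings
    faiss.normalize_L2(vectors)                            # so inner product == cosine

    index = faiss.IndexFlatIP(dim)  # exact nearest-neighbor search
    index.add(vectors)

    query = rng.random((1, dim), dtype="float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 10)  # top-10 most similar vectors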
Interesting. I thought anything >1 million would need a vector DB to scale in production. What was your machine config for running FAISS? Also, did you plan for redundancy, or was it just FAISS as a service on a VM?
Others have chimed in as well, but I'll mention that we've been live with our product, for all users, for several months now doing RAG with OpenAI vector embeddings stored in Redis.
We then just fetch the vectors related to a customer's schema into memory (the largest set is ~200MB) and run cosine similarity in a few ms in Go (handwritten, ~25 lines of code), and then we've got our top N things to place in our prompt.
Primitive? You betcha. Works extremely well for our entire customer base? Yup. You definitely don't need a Vector DB unless you have an enormous amount of vectors. For us it means having to run our own Redis clusters, but we know how to do that, and so we don't need to involve another vendor.
RAG is a very useful flow but I agree the complexity is often overwhelming, esp as you move from a toy example to a real production deployment. It's not just choosing a vector DB (last time I checked there were about 50), managing it, deciding on how to chunk data, etc. You also need to ensure your retrieval pipeline is accurate and fast, ensuring data is secure and private, and manage the whole thing as it scales. That's one of the main benefits of using Vectara (https://vectara.com; FD: I work there) - it's a GenAI platform that abstracts all this complexity away, and you can focus on building your application.
You need a vector db because all the vector db companies need customers...
You definitely do need information retrieval. It just shouldn't be limited to vector dbs. Unfortunately vector db companies and the VCs that back them have flooded the internet with propaganda suggesting vector db is the only choice.
https://colinharman.substack.com/p/beware-tunnel-vision-in-a...
For most serious use cases, you'll have far too much data to fit into 1 (or several) inference contexts.
Petroni 2020 got pretty far with TF-IDF IIRC, for a related but slightly different task. Still, I've got to believe the semantic search element provided by vector DBs is going to add a lot.
Kudos to the team for a very detailed notebook going into things like pipeline evaluation wrt performance and costs etc. Even if we ignore the framework specific bits, it is a great guide to follow when building RAG systems in production.
We have been building RAG systems in production for a few months and have been tinkering with different strategies to get the most performance out of these pipelines. As others have pointed out, a vector database may not be the right strategy for every problem. Similarly, there are things like the "lost in the middle" problem (https://arxiv.org/abs/2307.03172) that one may have to deal with. We put together our learnings building and optimizing these pipelines in a post at https://llmstack.ai/blog/retrieval-augmented-generation.
https://github.com/trypromptly/LLMStack is a low-code platform we open-sourced recently that ships these RAG pipelines out of the box with some app templates if anyone wants to try them out.
While you don't strictly "need" a vector db to do RAG, as others have pointed out, vector databases excel when you're dealing with natural language - which is ambiguous.
This will be the case when you're exposing an interface to end users that they can submit arbitrary queries to - such as "how do I turn off reverse braking".
By converting the user's query to vectors before sending it to your vector store, you're getting at the user's actual intent behind their words - which can help you retrieve more accurate context to feed to your LLM when asking it to perform a chat completion, for example.
This is also important if you're dealing with proprietary or non-public data that a search engine can't see. Context-specific natural language queries are well suited to vector databases.
Ehhh, I don't think you're telling the whole story here. Vectors aren't really a complete solution here either. Consider a use case like ours where we need to support extremely vague inputs (since users give us extremely vague inputs): https://twitter.com/_cartermp/status/1700586154599559464/
Cosine similarity across vectors isn't enough here, but when combined with an LLM we get the right behavior. As you mention, without the vector store reducing the size of data we pass to the LLM, hallucinations happen more often. It's a balancing act.
The other nasty one to consider is when people write "how do I not turn off reverse braking". Again, a comparison will show that as very similar to your input, but it's really the opposite. And so if implementers aren't careful to account for that, they've now got a nasty subtle bug on their hands.
A neat way of dealing with sparse input is to take the entire chat history (if any) into account and ask the LLM to expand the query so that the semantic search has more to work with. Generally, using the LLM to add more data to the user query based on context, previous conversation, or just having it produce a fake document all together based on the sparse query can work well to improve the vectors you use in the similarity search. A concern with this strategy is latency, as you need to add another generation hop before you can query the vector db.
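A sketch of that extra expansion hop (the prompt wording and model are placeholders; the point is that the expanded text, not the raw query, is what gets embedded for the similarity search):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    def expand_query(user_query: str, chat_history: list[str]) -> str:
        """Ask the LLM to rewrite a sparse query into something richer
        before it gets embedded and sent to the vector DB."""
        history = "\n".join(chat_history)
        prompt = (
            "Rewrite the user's question as a standalone, detailed query, "
            "using the conversation so far to fill in missing context.\n\n"
            f"Conversation:\n{history}\n\nQuestion: {user_query}"
        )
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    # expand_query() adds one extra completion per request, which is the
    # latency cost mentioned above.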
My question is: if I want to use LLM to help me sift through a large amount of structured data, say for example all the logs for a bunch of different applications from a certain cloud environment, each with their own idiosyncrasies and specific formats (many GBs of data), can the RAG pattern be useful here?
Some of my concerns:
1) Is sentence embedding using an off-the-shelf embedding model going to capture the "meaning" of my logs? My answer is "probably not". For example, if a portion of my logs is in this format
Will I be able to get meaningful embeddings that satisfy a query such as "what components in my system exhibited an anomalously high latency lately?" (this is just an example among many different queries I’d have)
Based on the little I know, it seems to me off-the-shelf embeddings wouldn't be able to match the embedding of my query with the embeddings for the relevant log lines, given the complexity of this task.
2) Is it going to be even feasible (cost/performance-wise) to use embeddings when one has a firehose of data coming through, or is it better suited for a mostly-static corpus of data (e.g. your typical corporate documentation or product catalog)?
I know that I can achieve something similar with a Code Interpreter-like approach, so in theory I could build a multi-step reasoning agent that starting from my query and the data would try to (1) discover the schema and then (2) crunch the data to try to get to my answer, but I don't know how scalable this approach would effectively be.
Just to clarify - are you wanting the LLM itself to identify what an "anomalous latency" would be based on the data itself? If so, then I don't think this will help you at all until we can actually fit the logs into the context.
What RAG here is doing is using embeddings and a vector store to identify close pieces of information, for example "in this django project add a textfield" will be very close to documentation in the django docs that say "textfield", and it will then add that to the prompt so the LLM has the relevant docs in its context.
The problem is that you'll need a heuristic to identify at least "potentially anomalous" and even then you'll still have to make sure there's enough context for it to know "is this a normal daily fluctuation".
A multi-step agent is definitely what you want; you could have it build a SQL query itself. For example, given "were there any high-latency requests yesterday?" it may identify that it should filter on time, and possibly design the query to determine what counts as "high".
---
At the moment I don't think it's well suited to identifying when the "latency is abnormally high". However, if you have some other system/human identify heuristics to feed to the LLM, it may then be able to at least answer the query.
I was trying to understand if there is an opportunity to introduce some of this technology to solve "anomaly detection" on large amounts of structured data, where "anomaly" might be an incredibly overloaded term (it might imply a performance regression, a security issue, etc.). That is a business need I have today.
It seems that what is possible today is an assistant that can aid a user to get to these answers faster (by, for instance, suggesting a SQL query based on the schema, etc). Again, roughly the equivalent of what Code Interpreter does, just without the local environment limitations.
From your questions it looks like you are only interested in the R part. RAG implies the retrieval step is then used to augment a user prompt.
To answer 1, a good heuristic would be "can a human reasonably familiar with the terminology answer questions about the meaning?" If a human would need extra info to make sense of your data then so would an LLM.
This is where RAG typically comes in. For example, if you had documentation about ClassName and FunctionName, a retrieval model might be able to find the most likely candidates based on a file containing full definitions of these classes and functions, then pass that info into the LLM appended to your query.
For 2: it depends whether the firehose is the query or the data. If it's queries coming in very quickly, you might be able to keep up as long as the volume isn't too high, since you can batch requests and get responses fairly quickly.
If the firehose is the data going into the vector DB, then you might have some difficulty inserting and indexing the data fast enough.
For this kind of structured data and this kind of structured queries, it may be more useful to stick to a data query language (SQL, or some analytics engine).
Thanks. I wonder if a reasonable approach could then be to first insert the data into a data-warehouse-like database suitable for analytics, and then use an LLM application to (1) generate SQL queries that could answer my question, reasoning about the schema, and (2) potentially summarize the output result set. It could still result in a significant boost in productivity.
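A rough sketch of that two-step flow (the table schema, model, and prompts are invented, and generated SQL should obviously be validated and run through a sandboxed, read-only executor before it touches a real warehouse):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    SCHEMA = """
    CREATE TABLE request_logs (
        ts TIMESTAMP, component TEXT, latency_ms DOUBLE, status_code INT
    );
    """  # hypothetical warehouse table

    def ask(question: str, run_sql) -> str:
        # 1. Have the LLM draft a SQL query from the schema and the question.
        sql = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": f"Schema:\n{SCHEMA}\n"
                                  f"Write a single SQL query answering: {question}\n"
                                  "Return only SQL."}],
        ).choices[0].message.content

        rows = run_sql(sql)  # run_sql is your own sandboxed, read-only executor

        # 2. Summarize the result set in plain language.
        return client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": f"Question: {question}\nSQL: {sql}\n"
                                  f"Rows: {rows}\nSummarize the answer."}],
        ).choices[0].message.content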
Indeed, that is a promising path. Fundamentally you still want to rely on a human to figure out what analytics are interesting to consider, then have the LLM act as a helper that generates queries corresponding to those analytics.
Wow this was indeed super comprehensive. A few things I noticed:
- In the cold start section, a couple of the synthetic_data responses say 'context does not provide info..'
- It's strange that retrieval_score would decrease while quality_score increases at the higher chunk sizes. Could this just be that the retrieved chunk is starting to be larger than the reference?
- GPT-3.5 pricing looks out of date; it's currently $0.0015 per 1K input tokens for the 4k model
- Interesting that pricing needs to be shown on a log scale. GPT-4 is 46x more expensive than Llama 2 70B for a ~0.3 score increase. Training a simple classifier seems like a great way to handle this.
- I wonder how stable the quality_score assessment is given the exact same configuration. I guess the score differences between falcon-180b, llama-2-70b and gpt-3.5 are insignificant?
Is there a similarly comprehensive deep dive into chunking methods anywhere? Especially for queries that require multiple chunks to answer at all - producing more relevant chunks would have a massive impact on response quality I imagine.
Anyscale consistently posts great projects. Very cool to see the cost comparison and quality comparison. Not surprising to see that OSS is less expensive, but also rated as slightly lower quality than gpt-3.5-turbo.