
+1; ideally I want a tool I don't have to specify the context for. If I can point it via config files at my medium-sized codebase once (~2000 .py files; 300k LOC according to `cloc`), then it starts to become actually usable.

Cursor Composer doesn't handle that and seems geared towards a small handful of handpicked files.

Would Codebuff be able to handle a proper-sized codebase? Or do the models fundamentally not handle that much context?




Yes. Natively, the models are limited to 200k tokens of context. That's on the order of dozens of files, which is way too small.
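
Rough arithmetic, assuming ~10 tokens per line of code (a rule of thumb, not a measurement):

  context_tokens = 200_000       # native context window
  tokens_per_line = 10           # assumption: rough average for code
  lines_that_fit = context_tokens // tokens_per_line
  fraction = lines_that_fit / 300_000       # vs. the 300k LOC codebase above
  print(lines_that_fit, f"{fraction:.0%}")  # 20000 7%

So even generously, the raw window covers only a small slice of a codebase that size.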

But Codebuff has a whole preliminary step where it searches your codebase to find the files relevant to your query, and only those get added to the coding agent's context.

That's why I think it should work up to medium-large codebases. If the codebase is too large, then our file-finding step will also start to fail.

I would give it a shot on your codebase. I think it should work.


RAG is a well-known technique now, and to paraphrase Emily Bender[1], here are some reasons why it's not a solution.

The code extruded from the LLM is still synthetic code, likely to contain errors both in the form of extra tokens motivated by the LLM's pre-training data rather than the input texts, and in the form of omissions. It's difficult to detect when the summary you are relying on is actually missing critical information.

Even if the setup includes links to the retrieved documents, the presence of the generated code discourages users from actually drilling down and reading them.

This is still a framing that says: Your question has an answer, and the computer can give it to you.

[1] https://buttondown.com/maiht3k/archive/information-literacy-...


We actually don't use RAG! It's not that good, as you say.

We build a description of the codebase including the file tree and parsed function names and class names, and then just ask Haiku which files are relevant!

This works much better and doesn't require slowly creating an index. You can just run Codebuff in any directory and it works.
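
In rough pseudocode-ish Python, the shape of it looks like this (a minimal sketch, not our actual implementation; the prompt, model ID, and wiring are illustrative):

  import ast, pathlib
  import anthropic  # official SDK; reads ANTHROPIC_API_KEY from the environment

  def outline(root):
      """Compact codebase description: file paths plus top-level
      function and class names, instead of full file contents."""
      lines = []
      for path in pathlib.Path(root).rglob("*.py"):
          try:
              tree = ast.parse(path.read_text())
          except (SyntaxError, UnicodeDecodeError):
              continue
          names = [n.name for n in tree.body if isinstance(
              n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]
          lines.append(f"{path}: {', '.join(names)}")
      return "\n".join(lines)

  def relevant_files(root, query):
      msg = anthropic.Anthropic().messages.create(
          model="claude-3-haiku-20240307",  # "just ask Haiku"
          max_tokens=512,
          messages=[{"role": "user", "content":
              f"Codebase outline:\n{outline(root)}\n\n"
              f"Task: {query}\n"
              "List the file paths most relevant to this task."}],
      )
      return msg.content[0].text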


It sounds like it's arguably still a form of RAG, just where the retrieval is very different. I'm not saying that to knock your approach, just saying that it sounds like it's still the case where you're retrieving some context and then using that context to augment further generation. (I get that's definitely not what people think of when you say RAG though.)


Genuine question: at what point does the term RAG lose its meaning? Seems like LLMs work best when they have the right context, and that context must be pulled from somewhere for the LLM. But if that's RAG, then what isn't? Do you have a take on this? Been struggling to frame all this in my head, so would love some insight.


RAG is a search step in an attempt to put relevant context into a prompt before performing inference. You are “augmenting” the prompt by “retrieving” information from a data set before giving it to an LLM to “generate” a response. The data set may be the internet, or a code base, or text files. The typical examples online use an embedding model and a vector database for the search step, but doing a web query before inference is also RAG. Perplexity.ai is a RAG system (a fairly good one). I would argue that Codebuff’s directory tree search to find relevant files is a search step. It’s not the same as a similarity search on vector embeddings, and it’s not PageRank, but it is a search step. (There’s a toy sketch of the distinction after the list below.)

Things that aren’t RAG, but are also ways to get an LLM to “know” things that it didn’t know before:

1. Fine-tuning with your custom training data, since it modifies the model weights instead of adding context.

2. LoRA with your custom training data, since it adds a few layers on top of a foundation model.

3. Stuffing all your context into the prompt, since there is no search step being performed.
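
To make the distinction concrete, here's a toy sketch; `vector_store` and `llm` are stand-ins for whatever search backend and model you use:

  # Toy RAG loop: retrieve -> augment -> generate.
  def rag_answer(query, vector_store, llm):
      docs = vector_store.search(query, k=5)       # retrieve: any search step works
      context = "\n\n".join(d.text for d in docs)  # augment the prompt
      return llm.generate(f"Context:\n{context}\n\nQuestion: {query}")  # generate

  # Not RAG by the definition above (item 3): no search step, just stuff everything in.
  def prompt_stuffing_answer(query, all_docs, llm):
      context = "\n\n".join(d.text for d in all_docs)
      return llm.generate(f"Context:\n{context}\n\nQuestion: {query}")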


Gotcha – so it broadly encompasses how we give external context to the LLM. Appreciate the extra note about vector databases; that's where I've heard the term used most, but I'm glad to know it extends beyond that. Thanks for explaining!


Not RAG: asking the LLM to generate using its internal weights only

RAG: providing the LLM with contextual data you’ve pulled from outside its weights that you believe relate to a query


Nice, super simple. We're definitely fitting into this definition of RAG then!


I think parsimo2010 gave a good definition. If you're pulling context from somewhere using some search process to include as input to the LLM, I would call that RAG.

So I would not consider something like a system prompt (which does add context, but does not involve search) to be RAG. Also, using an LLM to generate search terms before returning query results would not be RAG, because the output of the search is not input to the LLM.

I would also probably not consider a system similar to Codebuff that just adds the entire repository as context to be RAG, since there's not really a search process involved. I could see that being a bit of a grey area, though.


> We build a description of the codebase including the file tree and parsed function names and class names

This sounds like RAG and also that you’re building an index? Did you just mean that you’re not using vector search over embeddings for the retrieval part, or have I missed something fundamental here?


Ah yeah, that's what I mean! I thought RAG was synonymous with the vector-search approach.

Either way, we do the search step a little differently, and it works well.


Any kind of search performed beforehand for content to provide as context to the LLM prompt is RAG. The goal is to leverage traditional information retrieval as a source of context. https://cloud.google.com/use-cases/retrieval-augmented-gener...

I'm currently working on a demonstration/POC system using Elasticsearch as my content source, generating embeddings from that content, and passing them to my local LLM.
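
In outline, that pipeline looks something like this (a sketch; it assumes an Elasticsearch 8.x index with a `dense_vector` field called `embedding`, and `embed`/`local_llm` are placeholders for whatever embedding model and local runtime you use):

  from elasticsearch import Elasticsearch

  def embed(text):
      ...  # placeholder: your embedding model

  def local_llm(prompt):
      ...  # placeholder: your local LLM runtime

  es = Elasticsearch("http://localhost:9200")

  def answer(question):
      # Vector search over documents whose embeddings were stored at index time.
      hits = es.search(
          index="content",  # hypothetical index name
          knn={"field": "embedding", "query_vector": embed(question),
               "k": 5, "num_candidates": 50},
      )["hits"]["hits"]
      context = "\n\n".join(h["_source"]["body"] for h in hits)
      return local_llm(f"Context:\n{context}\n\nQuestion: {question}")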


It would be cool to be talking to other people about the RAG systems they're building. I'm working in a silo at the moment, and I'm pretty sure that I'm reinventing a lot of techniques.


I didn't mean to be down on it, and I'm really glad it's working well! If you start to reach the limits of what you can achieve with your current approach, there are lots of cute tricks you can steal from RAG. For example, nothing stops you from doing a fuzzy keyword search for interesting-looking identifiers on larger codebases, rather than giving the LLM the whole thing in-prompt.
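
A crude version of that fuzzy identifier search (a toy sketch; `get_close_matches` is one cheap fuzzy matcher, and a real version would want a proper tokenizer or ripgrep):

  import pathlib, re
  from difflib import get_close_matches

  def candidate_files(query_terms, root, n=20):
      """Rank files by fuzzy overlap between query terms and the
      identifiers they contain; feed only the top n to the LLM."""
      ident = re.compile(r"[A-Za-z_][A-Za-z0-9_]{2,}")
      scores = {}
      for path in pathlib.Path(root).rglob("*.py"):
          try:
              idents = set(ident.findall(path.read_text()))
          except (UnicodeDecodeError, OSError):
              continue
          # Slow but simple: count near-matches of each term against the file's identifiers.
          score = sum(len(get_close_matches(t, idents, cutoff=0.8))
                      for t in query_terms)
          if score:
              scores[path] = score
      return sorted(scores, key=scores.get, reverse=True)[:n]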


I'll need to get approval to use this on that codebase. I've tried it out on a smaller open-source codebase as a first step.

For anyone interested:

  - here's the Codebuff session: https://gist.github.com/craigds/b51bbd1aa19f2725c8276c5ad36947e2
  - The result was this PR: https://github.com/koordinates/kart/pull/1011

It required a bit of back-and-forth to produce a relatively small change, and I think it was a bit too narrow with the files it selected (it missed updating the implementations of a method in some subclasses, since it didn't look at those files).

So I'm not sure this saved me time, but it's promising nevertheless! I'm looking forward to seeing what it will be capable of in six months.


What's the fundamental limitation on context size here? Why can't a model be fine-tuned per codebase, taking the entire codebase into context (and be continuously trained as the code is updated)?

Forgive my naivety; I don't know anything about LLMs.



