This is a super helpful breakdown and really helps me understand how the RL step is different from the initial training step. I didn't realize the reward was delayed until the end of the response for the RL step. Having the reward for this step depend on a coherent thought rather than a coherent word now seems like an obvious and critical part of how this works.
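Roughly, the contrast looks something like this (a toy sketch of the two objectives in PyTorch, not anyone's actual training code; sample_with_logprobs and reward_fn are hypothetical placeholders):

    import torch.nn.functional as F

    def pretraining_step(model, tokens):
        # Pretraining: a loss signal at every position, rewarding each correct next *word*.
        logits = model(tokens[:, :-1])
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))

    def rl_step(model, prompt, reward_fn):
        # RL fine-tuning: sample a whole response first; the reward arrives only at the end
        # and judges the complete answer, then every sampled token shares that credit.
        response, logprobs = sample_with_logprobs(model, prompt)  # hypothetical helper
        reward = reward_fn(prompt, response)                      # one scalar for the whole response
        return -(reward * logprobs.sum())                         # REINFORCE-style objective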
GP was pondering code re-use. My typical use involves giving an entire file to the LLM and asking it to give the entire file back with the requested changes implemented, so that it's forced to keep the full text in context and can't get too off-track by focusing on small sections of code when related changes might be needed in other parts of the file.
I think all of this is getting at the fact that an LLM won't spit out perfect code in response to a lazy prompt unless it's been heavily post-trained to "reinterpret" sloppy prompts just as rigorously as carefully written ones. Just as with a human programmer, you can hand over a project description, wait for the deliverable, and accept it at face value, or you can join the programmer on their journey and verify the work meets the standards you want. And sometimes there is no other way to get a hard project done.
Conversely, sometimes you can give very detailed specifications and the LLM will just ignore part of them over and over. Hopefully the training experts can continue to improve that.
I had a very similar idea a while back. I wanted to rank news by "impact" which might be more concrete than "significance."
For an LLM prompt, it would be something like:
"estimate the number of people who's lives that will be materially changed by this news."
and
"estimate the average degree of change for those impacted."
Then impact is roughly the product of those two.
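In code, the scoring might be something like this (a rough sketch; ask_llm is a placeholder for whatever model call you use, and the prompts are just the two above):

    def impact_score(article_text: str, ask_llm) -> float:
        # Rough sketch: impact ~= scale * magnitude, both estimated by an LLM.
        scale = float(ask_llm(
            "Estimate the number of people whose lives will be materially "
            "changed by this news. Answer with a single number.\n\n" + article_text))
        magnitude = float(ask_llm(
            "Estimate the average degree of change for those impacted, "
            "on a scale of 0 to 10. Answer with a single number.\n\n" + article_text))
        return scale * magnitude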
Additionally, I want a version that is tailored to me specifically: "estimate the degree of change this will have on my life" + context of my life.
Tangentially, I've found that getting ratings out of LLMs works better when I can give all options and request relative ratings. If I ask for ratings individually, I get different and worse results. There isn't enough context length to rate all news from all time in one go, though. Any thoughts on that? Maybe providing some benchmark ratings with each request could help? Something I'm exploring.
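By benchmark ratings I mean something like this (a sketch only; the anchor items and the ask_llm call are made-up placeholders, not anything I actually run yet):

    # Fixed anchor items with agreed scores, included in every rating request so that
    # separately-rated articles land on a comparable scale. Anchors are illustrative.
    ANCHORS = [
        ("Local bakery changes its opening hours", 1),
        ("Country of 50M people overhauls its tax code", 6),
    ]

    def rate_with_anchors(article_text: str, ask_llm) -> float:
        anchor_block = "\n".join(f'- "{title}": {score}/10' for title, score in ANCHORS)
        prompt = (
            "Here are benchmark items with agreed impact ratings:\n"
            f"{anchor_block}\n\n"
            "Relative to these benchmarks, rate the impact of the following article "
            "from 0 to 10. Answer with a single number.\n\n" + article_text)
        return float(ask_llm(prompt))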
What you're describing is super close to the first version I had!
In the beginning I had 3 parameters: scale (number of people), magnitude (degree of change for those impacted), and potential (how likely the event is to trigger downstream significant events).
The point behind including potential was to separate these two events:
1) An 80-year-old dies from cancer
2) An 80-year-old dies from a new virus called COVID
This worked reasonably well, but I kept adding parameters to improve the system: novelty, credibility, etc. The current system works on 7 parameters.
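The shape of it is roughly this (a sketch using only the parameters named above; the remaining parameters and the equal weighting are illustrative, not necessarily what the real system does):

    # Only scale, magnitude, potential, novelty and credibility are named here;
    # the other parameters and the weighting are placeholders.
    PARAMETERS = ["scale", "magnitude", "potential", "novelty", "credibility"]

    def significance(scores: dict[str, float]) -> float:
        # Each parameter is rated 0-10 by the LLM with the same prompt for every article,
        # so scores stay comparable across months.
        return sum(scores[p] for p in PARAMETERS) / len(PARAMETERS)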
---
I never attempted to give the LLM all options and have it rank them against each other.
1) As you said, for me 20k articles is just too much to fit into a context window. Maybe some modern LLMs can handle it, but that wasn't the case for a long time, and I settled on the current approach.
2) I don't want the "neighbors" to affect individual article ratings. With the current system I am able to compare news spread over months, because they were all rated using the same prompt.
3) I intentionally avoided giving the AI examples, like "evaluate event X given that event Y is 7/10". I want it to give scores with a "clear mind" and not be "primed" by my arbitrary examples.
> 3. Hypothetical answer generation from a query using an LLM, and then using that hypothetical answer to query for embeddings works really well.
I've been wondering about that and am glad to hear it's working in the wild.
I'm now wondering if using an LLM fine-tuned on the corpus to generate the hypothetical answers, and then using those in the RAG flow, would work even better.
The technique of generating hypothetical answers (or documents) from the query was first described in the HyDE (Hypothetical Document Embeddings) paper. [1]
Interestingly, going both ways helps: generating hypothetical answers for the query, and also generating hypothetical questions for each text chunk at ingestion, both increase RAG performance in my experience.
That said, LLM-based query processing is not always suitable for chat applications where inference time is a concern (like near-real-time customer support RAG), so ingestion-time generation of hypothetical questions is more apt there.
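Both directions boil down to something like this (a sketch; generate, embed and the index object are stand-ins for whatever stack you use, not a specific library's API):

    def ingest_chunk(chunk: str, generate, embed, index):
        # Ingestion time: generate hypothetical questions this chunk could answer
        # and index their embeddings alongside the chunk's own embedding.
        questions = generate(f"Write 3 questions that the following text answers:\n\n{chunk}")
        for question in questions.splitlines():
            index.add(vector=embed(question), payload=chunk)
        index.add(vector=embed(chunk), payload=chunk)

    def retrieve(query: str, generate, embed, index, k: int = 5):
        # Query time (HyDE): generate a hypothetical answer and search with its embedding,
        # since it looks more like the indexed documents than the raw query does.
        hypothetical = generate(f"Write a short passage that answers: {query}")
        return index.search(vector=embed(hypothetical), top_k=k)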
We do this as well with a lot of success. It’s cool to see others kinda independently coalescing around this solution.
What we find really effective is at content ingestion time, we prepend “decorator text” to the document or chunk. This incorporates various metadata about the document (title, author(s), publication date, etc).
Then at query time, we generate a contextual hypothetical document that matches the format of the decorator text.
We add hybrid search (BM25 plus a reranker) on top of that, and also add filters (documents published between these dates, by this author, this type of content, etc.). We have an LLM parameterize those filters and use them as part of our retrieval step.
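Concretely, the flow is roughly this (a sketch; every function and the filter format here are stand-ins for whatever your stack provides, not our actual code):

    def ingest(doc, embed, index):
        # Prepend "decorator text" carrying the document's metadata, then embed.
        decorator = (f"Title: {doc['title']}\n"
                     f"Authors: {', '.join(doc['authors'])}\n"
                     f"Published: {doc['date']}\n\n")
        index.add(vector=embed(decorator + doc["text"]), payload=doc)

    def retrieve(query, generate, embed, index, bm25, rerank):
        # Query time: a hypothetical document in the same decorator format, plus
        # LLM-parameterized filters, plus hybrid BM25 + vector search and a rerank pass.
        hypothetical = generate(
            "Write a plausible document matching this query, starting with "
            f"Title/Authors/Published lines:\n\n{query}")
        filters = generate(f"Extract date-range, author and content-type filters from: {query}")
        dense = index.search(vector=embed(hypothetical), filters=filters, top_k=50)
        sparse = bm25.search(query, filters=filters, top_k=50)
        return rerank(query, dense + sparse)[:10]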
But what about chunk size? If we have small chunks, like one sentence, and the HyDE embeddings are most of the time for larger passages, the results are not so good.
Papers only work because they know exactly what the viewport is and can design the layout relative to that. Unless you have an A3-sized screen, this will not work very well online.
You can achieve some of the proportions with vw and vh units inside the article and column containers. Much of the effect comes from nicely laid out columns more than from how many columns wide your digital broadsheet is, so the aesthetic scales okay on smaller screens. On mobile screens it’s just nice-looking individual columns.