I misspoke. My question is whether the model was fine-tuned on multiple prompt/response pairs (with later prompts referencing earlier ones) or only on one-shot prompts. I expect fine-tuning on e.g.:
    User: Pick an animal.
    Bot: Horse
    User: How many legs does it have?
    Bot: 4

to be different from just fine-tuning it on:

    User: Pick an animal.
    Bot: Horse
Especially when it comes to e.g. training the bot to ask follow-up questions, etc.
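To make the distinction concrete, here's a sketch of the two kinds of training examples (the dict format here is made up for illustration; OpenAI hasn't published their actual fine-tuning data format):

    # Hypothetical training examples; the real format OpenAI uses is not public.
    single_turn_example = [
        {"role": "user", "content": "Pick an animal."},
        {"role": "bot", "content": "Horse"},
    ]

    multi_turn_example = [
        {"role": "user", "content": "Pick an animal."},
        {"role": "bot", "content": "Horse"},
        # "it" only resolves via the earlier turn, so training on examples
        # like this is what would teach the model to use conversation history.
        {"role": "user", "content": "How many legs does it have?"},
        {"role": "bot", "content": "4"},
    ]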
32k tokens is novella-to-short-novel length, so it can handle an extremely long conversation. But, sure, it's possible they are using summarization and/or storage-and-retrieval hacks to extend beyond what fits naively in the context window.
Because it introduces a possibly recursive application of the model itself and is a bit messy in that there would be some loss of information. It certainly would be a good technique for external applications, though.
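For what it's worth, that recursive trick would look roughly like this (a sketch: count_tokens and summarize are hypothetical helpers, with summarize calling the model itself, and the budget is arbitrary):

    def build_prompt(turns, count_tokens, summarize, budget=4000):
        # Walk backwards through the conversation, keeping the most
        # recent turns verbatim until the token budget is nearly spent.
        recent = []
        used = 0
        while turns and used + count_tokens(turns[-1]) <= budget:
            used += count_tokens(turns[-1])
            recent.insert(0, turns.pop())
        # Anything older gets compressed by applying the model to itself;
        # this is the recursive, lossy part.
        if turns:
            recent.insert(0, "Summary of earlier conversation: " + summarize(turns))
        return "\n".join(recent)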
Anyway, I got some clarification from ChatGPT itself:
Question: In a growing conversation, does ChatGPT create a summarized version of the conversation history and send it to the model?
Answer:
In a growing conversation, ChatGPT does not create a summarized version of the entire conversation history and send it to the model. Instead, the model only uses the most recent turns of the conversation to generate a response to the current turn.
However, some implementations of ChatGPT do use a technique called "history trimming" to limit the size of the context window that the model considers. History trimming involves discarding some of the oldest turns of the conversation to keep the context window within a manageable size. This allows the model to focus on the most recent turns of the conversation, which may be more relevant to the current turn.
For example, OpenAI's GPT-3 API includes a feature called "context chunking" that automatically trims the conversation history to a maximum length of 2048 tokens. If the conversation history exceeds this length, the oldest turns are discarded to keep the context window within the maximum length.
Overall, while ChatGPT does not create a summarized version of the entire conversation history, some implementations may use history trimming to limit the size of the context window and improve the model's performance.
Q: In a growing conversation, does ChatGPT create a summarized version of the conversation history to build the prompt when it's getting too big for its context size?
A: Yes, ChatGPT may create a summarized version of the conversation history when it's getting too big for its context size. This is because the GPT-3.5 architecture used by ChatGPT has a limited context size, which means that it can only take into account a certain number of previous messages in the conversation when generating a response.
To work around this limitation, ChatGPT may use a summarization algorithm to condense the conversation history into a more manageable size, while still preserving the key points and important information. This summarization process may involve techniques such as text clustering, topic modeling, or text summarization algorithms like BERTSum, T5, or GPT itself.
Once the conversation history has been summarized, ChatGPT can use the condensed version as the prompt to generate a response that is more focused and relevant to the current topic of discussion.
That's how chat with LLMs works. The LLM is not stateful; with every request you have to submit the entirety of the past conversation, including the model's own earlier output.
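A sketch of what that looks like in practice (the transcript format is arbitrary, and llm stands in for any completion call):

    history = []

    def chat(user_message, llm):
        # No state lives in the model: the entire transcript, including
        # the bot's own earlier replies, is resent on every turn.
        history.append("User: " + user_message)
        reply = llm("\n".join(history) + "\nBot:")
        history.append("Bot: " + reply)
        return reply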
> a bit messy in that there would be some loss of information
There has to be a loss of information either way. The API calls made by the web interface point very strongly toward summarization being enabled once you exceed a given length.
> Anyway, I got some clarification from ChatGPT itself
As the other comment points out, what you got was a complete hallucination.
"Context chunking" is not a thing, there's a context window and if you submit more than it can fit, the API returns an error response. If you almost fill the window, it returns what it can and but raises an error flag.
There's a system prompt, but ChatGPT only uses that to kick off the conversation. You can see in the API calls that even your first message is submitted as a user message, and storing that system prompt still happens externally to the LLM.
Wow, yes, point taken. In the absence of any definitive description of what it actually does, I suppose we have to piece together info from the API docs and possibly the InstructGPT paper.
It's not a secret: with the OpenAI API you have to keep sending the previous questions and responses on top of your new question. Essentially you're asking a new question but giving the model more context via the earlier questions and answers.
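Concretely, the loop looks like this (same pre-1.0 client style as above; error handling omitted):

    import openai

    messages = []  # grows for the lifetime of the "chat"

    def ask(question):
        messages.append({"role": "user", "content": question})
        resp = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
        answer = resp.choices[0].message.content
        # The assistant's reply goes back into the list, so the next call
        # carries the full question/answer history as context.
        messages.append({"role": "assistant", "content": answer})
        return answer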
Thank you for clarifying this mystery. My mental model of how ChatGPT works is now clearer. I was somehow thinking that in chat mode it would either (a) need to “update state”, or (b) be fed increasingly longer history. I thought (b) would slow down later responses but I guess with their ginormous compute it’s not perceptible to users.
Someone made a version with a fixed-size context buffer, so it stays at constant speed; it only focuses on the latest questions.
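Presumably something like a sliding window (a sketch; the window size is arbitrary and llm stands in for any completion call):

    from collections import deque

    window = deque(maxlen=8)  # only the last 8 turns survive

    def ask(question, llm):
        # Old turns fall off the left end automatically, so the prompt
        # length, and therefore the response time, stays roughly constant.
        window.append("User: " + question)
        reply = llm("\n".join(window) + "\nBot:")
        window.append("Bot: " + reply)
        return reply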
Yes. The chat functionality is really just a clever trick. All LLMs are still just predicting the next token in a sequence (i.e. finishing the sentence).