These are all great points - who or what you ask to manage memory is a design decision, and IMO there are two main ways to do it (in the context of chatbots):
* implicit memory management, where the "main LLM" (or for chat, the "dialogue thread") is unaware that memory is being managed in the background (by a "memory LLM", a rule-based script, a small neural network, etc.), and
* explicit memory management (MemGPT), where one LLM does everything
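To make the distinction concrete, here's a toy sketch (not MemGPT's actual code - `call_llm` is a placeholder for any chat-completion API, and the `SAVE(...)` convention is invented for illustration):

```python
def call_llm(system_prompt: str, messages: list[str]) -> str:
    # Placeholder: a real implementation would call an LLM API here.
    return f"reply to: {messages[-1]}"

# --- Implicit: something outside the main LLM updates the memory store ---
def implicit_turn(memory: list[str], user_msg: str) -> str:
    reply = call_llm(
        "You are a helpful assistant.\nMemory:\n" + "\n".join(memory),
        [user_msg],
    )
    # The main LLM never sees this step; a second "memory LLM", a
    # rule-based script, or a small model decides what to remember.
    if "my name is" in user_msg.lower():  # e.g. a rule-based memory writer
        memory.append(user_msg)
    return reply

# --- Explicit: one LLM is instructed to edit its own memory ---
def explicit_turn(memory: list[str], user_msg: str) -> str:
    system = (
        "You manage your own memory. To save a fact, respond with "
        "SAVE(<fact>); otherwise answer normally.\nMemory:\n"
        + "\n".join(memory)
    )
    out = call_llm(system, [user_msg])
    if out.startswith("SAVE(") and out.endswith(")"):
        memory.append(out[5:-1])  # the LLM itself chose to write memory
        out = call_llm(system, [user_msg, "(memory saved, now answer)"])
    return out
```

The explicit path is where the instruction-following burden shows up: the LLM has to reliably emit the memory-edit action at the right times, on top of holding the conversation.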
Prior research in multi-session / long-range chat is often implicit, with a designated memory creation process. If I had to guess, I'd say the vast majority of consumer chatbots that implement some type of memory store are also implicit. This is because getting explicit memory management to work requires a lot of complex instruction following, and in our experience this just isn't possible at the moment with most publicly available LLMs (we're actively looking into ways to fix this via e.g. fine-tuning open models).
The tradeoffs are as you mentioned: with implicit, you don't have to stuff all the memory management instructions into the LLM preprompt (in MemGPT, the total system message is ~1k tokens). On the other hand, explicit memory management (when the LLM can actually follow it) makes the overall system a lot simpler - there's no need to manage multiple LLMs running on parallel threads, which can add a lot of overhead.