But for convenience maybe I'll just copy them into a comment...
It describes an environment where multiple #LLM (#GPT)-powered agents interact in a small town.
I'll write my notes here as I read it...
To indicate actions in the world, they represent them as emoji in the interface; e.g., "Isabella Rodriguez is writing in her journal" is displayed as an emoji.
You can click on the person to see the exact details, but this emoji summarization is a nice idea for overviews.
A user can interfere with (or "steer," if you are feeling generous) the simulation through chatting with agents, but more interestingly they can "issue a directive to an agent in the form of an 'inner voice'"
Truly some miniature Voice Of God stuff here!
I'll see if this is detailed more later in the paper, but initially it sounds like simple prompt injection. Though it's unclear if it's injecting things into the prompt or into some memory module...
Reading "Environmental Interaction" it sounds like they are specifying the environment at a granular level, with status for each object.
This was my initial thought when trying something similar, though now I'm more interested in narrative descriptions; that is, describing the environment to the degree it matters or is interesting, and allowing stereotyped expectations to basically "fill in" the rest. (Though that certainly has its own issues!)
They note the language is stilted and suggest later LLMs could fix this. It's definitely resolvable right now; whatever results they are getting are the results of their prompting.
The conversations remind me of something Nintendo would produce: short, somewhat bland, but affable. They must have worked to make the interactions so short, as that's not GPT's default style. But every example they give is also an instruction, so the brevity might have slipped in that way.
Memory is a big fixation right now, though I'm just not convinced. It's obviously important, but is it a primary or secondary concern?
To contrast, some other possible concerns: relationships, mood, motivations, goals, character development, situational awareness... some of these need memory, but many do not. Some are static, but many are not.
To decide which memories to retrieve, they multiply several scores together, including recency. Recency is an exponential decay of 1% per hour.
That seems excessive...? It doesn't feel like recency should ever multiply something down to zero. Though it's recency of access, not recency of creation. And perhaps the world just doesn't get old enough for this to cause problems. (It was limited to 3 days, about 72 hours, and 0.99^72 ≈ 0.48, so roughly a 50% max recency penalty.)
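A minimal sketch of that scoring, assuming the multiply-the-components reading above (the component names and 0-to-1 ranges are my guesses, not the paper's):

```python
# Sketch of the retrieval scoring as described above (my naming, not theirs):
# component scores are multiplied together, with recency modeled as a
# 1%-per-game-hour exponential decay since the memory was last accessed.
def recency_score(hours_since_access: float, decay: float = 0.99) -> float:
    return decay ** hours_since_access

def retrieval_score(importance: float, relevance: float, hours_since_access: float) -> float:
    # importance: 0-to-1 score assigned when the memory was stored
    # relevance: 0-to-1 similarity between the query and the memory's embedding
    return recency_score(hours_since_access) * importance * relevance

# A memory untouched for the full 3-day run keeps about half its score from recency:
# 0.99 ** 72 ≈ 0.48
```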
The reflection part is much more interesting: given a pool of recent memories they ask the LLM to generate the "3 most salient high-level questions we can answer about the subjects in the statements?"
Then the questions serve to retrieve concrete memories from which the LLM creates observations with citations.
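The loop, as I understand it (the llm()/retrieve() helpers and the prompt wording, apart from the quoted question, are placeholders, not the paper's actual prompts):

```python
def reflect(agent, llm, retrieve):
    # Assumptions: agent.memory_stream is a list[str], llm(prompt) returns a
    # list of questions for the first call and a string for the second,
    # retrieve(agent, query) returns a list[str] of memory descriptions.
    recent = agent.memory_stream[-100:]
    questions = llm(
        "Given only the statements below, what are the 3 most salient "
        "high-level questions we can answer about the subjects in the statements?\n"
        + "\n".join(recent)
    )
    for q in questions:
        evidence = retrieve(agent, query=q)
        insight = llm(
            "Statements:\n"
            + "\n".join(f"{i}. {s}" for i, s in enumerate(evidence, 1))
            + f"\nWhat high-level insight answers '{q}'? Cite the statement numbers as evidence."
        )
        agent.memory_stream.append(insight)  # reflections go back into the memory stream
```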
Planning and re-planning are interesting. Agents specifically plan out their days, first as a time outline, then with specific breakdowns inside that outline.
For revising plans there's a query process: take an observation, expand it into something longer (fusing in retrieved memories, etc.), and then ask "Should they react to the observation, and if so, what would be an appropriate reaction?"
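Something like this, I think (the helpers, the agent attributes, and the answer-parsing convention are my assumptions, not theirs):

```python
def maybe_react(agent, observation, llm, retrieve):
    # Assumptions: agent has .summary, .name, .plan (list[str]), .current_step (int);
    # retrieve(agent, query) returns list[str]; llm(prompt) returns a string.
    memories = retrieve(agent, query=observation)
    prompt = (
        f"{agent.summary}\n"
        f"Observation: {observation}\n"
        "Relevant memories:\n" + "\n".join(memories) + "\n"
        f"Should {agent.name} react to the observation, and if so, "
        "what would be an appropriate reaction? Answer 'No' or describe the reaction."
    )
    answer = llm(prompt).strip()
    if not answer.lower().startswith("no"):
        # Drop the rest of today's plan and let the reaction become the next step
        agent.plan = agent.plan[: agent.current_step + 1] + [answer]
    return answer
```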
Interviewing the agents as a means of evaluation is kind of interesting. Self-knowledge becomes the trait that is judged.
Then they cut out parts of the agent and see how well they perform in those same interviews.
Still... the use of quantitative measures here feels a little forced when there are so many rich qualitative comparisons to be done. I'd rather see individual interactions replayed and compared across different sets of functionality.
They say they didn't replay the entire world with different functionality because each version would drift (which is fair and true). But instead they could just enter into a single moment to do a comparison (assuming each moment is fully serializable).
I've thought about updating world state with operational transforms in part for this purpose, to make rewind and effect tracking into first-class operations.
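Less full operational transforms than a plain operation log with undo, but a sketch of what "rewind as a first-class operation" could mean (every name here is hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    path: str        # e.g. "kitchen.stove.status"
    old: object      # value before the change, which is what makes rewind possible
    new: object      # value after the change
    cause: str = ""  # e.g. "Isabella: make breakfast" -- effect tracking

@dataclass
class World:
    state: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def apply(self, op: Op):
        self.state[op.path] = op.new
        self.log.append(op)

    def rewind(self, n: int):
        # Undo the last n operations, restoring prior values and trimming the log
        if n <= 0:
            return
        for op in reversed(self.log[-n:]):
            self.state[op.path] = op.old
        del self.log[-n:]
```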
Well, I'm at the end now. Interesting, but I wish I knew the exact prompts they were using. The details matter a lot. "Boundaries and Errors" touched on this, but that section could have been 4x the size; there's a lot to be said about the prompts and how they interact with memories and personality descriptions.
...
I realize I missed the online demo: https://reverie.herokuapp.com/arXiv_Demo/
It's a recording of the play run.
I also missed this note: "The present study required substantial time and resources to simulate 25 agents for two days, costing thousands of dollars in token credit and taking multiple days to complete"
I'm slightly surprised, though if they are doing minute-by-minute ticks of the clock over all the agents then it's unsurprising. (Or even if it's less intensive than that.)
You can look at specific memories: https://reverie.herokuapp.com/replay_persona_state/March20_t...
Granularity looks to be 10 seconds, very short! It's not filtering memories based on expected vs. interesting, so there are lots of "X is idle" notes.
If you look at these states, the core information (the personality of the person) is very short, and there are lots of incidental memories. What matters? What could just be filled in as "life continued as expected"?
One path to greater efficiency might be to encode "what matters" for a character in a way that doesn't require checking in with GPT.
Could you have "boring embeddings"? Embeddings that represent the stuff the eye just passes right over without really thinking about it. Part of training up a character would be building up this database of disinterest. Perhaps not unlike babies with overconnected brains that need synapse pruning to be able to pay attention to anything at all.
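A speculative sketch, assuming an embed() function from whatever sentence-embedding model you like (everything else is made up):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class BoringFilter:
    """Per-character database of disinterest: skip the expensive LLM call
    when a new observation looks too much like things already marked boring."""

    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed          # str -> np.ndarray, any embedding model
        self.threshold = threshold
        self.boring: list[np.ndarray] = []

    def mark_boring(self, observation: str):
        self.boring.append(self.embed(observation))

    def is_boring(self, observation: str) -> bool:
        v = self.embed(observation)
        return any(cosine(v, b) >= self.threshold for b in self.boring)

# Usage: only observations that pass the filter get sent to the model.
# if not boring.is_boring("the stove is idle"):
#     react_with_llm(observation)
```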
Another option might be for the characters to compose their own "I care about this" triggers, where those triggers are low-cost code (low cost compared to GPT calls) that can be run in a tighter loop in the simulation.
I think this is actually fairly "believable" as a decision process, as it's about building up habituated behavior, which is what believable people do.
Opens the question of what this code would look like...
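One guess at the shape of it: cheap predicates over world/agent state that decide when to wake the expensive LLM "overmind" (everything here is hypothetical, including the example conditions):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trigger:
    description: str                   # e.g. "watch out for talk of the party"
    condition: Callable[[dict], bool]  # cheap check, run every simulation tick
    priority: int = 0

# Triggers a character might compose for itself:
triggers = [
    Trigger("someone mentions the Valentine's party",
            lambda s: "valentine" in s.get("heard_speech", "").lower(), priority=2),
    Trigger("stress over threshold",
            lambda s: s.get("stress", 0.0) > 0.8, priority=3),
    Trigger("it's late and I'm not home",
            lambda s: s.get("hour", 0) >= 23 and s.get("location") != "home"),
]

def tick(state: dict, call_overmind: Callable[[str], None]):
    fired = sorted((t for t in triggers if t.condition(state)), key=lambda t: -t.priority)
    if fired:
        # Only now do we pay for an LLM call, with the firing reasons attached
        call_overmind("; ".join(t.description for t in fired))
```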
This is a sneaky way to phrase "AI coding its own soul" as an optimization.
The planning is like this, but I imagine a richer language. Plans are only assertive: try to do this, then that, etc. The addition would be things like "watch out for this" or "decide what to do if this happens" – lots of triggers for the overmind.
Some of those triggers might be similar to "emotional state." Like, keep doing normal stuff unless a feeling goes over some threshold, then reconsider.
I'm going to be genuinely surprised if we don't see an incredibly buggy but incredibly fascinating Sims knockoff in a year or two built around a system like this.