I'd be very hard-pressed to call this "human behavior". Moving a sprite to a region called "bathroom" and then showing a speech bubble with a picture of a toothbrush and a tooth isn't the same as someone in a real bathroom brushing their teeth. What you can say is that if you can sufficiently reduce behavior to discrete actions in gridded regions of a pixel world, you can use an LLM to produce movesets that sound plausible, because the model is drawing on training data that describes real-world activity. And if you then have a completely separate process manage the output of many LLM calls, you can auto-generate game behavior that is interesting or fun. That's a great result in itself, without the hype!
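To make that concrete, here's a minimal sketch of that division of labor in Python. Everything in it is invented for illustration: call_llm is a stand-in for whatever model API you'd actually use, and the regions and actions are made up, not taken from the paper.

    """Sketch: discrete actions in named regions, an LLM proposing a
    plausible next action, and a separate coordinator validating it."""
    import random

    # The "world" reduced to gridded regions with discrete affordances.
    AFFORDANCES = {
        "bathroom": ["brush teeth", "wash face", "shower"],
        "kitchen": ["make coffee", "eat breakfast"],
        "park": ["take a walk", "talk to visitors"],
    }

    def call_llm(prompt: str) -> str:
        """Placeholder for a real LLM call. Here it just picks one of the
        options listed on the prompt's last line."""
        return random.choice(prompt.splitlines()[-1].split(", "))

    def propose_action(agent: str, region: str) -> str:
        """Ask the model for a plausible next action, constrained to a menu."""
        options = AFFORDANCES[region]
        prompt = (
            f"{agent} is in the {region}. Pick the most plausible next "
            f"action from:\n{', '.join(options)}"
        )
        return call_llm(prompt)

    def coordinate(agents: dict[str, str]) -> dict[str, str]:
        """The separate managing process: check each suggestion against the
        region's action set and fall back to a default on bad output."""
        plan = {}
        for agent, region in agents.items():
            suggestion = propose_action(agent, region)
            options = AFFORDANCES[region]
            plan[agent] = suggestion if suggestion in options else options[0]
        return plan

    if __name__ == "__main__":
        print(coordinate({"Isabella": "bathroom", "Klaus": "park"}))

The point is the split: the LLM only ever picks from a fixed menu, and the coordinator, not the model, is what keeps the simulation coherent.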
The emojis in the speech bubbles are just summaries of the agents' current state. In the demo, if you click on a person you can see the full text of their current state, e.g. "Brushing her teeth" or "taking a walk around Johnson Park (talking to the other park visitors)".
The English word "the" refers to a specific instance of a thing. noobermin said "AI researchers", meaning indefinite researchers in general; you said "the researchers", presumably referring to the specific researchers behind this paper. So you're talking about a different set of researchers than noobermin is, and thus failing to refute their claim.