Nice overview of progress over time. Are there quant metrics for the sim capabilities, or is it mostly vibes?



Cofounder of Websim here. Right now there doesn't seem to be any established eval for a language model's simulation capabilities. Internally, we've (vibe) tested Llama 3, Command R+, WizardLM 8x22b, Mistral Large (the first version of Websim came out of a Mistral hackathon), and GPT-4 Turbo, and found them all lacking, due to either meh website outputs or mode collapse from reinforcement learning (lack of creativity and flexibility). That may also be a "skill issue" on our end, since our system prompt is heavily optimized for Claude 3's "mind." We'll release functionality in the next week or two that lets users update the system prompt, which should make this less of an issue.
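
Roughly, a user-editable system prompt could look like this against the Anthropic Messages API. To be clear, the prompt text, the generate_page helper, and the "GET {url}" message format below are placeholders I'm making up for illustration, not our real prompt or code:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Placeholder prompt -- not the real Websim system prompt.
    DEFAULT_SYSTEM_PROMPT = (
        "You are a simulator of an alternate internet. Respond to each "
        "requested URL with a complete, imaginative HTML page."
    )

    def generate_page(url: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
        """Render one simulated page; callers may swap in their own system prompt."""
        response = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=4096,
            system=system_prompt,  # the part users would be able to edit
            messages=[{"role": "user", "content": f"GET {url}"}],
        )
        return response.content[0].text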

Claude 3 has a much broader latent space and seems to "enjoy" imagining things. It hasn't been banged into too specific an assistant shape, and doesn't suffer the same degree of "mode collapse": https://lesswrong.com/posts/t9svvNPNmFf5Qa3TA/mysteries-of-m...

Even Sonnet produces mind-blowingly good outputs (https://x.com/RobertHaisfield/status/1774579381132050696). Haiku is capable of producing full websites with insightful and creative content, even if it isn't as capable as Sonnet or Opus. For example, I found Curio, an esolang where every line of code is a living, sentient being with its own unique personality, memories, and goals, mostly by browsing around with Haiku (https://x.com/RobertHaisfield/status/1782586807261233620). Haiku does tend to perform better, though, when it's few-shot prompted with outputs from Sonnet or Opus earlier in the "browser history."
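
To make that few-shot trick concrete, here's a rough sketch against the Anthropic Messages API: generate one exemplar page with Sonnet, then replay that exchange as earlier "browser history" before asking Haiku for the next page. The URLs, system prompt, and "GET" message format are all invented placeholders, not Websim's actual pipeline:

    import anthropic

    client = anthropic.Anthropic()
    SYSTEM = "You are a simulator of an alternate internet."  # placeholder

    # 1. Generate a high-quality exemplar page with Sonnet.
    exemplar_url = "https://example.com/curio/manifesto"  # made-up URL
    exemplar = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=4096,
        system=SYSTEM,
        messages=[{"role": "user", "content": f"GET {exemplar_url}"}],
    ).content[0].text

    # 2. Replay that exchange as earlier "browser history" so Haiku is
    #    few-shot primed with a strong example before the real request.
    history = [
        {"role": "user", "content": f"GET {exemplar_url}"},
        {"role": "assistant", "content": exemplar},
        {"role": "user", "content": "GET https://example.com/curio/spec"},
    ]
    page = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=4096,
        system=SYSTEM,
        messages=history,
    ).content[0].text
    print(page)

Because the exemplar lives in the message history rather than in the system prompt, the cheaper model picks up the stronger model's style and ambition without any fine-tuning.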
