Cofounder of Websim here. Right now it's not clear that there's any eval for a language model's simulation capabilities. Internally, we've (vibe) tested Llama 3, Command R+, WizardLM 8x22b, Mistral Large (first version of Websim came out of a Mistral hackathon) and GPT-4 Turbo and found them all lacking, due to either meh website outputs or mode collapse from reinforcement learning (lack of creativity and flexibility). That may also be a "skill issue" thing because our system prompt is very much optimized for Claude 3's "mind." We'll release functionality in the next week or two that lets users update the system prompt, in which case this may be less of an issue.
Claude 3 has a much broader latent space, and seems to "enjoy" imagining things. It hasn’t been banged into too specific of an assistant shape, and doesn’t suffer the same degree of “mode collapse” https://lesswrong.com/posts/t9svvNPNmFf5Qa3TA/mysteries-of-m...
Even Sonnet produces mindblowingly good outputs (https://x.com/RobertHaisfield/status/1774579381132050696). Haiku is capable of producing full websites with insightful and creative content, even if it isn't as capable as Sonnet/Opus. For example, I found Curio, an esolang where every line of code is a living, sentient being with its own unique personality, memories, and goals, mostly by browsing around with Haiku (https://x.com/RobertHaisfield/status/1782586807261233620). That said, Haiku tends to perform better when it is few-shot prompted with outputs from Sonnet or Opus earlier in the "browser history."
22-odd years ago I emailed Ross out of the blue. At the time I was working at a small startup attempting to build mobile banking infrastructure for rural poor in South Africa and elsewhere. I was after a copy of a paper he had mentioned in a talk, and Ross replied with a blunt, to-the-point email about how hard he thought the problem domain we were trying to tackle was (along with a pointer to the paper I had asked about). I remember being slightly annoyed by the initial tone of his response, but it got me thinking - then thinking a lot more. There were some very well-thought-out reasons behind his arguments. I replied a few days later with a detailed list of how we were addressing his concerns, along with some others he hadn't mentioned, while also acknowledging the areas we needed to dig into further. I didn't really expect an answer; after all, I didn't know him, but he had got me to think hard on key problems and I wanted to acknowledge that.
We got a lot more than a simple reply. We got his focused feedback, constructive criticisms, pointers to other work he thought was relevant, general support, some key follow up conversations on the phone, followed by introductions to folks he knew in industry who he thought could help (who would have never entertained us at that stage otherwise). While the startup eventually didn't make it, many of the ideas we worked on did. Of the folks that we met through that whole adventure, Ross was one of the standouts. Yup, he was very good people indeed and will be sadly missed!
I get the tokenization argument and it may influence it a bit, but I suspect the n-digit math issue has more to do with search, i.e. the way it samples (in the BPE link gwern references some experiments I'd done with improving n-digit math by chunking using commas, http://gptprompts.wikidot.com/logic:math). I think since it samples left to right on the first pass, it's not able to predict well whether things carry from right to left.
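To make the carry point concrete, here's a minimal sketch in plain Python (nothing model-specific; `carry_chain_length` is a name made up for this illustration). It shows that the first digit of a sum, which a left-to-right sampler has to commit to first, can depend on the very last digits of the operands:

```python
def carry_chain_length(a: int, b: int) -> int:
    """Longest run of consecutive carries when adding a and b digit by digit."""
    longest = run = carry = 0
    while a > 0 or b > 0:
        total = a % 10 + b % 10 + carry
        carry = total // 10
        # Extend the current run if this column carried, otherwise reset it.
        run = run + 1 if carry else 0
        longest = max(longest, run)
        a //= 10
        b //= 10
    return longest


# The two sums differ only in the last digit of the second operand,
# yet the most significant digit of the result changes.
print(4999 + 5000, carry_chain_length(4999, 5000))  # 9999 0
print(4999 + 5001, carry_chain_length(4999, 5001))  # 10000 4
```

In the second case the carry ripples through every column, so the leading "1" can't be known without effectively doing the whole addition first.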
Yup, quite possible that this has something to do with it. There is other work showing that giving LMs a "scratchpad" for intermediate computations allows them to do much better not just at arithmetic but also things like predicting the output of some code: https://arxiv.org/abs/2112.00114
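As a rough illustration of what a scratchpad target for addition might look like, here's a small sketch that spells out the right-to-left steps before the answer. The trace format and the `addition_scratchpad` name are assumptions for this example, not the format used in the linked paper:

```python
def addition_scratchpad(a: int, b: int) -> str:
    """Render a + b as explicit right-to-left column steps, then the answer."""
    lines = [f"Input: {a} + {b}", "Scratchpad:"]
    x, y, carry, digits = a, b, 0, []
    while x > 0 or y > 0 or carry:
        dx, dy = x % 10, y % 10
        total = dx + dy + carry
        lines.append(
            f"{dx} + {dy} + carry {carry} = {total}, "
            f"write {total % 10}, carry {total // 10}"
        )
        digits.append(str(total % 10))
        carry = total // 10
        x //= 10
        y //= 10
    # Digits were produced least-significant first, so reverse them.
    result = "".join(reversed(digits)) or "0"
    lines.append(f"Answer: {result}")
    return "\n".join(lines)


print(addition_scratchpad(478, 64))
```

The idea is that each line only needs local information (two digits and a carry), so the model never has to guess a leading digit before the carries are resolved.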
I think the check and validate is a different sort of scratchpad, but maybe not. Seems like there are at least 3 types: some for pulling implicit info out of the network (viz. WiC), sometimes for intermediary steps (viz. coding), sometimes for verification like here.
The big caveat here is that the inner monologue papers generally work with GPT-3-175b, LaMDA, or Gopher, all of which are much bigger than 20b, and they generally show phase transitions (https://old.reddit.com/r/mlscaling/comments/sjzvl0/d_instanc...) in the monologue capability: below a critical size, inner monologue doesn't work at all, performing worse than baseline even, no matter how they scale, and only past the critical size does inner monologue suddenly start working much better. So it's possible (has anyone checked?) that GPT-NeoX-20b just isn't large enough to do inner monologue.
yeah, that's a very big caveat - haven't checked neo 20b yet. I've had a hard time getting the AI21 models to use it and those are also pretty big, so it's interesting why sometimes it works and sometimes it doesn't. (and Davinci > Codegen Davinci > Curie > J-6B). Fine-tunes can also learn to do the inner monologue, which is really cool - not sure how much is architecture vs. training parameters.
Couple of things there where you can see if it improves with the prompt/formatting. E.g. with Davinci (and J a bit, but I didn't test too much) you can get better results by:
- Using few-shot examples of similar length to the targets (e.g. 10 digit math, use 10 digit few shots)
- Chunking numbers with commas
- Having it double check itself
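Roughly, the three tricks above could be combined into a prompt builder like this. It's a sketch under assumptions: the `Q:`/`A:`/`Check:` layout and the names `chunk` and `build_prompt` are made up for illustration, not the exact prompts from the wiki page:

```python
import random


def chunk(n: int) -> str:
    """Format an integer with comma separators, e.g. 1234567 -> '1,234,567'."""
    return f"{n:,}"


def build_prompt(a: int, b: int, shots: int = 3, seed: int = 0) -> str:
    """Few-shot addition prompt whose examples match the target's digit length."""
    rng = random.Random(seed)
    n_digits = max(len(str(a)), len(str(b)))
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    blocks = []
    for _ in range(shots):
        x, y = rng.randint(lo, hi), rng.randint(lo, hi)
        blocks.append(
            f"Q: {chunk(x)} + {chunk(y)}\n"
            f"A: {chunk(x + y)}\n"
            # Demonstrate the double-check step by subtracting back.
            f"Check: {chunk(x + y)} - {chunk(y)} = {chunk(x)}. Correct."
        )
    # Leave the final answer open for the model to complete.
    blocks.append(f"Q: {chunk(a)} + {chunk(b)}\nA:")
    return "\n\n".join(blocks)


print(build_prompt(1234567890, 9876543210, shots=2))
```

The completion would then be expected to continue with an answer and its own `Check:` line, which is where the self-verification comes in.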
This has been a systemic issue reported on for years; e.g. reported by the Intercept in 2017 [1] and Atlantic in 2019 [2]. Not really made clear from the story considering the Economist headline is almost identical to the Atlantic one.
It's a good one! You helping out with the ML at Neuralink at all? Maybe someday Autopilot could learn something from recordings of people's brains while driving :)
Just priming an immediate availability response is likely going to get poor results.
On the other hand, this does bring up an important point, which is that few people have been systematically trying to figure out how to get it to reason through problems. For instance, if you try the pure completion on WiC you get chance-level accuracy of 50% (like in the paper), but if you improve the prompt to self-context stuff you raise it to almost 70% (http://gptprompts.wikidot.com/linguistics:word-in-context).