This was clearly the goal of the "Biology of LLMs" paper (and its ancillary papers), but I am not convinced.
They used a 'replacement model' that, by their own admission, could match the output of the LLM only ~50% of the time, and the attribution of cognition-related labels to the model hinges entirely on the interpretation of the 'activations' seen in that replacement model.
So they created a much simpler model that can sort of, kind of do what the LLM does in some instances, contrived some examples, observed the replacement model, and labeled what it was doing very liberally.
Machine learning and the mathematics involved are quite interesting, but I don't see the need to apply neuroscience/psychology terms to these models. They are fascinating on their own terms, and modelling language can clearly be quite powerful.
But thinking that they can follow instructions and reason is the source of much misdirection. The limits of this approach should make clear that feeding text to a text-continuation program should not lead to parsing the generated text for commands and running those commands, because the tokens the model outputs are only statistically linked to the tokens fed into it. And as the model takes in more tokens from the wild, it can easily end up in situations that pose an enormous risk. Pushing the idea that these models reason about their input is driving all sorts of applications that would clearly be seen as a glaring risk if we treated them as the statistical text-continuation programs they are.
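To make the pattern concrete, here is a minimal sketch of the kind of agent loop I mean. Everything in it is hypothetical (the generate() stand-in, the prompt, the command-extraction regex); it is not anyone's actual implementation, just an illustration of why wiring statistical text continuation into command execution is dangerous.

```python
import re
import subprocess


def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: returns a text continuation
    that is only statistically linked to the prompt tokens."""
    raise NotImplementedError


def risky_agent_step(untrusted_page_text: str) -> None:
    # Tokens "from the wild" (a web page, an email, a file) are concatenated
    # straight into the prompt; the model has no channel that separates
    # instructions from data.
    prompt = (
        "Summarize this page and suggest a shell command:\n"
        + untrusted_page_text
    )
    continuation = generate(prompt)

    # The generated text is then parsed for commands...
    for command in re.findall(r"```sh\n(.*?)\n```", continuation, re.DOTALL):
        # ...and executed. Whatever the untrusted input statistically nudges
        # the model into emitting now runs with the agent's privileges.
        # Treating text continuation as trusted intent is the glaring risk.
        subprocess.run(command, shell=True)  # do not do this
```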
Machine learning and LLMs are interesting technology that should be investigated and developed. Reasoning by induction that they are capable of more than modelling language is bad science and drives bad engineering.