The most famous result is OthelloGPT, where researchers trained a transformer to complete lists of Othello moves, and the transformer developed an internal model of where the pieces were on the board after each move.
The rough consensus is that if you train a model to predict the output of a system for long enough with weight decay, and some nebulous conditions are met (see the "lottery ticket hypothesis"), your model eventually develops an internal simulation of how the system works. That simulation uses fewer weights than "memorize millions of patterns found in the system", and weight decay "incentivizes" lower-weight solutions.
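To make the setup concrete, here is a minimal sketch of that kind of training run: a small next-move predictor trained with weight decay via AdamW. This is not the OthelloGPT code; the vocabulary size, model dimensions, and the random stand-in "moves" are placeholder assumptions for illustration.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration only, not OthelloGPT's actual configuration.
vocab_size, d_model, seq_len = 64, 128, 60

embed = nn.Embedding(vocab_size, d_model)
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(d_model, vocab_size)

params = list(embed.parameters()) + list(model.parameters()) + list(head.parameters())
# weight_decay is the penalty that "incentivizes" lower-weight solutions.
optimizer = torch.optim.AdamW(params, lr=3e-4, weight_decay=0.1)
loss_fn = nn.CrossEntropyLoss()

# Causal mask so each position only sees earlier moves in the game.
mask = nn.Transformer.generate_square_subsequent_mask(seq_len - 1)

for step in range(1000):  # "long enough" is far longer in practice
    moves = torch.randint(0, vocab_size, (32, seq_len))  # stand-in for real game records
    hidden = model(embed(moves[:, :-1]), mask=mask)
    logits = head(hidden)                                 # predict the next move
    loss = loss_fn(logits.reshape(-1, vocab_size), moves[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The claim above is about what such a model converges to when trained far past this toy loop on real game data: instead of memorizing move patterns, it ends up representing board state internally.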