In terms of the actual article -- really nice finding! Or I guess, a nice set of experiments to decipher what lots of LLM researchers have been observing.
I've noticed somewhat similar behavior while training graph neural networks to model physical systems, except that it takes way longer than a single epoch to get there. Of course, there's no pretending involved with my GNNs, but the models do have very constrained representations, so once they start to figure out how to represent the physics at hand, the loss plummets dramatically.
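To make the "constrained representation" point concrete, here's a minimal sketch of the kind of setup I mean, written in plain PyTorch (no torch_geometric) with an interaction-network-style message pass. All the names and dimensions are illustrative, not my actual model: the key feature is that every pairwise interaction has to be squeezed through one small per-edge MLP, so the loss stays high until that bottleneck starts encoding something force-like.

```python
import torch
import torch.nn as nn

class InteractionStep(nn.Module):
    """One message-passing step over a particle graph (illustrative sketch)."""

    def __init__(self, node_dim=8, msg_dim=16):
        super().__init__()
        # Message MLP: the constrained bottleneck -- all pairwise "physics"
        # the model learns must pass through this per-edge function.
        self.message = nn.Sequential(
            nn.Linear(2 * node_dim, msg_dim), nn.ReLU(),
            nn.Linear(msg_dim, msg_dim),
        )
        # Update MLP: combines a node's state with its aggregated messages.
        self.update = nn.Linear(node_dim + msg_dim, node_dim)

    def forward(self, x, edge_index):
        src, dst = edge_index  # (2, E) long tensor of sender/receiver indices
        m = self.message(torch.cat([x[src], x[dst]], dim=-1))
        agg = torch.zeros(x.size(0), m.size(-1))
        agg.index_add_(0, dst, m)  # sum incoming messages at each receiver
        return x + self.update(torch.cat([x, agg], dim=-1))

# Toy usage: 5 particles, fully connected (no self-edges).
x = torch.randn(5, 8)
idx = torch.tensor([(i, j) for i in range(5) for j in range(5) if i != j]).T
print(InteractionStep()(x, idx).shape)  # torch.Size([5, 8])
```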