My emerging conception is to split this into two separate questions:
1. Is the architecture _capable_, i.e. is it possible for a model with a given shape to perform some "reasoning"?
2. Is the architecture _trainable_, i.e. do we have the means to learn a configuration of the parameters that achieves what we know the architecture is capable of?
Recent interpretability work like that around Induction Heads [1] and the conclusion that transformers are Turing complete [2], combined with my own work hand-specifying transformer weights to do symbolic multi-digit addition (read: the same way we do it in grade school), has convinced me that reasoning over a finite domain is a capability of even the tiniest models.
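To make the addition example concrete, here's a minimal Python sketch of the grade-school carry procedure over digit sequences. It's only an illustration of the symbolic algorithm that the hand-specified weights would have to implement (the weight construction itself isn't shown, and the function name is made up for this sketch):

```python
def grade_school_add(a_digits, b_digits):
    """Add two numbers given as lists of decimal digits (most significant first),
    the way it's done by hand: right to left, one digit column at a time,
    propagating a single carry. Every intermediate value lives in a tiny finite
    domain (digits 0-9, carry 0 or 1), which is the point of the example."""
    # Pad the shorter number with leading zeros so the columns line up.
    n = max(len(a_digits), len(b_digits))
    a = [0] * (n - len(a_digits)) + a_digits
    b = [0] * (n - len(b_digits)) + b_digits

    result, carry = [], 0
    for da, db in zip(reversed(a), reversed(b)):
        s = da + db + carry
        result.append(s % 10)   # digit written in this column
        carry = s // 10         # carry passed to the next column (0 or 1)
    if carry:
        result.append(carry)
    return list(reversed(result))

# 478 + 64 = 542
print(grade_school_add([4, 7, 8], [6, 4]))  # -> [5, 4, 2]
```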
The emergent properties we see in models like GPT-4 are more a consequence of the fact that we've found a way to train a fairly efficient representation of a significant fraction of "world rules" into a large number of parameters in a finite amount of time.
That's a useful breakdown for thinking about it.
One angle I'm curious about is whether it's, to some extent, as much an artefact of how you regularise the model as of the number of parameters and other factors.
You can think about it like this: if you regularise the network enough, you force it, instead of fitting specific data points, to actually start learning logic internally, because that's the only thing generalisable enough to let it produce realistic text for such a diverse range of prompts. You have to have enough parameters that this is even possible, but once you do, the right training / regularisation essentially starts to inevitably force it into that approach rather than the more direct nearest-neighbour-style "produce something similar to what someone said once before" mechanism.
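A toy way to see the distinction being drawn here (purely illustrative: the task, split sizes, and distance metric are made up, and there's no actual training or regularisation involved): compare a pure lookup mechanism against the underlying rule on a small finite domain. The lookup only works where it happens to have seen something close enough; the rule generalises to every unseen input.

```python
import random

# Toy "world rule": modular addition on a small finite domain.
P = 11
def rule(a, b):
    return (a + b) % P

# Split all (a, b) pairs into things "someone said once before" (memory)
# and held-out pairs the predictor has never seen.
random.seed(0)
pairs = [(a, b) for a in range(P) for b in range(P)]
random.shuffle(pairs)
train, held_out = pairs[:60], pairs[60:]
memory = {(a, b): rule(a, b) for a, b in train}

def nearest_neighbour_predict(a, b):
    """Memorisation-style mechanism: return the answer stored for the most
    similar previously-seen pair, with no notion of *why* it was right."""
    na, nb = min(memory, key=lambda k: abs(k[0] - a) + abs(k[1] - b))
    return memory[(na, nb)]

def rule_predict(a, b):
    """Rule-style mechanism: apply the underlying logic directly."""
    return (a + b) % P

for name, predict in [("nearest-neighbour", nearest_neighbour_predict),
                      ("learned rule", rule_predict)]:
    correct = sum(predict(a, b) == rule(a, b) for a, b in held_out)
    print(f"{name}: {correct}/{len(held_out)} held-out pairs correct")
```

The rule gets every held-out pair right by construction, while the lookup is only right where a near-duplicate happens to give the same answer. The argument above is that sufficient regularisation is what pushes a big enough network from the first mechanism toward the second.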
[1] https://transformer-circuits.pub/2021/framework/index.html
[2] https://jmlr.org/papers/volume22/20-302/20-302.pdf