I do think it’s an interesting line of inquiry… but not robust enough.
E.g., this paper would be much more interesting if it measured the threshold at which the LLM starts to become good at X, and linked that threshold to the number and character of training examples of X. Then, maybe, we could begin to compare the LLM to a human.
Alas, that study requires access to the training data, and a vast amount of compute to do robustly.