Chess is essentially a puzzle. There's a single explicit, quantifiable goal, and a solution either achieves the goal or it doesn't.

Solving puzzles is a specific cognitive task, not a general one.

Language is a continuum, not a puzzle. The problem with LLMs is that testing has been reduced to performance on language puzzles, mostly ones with hard edges, like bar exams or letter counting, and those are a small subset of general language use.