> Sometimes I wonder if there is overfitting towards benchmarks

There absolutely is, even when it isn't intended.

The gap between what the model is fitted to and the reality it is used on is essentially every problem in AI, from paperclipping to hallucination, from unlawful output to simple classification errors.

(OK, not every problem: there's also sample efficiency, and…)