I'm beginning to believe LLM benchmarks are like European car mileage specs. The spec says 4 L/100 km, but everyone knows it's at least 30% off (same with WLTP for EVs).
Hrm, it is a bit funny that modern cars are drive-by-wire (at least for the throttle) and yet they still require a skilled driver to follow a speed profile during testing, when in theory a device plugged into the OBD2 port could do the same thing more precisely.