If you ran the same query set 30x or 15x through the cheaper model (and compensated for all the extra tokens the reasoning model uses), could you realize the same 26% quality gain in a machine-adjudicable kind of way?
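As a back-of-the-envelope check on "how many cheap calls fit in one expensive call" (every number below is a placeholder assumption, not a quote from any price list; in particular the 2x token figure is made up):

```python
# Placeholder figures only, to show the shape of the break-even math.
EXPENSIVE_PRICE = 15.0            # relative output price of the pricier model
CHEAP_PRICE = 1.0                 # relative output price of the cheaper model
EXPENSIVE_TOKENS_PER_ANSWER = 2.0 # assume the pricier model also emits ~2x tokens
CHEAP_TOKENS_PER_ANSWER = 1.0

cost_expensive = EXPENSIVE_PRICE * EXPENSIVE_TOKENS_PER_ANSWER
cost_cheap = CHEAP_PRICE * CHEAP_TOKENS_PER_ANSWER

# How many cheap-model samples fit in the budget of one expensive answer?
print(cost_expensive / cost_cheap)  # 30.0 under these assumed numbers
```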
Ignoring latency for a second, one of the tricks for boosting quality is to use consensus. One probably does not need to call the lesser model 30x as often to achieve these sorts of gains. Moreover, you have to take the purported gains with a grain of salt: the models are probably trained on the evaluation sets they are benchmarked against.
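A minimal sketch of what I mean by consensus (majority vote over repeated samples, sometimes called self-consistency); `ask_cheap_model` is a hypothetical stand-in for whatever client you actually use, not a real API:

```python
from collections import Counter

def ask_cheap_model(prompt: str) -> str:
    """Placeholder: return one sampled answer from the cheaper model."""
    raise NotImplementedError

def consensus_answer(prompt: str, n_samples: int = 7) -> str:
    """Sample the cheap model several times and return the most common answer."""
    answers = [ask_cheap_model(prompt) for _ in range(n_samples)]
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common
```

This works best when answers can be compared exactly (multiple choice, numbers, short strings), which is also when the result is machine-adjudicable.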
3.5x on a normal distribution with mean 100 and SD 15 is pretty insane. But I agree with your point: being 26% better on a certain benchmark could be a tiny difference or an incredible improvement (imagine the hardest questions being the Riemann hypothesis, P != NP, etc.).
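As a rough illustration of why raw-point differences on an N(100, 15) scale can look dramatic (treating the 26% figure as a 26-point bump, which may not be what the benchmark numbers actually mean):

```python
from math import erf, sqrt

def percentile(score: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Fraction of the distribution at or below `score` (normal CDF)."""
    return 0.5 * (1.0 + erf((score - mean) / (sd * sqrt(2.0))))

print(percentile(100))  # 0.50: the mean sits at the 50th percentile
print(percentile(126))  # ~0.96: a 26-point bump lands near the 96th percentile
```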
Output price difference: 4.5 is 15x more expensive.
In their model evaluation scores in the appendix, 4.5 is, on average, 26% better. I don't understand the value here.