If you ran the same query set 30x or 15x through the cheaper model (and compensated for all the extra tokens the reasoning model uses), could you realize the same 26% quality gain in a machine-adjudicable kind of way?
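As a back-of-the-envelope check on "how many cheap calls fit in one expensive call" (every number below is a placeholder assumption, not a quote from any price list; in particular the 2x token figure is made up):

```python
# Placeholder figures only, to show the shape of the break-even math.
EXPENSIVE_PRICE = 15.0            # relative output price of the pricier model
CHEAP_PRICE = 1.0                 # relative output price of the cheaper model
EXPENSIVE_TOKENS_PER_ANSWER = 2.0 # assume the pricier model also emits ~2x tokens
CHEAP_TOKENS_PER_ANSWER = 1.0

cost_expensive = EXPENSIVE_PRICE * EXPENSIVE_TOKENS_PER_ANSWER
cost_cheap = CHEAP_PRICE * CHEAP_TOKENS_PER_ANSWER

# How many cheap-model samples fit in the budget of one expensive answer?
print(cost_expensive / cost_cheap)  # 30.0 under these assumed numbers
```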
Ignoring latency for a second, one of the tricks for boosting quality is to use consensus. One probably does not need to call the lesser model 30x as often to achieve these sorts of gains. Moreover, you have to take the purported gains with a grain of salt: the models are probably trained on the evaluation sets they are benchmarked against.
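A minimal sketch of what I mean by consensus (majority vote over repeated samples, sometimes called self-consistency); `ask_cheap_model` is a hypothetical stand-in for whatever client you actually use, not a real API:

```python
from collections import Counter

def ask_cheap_model(prompt: str) -> str:
    """Placeholder: return one sampled answer from the cheaper model."""
    raise NotImplementedError

def consensus_answer(prompt: str, n_samples: int = 7) -> str:
    """Sample the cheap model several times and return the most common answer."""
    answers = [ask_cheap_model(prompt) for _ in range(n_samples)]
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common
```

This works best when answers can be compared exactly (multiple choice, numbers, short strings), which is also when the result is machine-adjudicable.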
3.5x on a normal distribution with mean 100 and SD 15 is pretty insane. But I agree with your point: being 26% better on a certain benchmark could be a tiny difference or an incredible improvement (imagine the hardest questions being the Riemann hypothesis, P != NP, etc.).
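As a rough illustration of why raw-point differences on an N(100, 15) scale can look dramatic (treating the 26% figure as a 26-point bump, which may not be what the benchmark numbers actually mean):

```python
from math import erf, sqrt

def percentile(score: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Fraction of the distribution at or below `score` (normal CDF)."""
    return 0.5 * (1.0 + erf((score - mean) / (sd * sqrt(2.0))))

print(percentile(100))  # 0.50: the mean sits at the 50th percentile
print(percentile(126))  # ~0.96: a 26-point bump lands near the 96th percentile
```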
Output price difference: 4.5 is 15x more expensive.
In their model evaluation scores in the appendix, 4.5 is, on average, 26% better. I don't understand the value here.