
Input price difference: 4.5 is 30x more

Output price difference: 4.5 is 15x more

In their model evaluation scores in the appendix, 4.5 is, on average, 26% better. I don't understand the value here.



If you ran the same query set 30x or 15x on the cheaper model (and compensated for all the extra tokens the reasoning model uses), would you be able to realize the same 26% quality gain in a machine-adjudicable kind of way?
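
Roughly what I mean by "machine-adjudicable", as a best-of-N sketch (the model call and the scorer here are stand-ins, not any real API):

    import random

    def call_cheap_model(prompt):
        # stand-in for an actual API call to the cheaper model
        return f"candidate answer {random.randint(0, 9)}"

    def score(prompt, answer):
        # stand-in for an automatic adjudicator (unit tests, a verifier
        # model, exact-match against a known checker, etc.)
        return random.random()

    def best_of_n(prompt, n=15):
        # sample the cheap model n times, keep the highest-scoring answer
        candidates = [call_cheap_model(prompt) for _ in range(n)]
        return max(candidates, key=lambda a: score(prompt, a))

    print(best_of_n("Is 2**31 - 1 prime?"))

Whether the cheaper model can recover the gap this way depends entirely on how good the adjudicator is.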


With a reasoning model you'd get better results than both.


Exactly. Not sure why you'd pick GPT 4.5 over lots of GPT 4o queries or an o1 query


Ignoring latency for a second, one of the tricks for boosting quality is to use consensus. One probably does not need to call the lesser model 30x as much to achieve these sorts of gains. Moreover, you have to take the purported gains with a grain of salt. The models are probably trained on the evaluation sets they are benchmarked against.
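
The consensus trick is basically self-consistency / majority voting, something like this (again, call_cheap_model is just a placeholder for the real call):

    import random
    from collections import Counter

    def call_cheap_model(prompt):
        # stand-in for an actual call to the cheaper model
        return random.choice(["42", "42", "41"])

    def consensus(prompt, k=5):
        # ask the cheap model k times and return the most common answer;
        # usually far fewer than 30x calls are needed for the effect
        answers = [call_cheap_model(prompt) for _ in range(k)]
        return Counter(answers).most_common(1)[0][0]

    print(consensus("What is 6 * 7?"))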


Einstein's IQ = 3.5x a chimpanzee's IQ, right?


3.5x on a normal distribution with mean 100 and SD 15 is pretty insane. But I agree with your point: being 26% better on a given benchmark could be a tiny difference or an incredible improvement (imagine the hardest questions being the Riemann hypothesis, P != NP, etc.).
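
Quick back-of-the-envelope on how insane, assuming the 3.5x is read literally on the human scale (i.e. IQ 350):

    from math import erfc, sqrt

    z = (350 - 100) / 15        # ~16.7 standard deviations above the mean
    p = erfc(z / sqrt(2)) / 2   # one-sided tail probability
    print(z, p)                 # z ~= 16.67, p ~= 1e-62

That tail probability is astronomically smaller than one in the ~8 billion people alive.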



