
What's with the dropped benchmark performance compared to the original o3 release? It was also disappointing not to see o4-mini included.

What dropped benchmark performance?

o3 scores noticeably worse on benchmarks than it did in its original announcement.

Any link / source / anything? You've got quite an opportunity here: an OpenAI employee is claiming there's no difference, and you have something that shows there is.

Yes, the original announcement for o3 and o4-mini:

https://openai.com/index/introducing-o3-and-o4-mini/

o3 scored 91.6 on AIME 2024 and 83.3 on GPQA.

o4-mini scored 93.4 on AIME 2024 and 81.4 on GPQA.

Then, the new announcement:

https://help.openai.com/en/articles/6825453-chatgpt-release-...

o3 scored 90 on AIME 2024 and 81 on GPQA.

o4-mini wasn't measured.

---

Codeforces is the same story: there's a footnote saying they used a different problem set due to saturation, but there's still no common baseline model to compare against.
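
For what it's worth, here's a quick sketch in Python of the o3 deltas, taking the figures quoted above at face value (whether the two runs are directly comparable is a separate question):

    # Rough comparison of the o3 scores cited in this thread.
    # Assumes the two posts' numbers are directly comparable, which may not hold.
    scores = {
        "launch post":   {"AIME 2024": 91.6, "GPQA": 83.3},
        "release notes": {"AIME 2024": 90.0, "GPQA": 81.0},
    }
    for bench in ("AIME 2024", "GPQA"):
        old, new = scores["launch post"][bench], scores["release notes"][bench]
        print(f"{bench}: {old} -> {new} ({new - old:+.1f} points)")

That prints drops of 1.6 points on AIME 2024 and 2.3 points on GPQA.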


The first post measures o3 at high reasoning effort; the second measures it at medium reasoning effort. It's the same model, then and now.


