What's with the dropped benchmark performance compared to the original o3 releas...

refulgentis · 2025-06-11T00:01:27 1749600087

What dropped benchmark performance?

MattDaEskimo · 2025-06-11T14:06:37 1749650797

o3 scores noticeably worse on benchmarks now than it did in its original announcement.

refulgentis · 2025-06-11T16:07:14 1749658034

Any link / source / anything? You've got quite an opportunity here: an OpenAI employee is claiming there's no difference, and you've got something that shows there is.

MattDaEskimo · 2025-06-11T17:22:32 1749662552

Yes, the original announcement for o3 and o4-mini:

https://openai.com/index/introducing-o3-and-o4-mini/

o3 scored 91.6 on AIME 2024 and 83.3 on GPQA.

o4-mini scored 93.4 on AIME 2024 and 81.4 on GPQA.

Then, the new announcement:

https://help.openai.com/en/articles/6825453-chatgpt-release-...

o3 scored 90 on AIME 2024 and 81 on GPQA.

o4-mini wasn't measured
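
For anyone who wants the gap spelled out, here is a minimal sketch that subtracts the o3 numbers quoted in this thread. The scores are copied from the comment as written, not re-checked against either page, and the names `original`, `new`, and `score_drops` are made up for illustration.

```python
# o3 scores as quoted in this thread (not independently verified).
original = {"AIME 2024": 91.6, "GPQA": 83.3}  # original announcement
new = {"AIME 2024": 90.0, "GPQA": 81.0}       # later release notes


def score_drops(before, after):
    """Return before-minus-after per benchmark, rounded to one decimal."""
    return {name: round(before[name] - after[name], 1) for name in before}


o3_drops = score_drops(original, new)
for name, drop in o3_drops.items():
    print(f"o3 {name}: {drop} points lower")
```

A raw difference like this says nothing about whether the two runs used the same settings, so it only shows the size of the gap, not its cause.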

---

The Codeforces score is the same, but a footnote says they're using a different dataset due to saturation, and there's still no baseline model to compare against.

tedsanders · 2025-06-16T02:00:49 1750039249

The first post measures o3 at high reasoning effort. The second measures it at medium reasoning effort. It’s the same model, then and now.