
It's the other way around on their new SWE-Lancer benchmark, which is pretty interesting: GPT-4.5 scores 32.6%, while o3-mini scores 10.8%.


To put that in context, Claude 3.5 Sonnet (new), a model we've had for months now and which by all accounts was cheaper to train and is cheaper to use, still beats GPT-4.5 on SWE-Lancer Diamond: 36.1% vs 32.6% [0]. The more I look into this release, the more confused I get.

[0] https://arxiv.org/pdf/2502.12115



