
It's the other way around on their new SWE-Lancer benchmark, which is pretty interesting: GPT-4.5 scores 32.6%, while o3-mini scores 10.8%.


To put that in context, Claude 3.5 Sonnet (new), a model we've had for months now and which by all accounts was cheaper to train and is cheaper to use, still beats GPT-4.5 on SWE-Lancer Diamond: 36.1% vs 32.6% [0]. The more I look into this release, the more confused I get.

[0] https://arxiv.org/pdf/2502.12115



