
GPT-4.5 Preview scored 45% on aider's polyglot coding benchmark [0]. OpenAI describes it as "good at creative tasks" [1], so perhaps it is not primarily intended for coding.

  65% Sonnet 3.7, 32k think tokens (SOTA)
  60% Sonnet 3.7, no thinking
  48% DeepSeek V3
  45% GPT-4.5 Preview <===
  27% ChatGPT-4o
  23% GPT-4o

[0] https://aider.chat/docs/leaderboards/

[1] https://platform.openai.com/docs/models#gpt-4-5



I was waiting for your comment and wow... that's bad.

I guess they are ceding the LLM-for-coding market to Anthropic? I remember seeing an industry report somewhere that claimed software development is the largest user of LLMs, so it seems weird to give up in this area.


4.5 lies on a different path than their STEM models.

o3-mini is an extremely powerful coding model and is unquestionably in the same league as Sonnet 3.7. o3 is still the top overall STEM model.


No way, I've found o3-mini to be terrible. It's not as good as R1/Sonnet 3.5.


I assume they are going all in on the "new Google" direction. Embedded ads are coming soon to the free version (chat.com), I guess.



