As a benchmark, why do you find the 'opinion' of an LLM useful? The question is ...

Topfi · 2025-02-27T20:29:14

Not OP, but likely because that was the only metric (or benchmark, or whatever you want to call it) that OpenAI showcased in the stream and on the blog to highlight the improvement from 4o to 4.5. To say it is not a good metric for comparison would be an understatement, not least because prompting can have a massive impact on the result.