My understanding is that the metric for LMArena is simply whether voters judge one answer “better” than another, for a deliberately 100% subjective definition of “better.”
My experience has been that typical LLMs put more “preamble” into what they say, easing the reader (and priming themselves autoregressively) into answers with some relevant introduction of the subject, sometimes justifying the rationale and implications behind things. But for o1, that transient period and the underlying reasoning are part of OpenAI’s special sauce, and they deliberately and aggressively take steps to hide them from users.
o1 will get correct answers to hard problems more often than other models (look at the math/coding/hard subsections on the leaderboard, where, anecdotal experiences aside, it is #1), and there’s a strong correlation between correctness and a high score in those domains because getting code or math “right” matters more than the justification or explanation. But in more general domains where there isn’t necessarily an objective right or wrong, I know the vibe matters a lot more to me, and that’s something o1 struggles with.