Hacker News

Depends on what you're doing.


> Depends on what you're doing.

Trying to get an accurate answer (one best correlated with objective truth) on a topic I don't already know the answer to (or why would I ask?). This is, to me, the challenge with the "it depends, tune it" answers that always come up around these tools: tuning requires already knowing the right answer, which means the tool wasn't useful for that task in the first place.


If cost is no concern (as in infrequent one-off tasks) then you can always go with the biggest model with the most reasoning. Maybe compare it with the biggest model with no/less reasoning, since sometimes reasoning can hurt (just as with humans overthinking something).

If you have a task you do frequently, you need some kind of benchmark. That might just be comparing how well the output of the smaller models holds up against the output of the bigger model, if you don't know the ground truth.
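A minimal sketch of that idea: use the bigger model's answers as a stand-in for ground truth and measure how often the smaller model agrees. Everything here is hypothetical for illustration (the model outputs are hard-coded, and the crude token-overlap scorer stands in for whatever comparison you'd actually use):

```python
# Sketch: score a smaller model against a bigger model used as a
# proxy for ground truth. Outputs below are hypothetical placeholders,
# not real API calls.

def token_jaccard(a: str, b: str) -> float:
    """Crude agreement score: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def agreement_rate(small_outputs, big_outputs, threshold=0.5):
    """Fraction of prompts where the small model 'holds up' against
    the big model, i.e. overlap meets the threshold."""
    scores = [token_jaccard(s, b) for s, b in zip(small_outputs, big_outputs)]
    return sum(sc >= threshold for sc in scores) / len(scores)

# Hypothetical answers to the same three prompts:
big = ["paris is the capital of france",
       "water boils at 100 c at sea level",
       "the mitochondria is the powerhouse of the cell"]
small = ["paris is the capital of france",
         "water boils at 90 c",
         "cells contain mitochondria"]

print(agreement_rate(small, big))  # 2 of 3 answers hold up
```

In practice you'd swap the token-overlap scorer for something stronger (an embedding similarity, or an LLM judge), but the shape of the benchmark is the same.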


I agree. Public benchmarks aren't very useful for a bunch of reasons. Any company relying on LLMs for a critical function should have its own internal benchmark system. I maintain such a system for my job. If you are able, use the same prompt every time. It's fun to be able to include models like the original Bard on our leaderboard.
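The "same prompt every time" approach can be sketched as a small leaderboard: a fixed prompt set is run through every model and scored the same way, so results stay comparable across models and over time. The model names, prompts, and toy scorer below are all hypothetical placeholders:

```python
# Sketch of an internal leaderboard: a fixed prompt set is scored
# identically for every model so results stay comparable over time.
# Models, prompts, and the scorer here are hypothetical.

from dataclasses import dataclass, field

PROMPTS = ["Summarize this contract clause...",
           "Extract the invoice total..."]

@dataclass
class Leaderboard:
    # model name -> list of per-prompt scores in [0, 1]
    results: dict = field(default_factory=dict)

    def evaluate(self, model_name, answer_fn, score_fn):
        """Run every benchmark prompt through the model and score it."""
        self.results[model_name] = [score_fn(p, answer_fn(p)) for p in PROMPTS]

    def ranking(self):
        """Models sorted by mean score, best first."""
        return sorted(self.results.items(),
                      key=lambda kv: sum(kv[1]) / len(kv[1]),
                      reverse=True)

board = Leaderboard()
# Toy stand-ins: answer_fn would call the model, score_fn would grade it.
board.evaluate("big-model", lambda p: p.lower(), lambda p, a: 0.9)
board.evaluate("original-bard", lambda p: p, lambda p, a: 0.4)
print([name for name, _ in board.ranking()])  # ['big-model', 'original-bard']
```

Keeping prompts frozen is what makes it meaningful to rank a years-old model alongside today's.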



