This is an interesting overview, thank you. Different tasks, different models, everyday usage, and pretty complete (while still opinionated, which I like).
However, after checking the results, my personal overall winner, if I had to pick only ONE, would probably be
deepseek/deepseek-chat-v3-0324
which is a good compromise between fast, cheap and good :-) Only for specific tasks (write a poem...) I would prefer a thinking model.
They released deepseek/deepseek-chat-v3.1 shortly after I did the evals, and that's what I now use 20+ times a day for all my questions. It replaces chat-v3 and r1, depending on whether you enable reasoning or not.