I'd encourage people to test for themselves (and to let the Chatbot Arena scores settle) before getting caught up in too much hype. In my own eval, gemma-2-27b-it (tested on AI Studio) performed far worse than Llama 3 70B, especially on reasoning and basic world-understanding queries.
I also prefer to use "Coding" or "Hard Prompts (Overall)" instead of the default "Overall" in Chatbot Arena scores to judge the actual performance level of LLMs. Those categories seem much more aligned with my vibe tests for reasoning. I suspect "Overall" includes a lot of creative tasks, which isn't what I use most in my daily work.