
I'd encourage people to test for themselves (and to let the Chatbot Arena scores settle) before getting caught up in too much hype. I just did a personal eval, and I found gemma-2-27b-it (tested on AI Studio) performed far worse than Llama 3 70B, especially on reasoning and basic world-understanding queries.
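For anyone who wants to reproduce this kind of side-by-side vibe check locally, here's a minimal sketch using the Ollama Python client. The model tags and prompts are my own assumptions about a local setup, not the actual eval described above:

    # Minimal side-by-side eval sketch. Assumes both models have been
    # pulled locally via Ollama ("gemma2:27b" and "llama3:70b" are the
    # assumed local tags). pip install ollama
    import ollama

    # Substitute your own reasoning / world-understanding probes here.
    PROMPTS = [
        "I have 3 apples. I eat one pear and give away one apple. "
        "How many apples do I have left?",
        "If it takes 5 machines 5 minutes to make 5 widgets, how long "
        "does it take 100 machines to make 100 widgets?",
    ]

    MODELS = ["gemma2:27b", "llama3:70b"]

    for prompt in PROMPTS:
        print(f"=== {prompt}")
        for model in MODELS:
            # Send the same single-turn prompt to each model and
            # print the replies side by side for manual comparison.
            reply = ollama.chat(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            print(f"--- {model}:\n{reply['message']['content']}\n")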


I also prefer to use "Coding" or "Hard Prompts (Overall)" instead of the default "Overall" in Chatbot Arena scores to gauge the actual performance level of LLMs. Those categories seem much more aligned with my vibe tests for reasoning. I suspect the "Overall" score includes a lot of creative tasks, which isn't what I use most in my daily work.


Same. I tried 27B and found it nowhere near llama3-70b.

Even Llama 3 8B did better than Gemma 27B in some of my tests.



