
The best thing would be blind preference tests for a wide variety of problems across domains, but unfortunately even these can be gamed if desired. The upside is that gaming them requires being explicitly malicious, which I'd imagine would result in whistleblowing at some point. However, Claude's position on leaderboards outside of WebDev Arena makes me skeptical.

