
This, plus some benchmarks are shitty, so a rational model should be allowed to ask clarifying questions instead of answering them.


Yes, a lot of those have pretty egregious annotation mistakes. Once you get into the high percentages, it's often worth going through your dataset alongside your model's predictions and comparing the two. Obviously you can't do that on academic benchmarks (though some papers still do).
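
For what it's worth, a rough sketch of that comparison in Python. This is only one way to do it: the function name `flag_suspect_labels`, the 0.9 confidence cutoff, and the toy data are all made up for illustration. It flags examples where the model confidently disagrees with the annotation, so a human can review those first.

```python
def flag_suspect_labels(labels, predictions, confidences, min_confidence=0.9):
    """Return indices where a confident prediction disagrees with the annotation.

    Hypothetical helper: the 0.9 default cutoff is an arbitrary assumption.
    """
    suspects = [
        i
        for i, (label, pred, conf) in enumerate(zip(labels, predictions, confidences))
        if label != pred and conf >= min_confidence
    ]
    # Review the most confident disagreements first.
    return sorted(suspects, key=lambda i: confidences[i], reverse=True)


# Toy data, invented for this example.
labels = ["cat", "dog", "cat", "dog"]
predictions = ["cat", "cat", "dog", "dog"]
confidences = [0.99, 0.97, 0.55, 0.80]

for i in flag_suspect_labels(labels, predictions, confidences):
    print(i, labels[i], predictions[i], confidences[i])
```

A flagged index only means the annotation deserves a second look; the model can just as easily be the one that's wrong.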



