
Yeah, it was an abysmal result (any hallucination rate of 50% or more on that benchmark is pretty bad), and worse than o1-mini in the SimpleQA paper. On that topic, Sonnet 3.5 "Old" hallucinates less than GPT-4.5, just for a bit of added perspective.

