> All evals are 0 shot

My bet is that this is the reason they score so high on "their" benchmarks. For models that are trained purely on unlabelled data, like LLaMA, 0-shot evaluation won't work well.

E.g., for LLaMA, HellaSwag accuracy is 57.13% in their benchmark, compared to 78.59% in [1].

[1]: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
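
To make the 0-shot/few-shot distinction concrete, here is a minimal sketch (illustrative only; build_prompt is a made-up helper, not the harness's actual HellaSwag template). In a k-shot eval the prompt opens with k solved examples, so a base model can infer the task format; 0-shot gives it only the bare query:

    # Hypothetical prompt builder, for illustration only.
    def build_prompt(query, examples=()):
        # k-shot: prepend k solved (context, completion) pairs.
        # 0-shot: examples is empty, so the model sees only the query.
        shots = "".join(f"{ctx} {ans}\n\n" for ctx, ans in examples)
        return shots + query

    zero_shot = build_prompt("A man is sitting on a roof. He")
    one_shot = build_prompt(
        "A man is sitting on a roof. He",
        examples=[("A woman opens a jar. She", "twists the lid off.")],
    )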

I think this is simply the default in lm-evaluation-harness. They said they ran every benchmark they could out of the box.
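
If so, the default is easy to override. A sketch assuming a recent lm-evaluation-harness, where simple_evaluate accepts a num_fewshot argument (the pretrained model id below is just a placeholder); the leaderboard in [1] runs HellaSwag few-shot rather than 0-shot, which would account for much of the gap:

    # Sketch against a recent lm-evaluation-harness (v0.4-style API).
    from lm_eval import simple_evaluate

    results = simple_evaluate(
        model="hf",
        model_args="pretrained=huggyllama/llama-7b",  # placeholder model id
        tasks=["hellaswag"],
        num_fewshot=10,  # override the 0-shot default
    )
    print(results["results"]["hellaswag"])

The CLI exposes the same knob as --num_fewshot.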
