My bet is that this is the reason they are scoring high in "their" benchmarks. For models that are trained only on completely unlabelled data, like LLaMA, 0-shot won't work well.
e.g. for LLaMA, HellaSwag accuracy is 57.13% in their benchmark compared to 78.59% in [1].
[1]: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...