
The performance results here are interesting. G-Ultra seems to meet or exceed GPT-4V on all the text benchmarks except HellaSwag, where it lags significantly: 87.8% vs 95.3%, respectively.



I wonder how that weird HellaSwag lag is possible. Is there something really special about that benchmark?


The tech report seems to hint that GPT-4 may have had some training/test data contamination, so its HellaSwag performance may be overstated.


From the report:

"As part of the evaluation process, on a popular benchmark, HellaSwag (Zellers et al., 2019), we find that an additional hundred finetuning steps on specific website extracts corresponding to the HellaSwag training set (which were not included in Gemini pretraining set) improve the validation accuracy of Gemini Pro to 89.6% and Gemini Ultra to 96.0%, when measured with 1-shot prompting (we measured GPT-4 obtained 92.3% when evaluated 1-shot via the API). This suggests that the benchmark results are susceptible to the pretraining dataset composition. We choose to report HellaSwag decontaminated results only in a 10-shot evaluation setting. We believe there is a need for more robust and nuanced standardized evaluation benchmarks with no leaked data."


Great catch!


Yeah, a lot of local models fall short on that benchmark as well. I wonder what was different about GPT-3.5/4's training data that would lead to its great HellaSwag performance.
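
Re contamination: the usual way labs check for this is exact n-gram overlap between benchmark items and the pretraining corpus (the GPT-3 paper used 13-grams). A toy sketch below; the threshold, tokenization, and corpus access are all simplified, and this isn't Google's actual decontamination pipeline:

    def ngrams(text, n=13):
        toks = text.lower().split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def is_contaminated(eval_item, corpus_docs, n=13):
        # Flag an eval item if any of its n-grams appears verbatim in a corpus document.
        item_grams = ngrams(eval_item, n)
        return any(item_grams & ngrams(doc, n) for doc in corpus_docs)

    # Hypothetical inputs: in practice the corpus is the full pretraining set
    # and the eval items are HellaSwag contexts plus their gold endings.
    corpus_docs = ["example pretraining web page text ..."]
    eval_items = ["example HellaSwag context and ending ..."]
    flagged = [x for x in eval_items if is_contaminated(x, corpus_docs)]
    print(f"{len(flagged)}/{len(eval_items)} items flagged as possibly leaked")

Flagged items are then dropped or reported separately, which is presumably what "decontaminated results" means in the quote above.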



