The performance results here are interesting. Gemini Ultra seems to meet or exceed GPT-4 on all text benchmark tasks with the exception of HellaSwag, where there's a significant lag: 87.8% vs. 95.3%, respectively.
"As part of the evaluation process, on a popular benchmark, HellaSwag (Zellers et al., 2019), we find that an additional hundred finetuning steps on specific website extracts corresponding to the HellaSwag training set (which were not included in Gemini pretraining set) improve the validation accuracy of Gemini Pro to 89.6% and Gemini Ultra to 96.0%, when measured with 1-shot prompting (we measured GPT-4 obtained 92.3% when evaluated 1-shot via the API). This suggests that the benchmark results are susceptible to the pretraining dataset
composition. We choose to report HellaSwag decontaminated results only in a 10-shot evaluation setting. We believe there is a need for more robust and nuanced standardized evaluation benchmarks with no leaked data."
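For anyone curious what "1-shot" / "10-shot" means concretely here: HellaSwag is typically scored by having the model assign a log-likelihood to each of the four candidate endings and taking the argmax, with K solved examples prepended to the prompt. Here's a rough sketch of that loop with a local HF model; the model name, prompt format, and dataset field handling are my guesses at a standard setup, not the report's actual evaluation harness:

```python
# Sketch of a K-shot HellaSwag eval: score each candidate ending by its
# log-likelihood under the model, pick the argmax, compare to the label.
# "gpt2" is just a placeholder causal LM with a small context window, so
# keep K low; the report's numbers are 1-shot and 10-shot through the
# Gemini / GPT-4 APIs, not this local approximation.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # placeholder; any causal LM works for the sketch
K_SHOT = 1       # the report evaluates 1-shot and (decontaminated) 10-shot

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
ds = load_dataset("hellaswag", split="validation")

def ending_logprob(context: str, ending: str) -> float:
    """Sum of token log-probs of `ending` conditioned on `context`.
    (Approximate: assumes the context/ending token boundary is clean.)"""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-prob of each token given the preceding tokens
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    n_ctx = ctx_ids.shape[1]
    return token_lp[0, n_ctx - 1:].sum().item()  # only the ending's tokens

def build_prompt(example, shots):
    """Prepend K solved demonstrations, then the query context."""
    demo = "\n\n".join(s["ctx"] + " " + s["endings"][int(s["label"])] for s in shots)
    return (demo + "\n\n" if demo else "") + example["ctx"]

shots = [ds[i] for i in range(K_SHOT)]
correct = total = 0
for ex in ds.select(range(K_SHOT, K_SHOT + 200)):  # small slice for the sketch
    prompt = build_prompt(ex, shots)
    scores = [ending_logprob(prompt, e) for e in ex["endings"]]
    correct += int(max(range(4), key=lambda i: scores[i]) == int(ex["label"]))
    total += 1
print(f"{K_SHOT}-shot accuracy on {total} examples: {correct / total:.3f}")
```

The contamination point in the quote is exactly why the K-shot number alone doesn't tell you much: if the validation continuations (or near-duplicates) appeared in pretraining, the likelihood scoring above rewards memorization rather than commonsense completion, which is what the decontaminated 10-shot reporting is trying to control for.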
Yeah, a lot of local models fall short on that benchmark as well. I wonder what was different about GPT-3.5/4's training data that would lead to its great HellaSwag performance.