The performance results here are interesting. Gemini Ultra seems to meet or exceed GPT-4 on all text benchmark tasks with the exception of HellaSwag, where there's a significant lag: 87.8% vs. 95.3%, respectively.
"As part of the evaluation process, on a popular benchmark, HellaSwag (Zellers et al., 2019), we find that an additional hundred finetuning steps on specific website extracts corresponding to the HellaSwag training set (which were not included in Gemini pretraining set) improve the validation accuracy of Gemini Pro to 89.6% and Gemini Ultra to 96.0%, when measured with 1-shot prompting (we measured GPT-4 obtained 92.3% when evaluated 1-shot via the API). This suggests that the benchmark results are susceptible to the pretraining dataset
composition. We choose to report HellaSwag decontaminated results only in a 10-shot evaluation setting. We believe there is a need for more robust and nuanced standardized evaluation benchmarks with no leaked data."
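For anyone curious what "1-shot" / "10-shot" means concretely here: HellaSwag is typically scored by having the model assign a log-likelihood to each of the four candidate endings and taking the argmax, with K solved examples prepended to the prompt. Here's a rough sketch of that loop with a local HF model; the model name, prompt format, and dataset field handling are my guesses at a standard setup, not the report's actual evaluation harness:

```python
# Sketch of a K-shot HellaSwag eval: score each candidate ending by its
# log-likelihood under the model, pick the argmax, compare to the label.
# "gpt2" is just a placeholder causal LM with a small context window, so
# keep K low; the report's numbers are 1-shot and 10-shot through the
# Gemini / GPT-4 APIs, not this local approximation.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # placeholder; any causal LM works for the sketch
K_SHOT = 1       # the report evaluates 1-shot and (decontaminated) 10-shot

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
ds = load_dataset("hellaswag", split="validation")

def ending_logprob(context: str, ending: str) -> float:
    """Sum of token log-probs of `ending` conditioned on `context`.
    (Approximate: assumes the context/ending token boundary is clean.)"""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-prob of each token given the preceding tokens
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    n_ctx = ctx_ids.shape[1]
    return token_lp[0, n_ctx - 1:].sum().item()  # only the ending's tokens

def build_prompt(example, shots):
    """Prepend K solved demonstrations, then the query context."""
    demo = "\n\n".join(s["ctx"] + " " + s["endings"][int(s["label"])] for s in shots)
    return (demo + "\n\n" if demo else "") + example["ctx"]

shots = [ds[i] for i in range(K_SHOT)]
correct = total = 0
for ex in ds.select(range(K_SHOT, K_SHOT + 200)):  # small slice for the sketch
    prompt = build_prompt(ex, shots)
    scores = [ending_logprob(prompt, e) for e in ex["endings"]]
    correct += int(max(range(4), key=lambda i: scores[i]) == int(ex["label"]))
    total += 1
print(f"{K_SHOT}-shot accuracy on {total} examples: {correct / total:.3f}")
```

The contamination point in the quote is exactly why the K-shot number alone doesn't tell you much: if the validation continuations (or near-duplicates) appeared in pretraining, the likelihood scoring above rewards memorization rather than commonsense completion, which is what the decontaminated 10-shot reporting is trying to control for.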
Yeah, a lot of local models fall short on that benchmark as well. I wonder what was different about GPT-3.5/4's training data that would lead to its great HellaSwag performance.