
Pro benchmarks are here: https://storage.googleapis.com/deepmind-media/gemini/gemini_...

Sadly, it's 3.5 quality :(



Lol that's why it's hidden in a PDF.

They basically announced GPT 3.5, then. Big whoop; by the time Ultra is out, GPT-5 is probably also out.


Isn't having GPT 3.5-level quality still a pretty big deal? Obviously they are behind, but does anyone else offer that?

3.5 is still highly capable, and Google investing a lot into making it multimodal, combined with potential integration with their other products, makes it quite valuable. Not everyone likes having to switch to ChatGPT for queries.


Yeah, right now the leaderboard is pretty much: GPT4 > GPT 3.5 > Claude > Llama2. If Google just released something (Gemini Pro) on par with GPT 3.5 and will release something (Gemini Ultra) on par with GPT 4 in Q1 of next year while actively working on Gemini V2, they are very much back in the game.


I'd have to disagree a bit: Claude 2 is better than 3.5 in my experience (maybe in benchmarks too; I haven't searched for them specifically), but worse than GPT-4.


> Yeah, right now the leaderboard is pretty much: GPT4 > GPT 3.5 > Claude > Llama2.

Is it though? I mean, free (gratis) public locally-usable models are more than just "Llama2", and Llama2 itself is pretty far down the HuggingFace open model leaderboard. (It's true a lot of the models above it are Llama2 derivatives, but that's not universally true, either.)


> Obviously they are behind, but does anyone else offer that?

Claude by Anthropic is out, offers more, and is being actively used.


I thought there were some open-source models in the 70-120B range that were GPT3.5 quality?


Measuring LLM quality is problematic (and may not even be meaningful in a general sense; the idea that there is a measurable strict ordering of general quality that is applicable to all use cases, or even strongly predictive of utility for particular uses, may be erroneous).

If you trust Winogrande scores (one of the few benchmarks where I could find GPT3.5 and GPT4 ratings [0] that is also on the HuggingFace leaderboard [1]), there are a lot of models between GPT3.5 and GPT4, some of them 34B-parameter models (Yi-34B and its derivatives), and una_cybertron_7b comes close to GPT3.5. A rough way to re-run that kind of check locally is sketched after the links.

[0] https://llm-leaderboard.streamlit.app/

[1] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
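
For anyone who wants to sanity-check a Winogrande number themselves, here's a minimal sketch using EleutherAI's lm-evaluation-harness (the same harness the HuggingFace leaderboard is built on). The model id, few-shot count, and batch size are just illustrative, and the exact layout of the results dict can differ between harness versions.

    # Rough sketch: score an open model on Winogrande with EleutherAI's
    # lm-evaluation-harness (pip install lm-eval). Settings are illustrative;
    # the HF leaderboard runs Winogrande 5-shot.
    import json

    import lm_eval  # assumed: lm-eval >= 0.4, which exposes simple_evaluate

    results = lm_eval.simple_evaluate(
        model="hf",                                          # transformers backend
        model_args="pretrained=mistralai/Mistral-7B-v0.1",   # any HF model id
        tasks=["winogrande"],
        num_fewshot=5,
        batch_size=8,
    )

    # Result keys vary a bit across harness versions; dump everything and
    # read off the winogrande accuracy to compare with the leaderboard entry.
    print(json.dumps(results["results"], indent=2, default=str))

A run like this won't match the leaderboard to the decimal (prompting and harness version matter), but it's usually close enough to sanity-check a ranking.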


It depends on what's being evaluated, but from what I've read, Mistral is also fairly competitive at a much smaller size.

One of the biggest problems right now is that there isn't really a great way to evaluate the performance of models, which (among other issues) results in every major foundation model release claiming to be competitive with the SOTA.


Yup, it's all a performance for the investors


+1. The investors are the customers of this release, not end users.


Table 2 indicates Pro is generally closer to 4 than to 3.5, and that Ultra is on par with 4.


If you think eval numbers mean a model is close to 4, then you clearly haven't been scarred by the legions of open-source models which claim 4-level evals but clearly struggle to actually perform challenging work as soon as you start testing them.

Perhaps Gemini is different and Google has tapped into their own OpenAI-like secret sauce, but I'm not holding my breath


Ehhh, not really; it even loses to 3.5 on 2/8 tests. For me it feels pretty lackluster: I'm using GPT-4 probably close to 100 times a day or more, and this would be a huge downgrade.


Pro is approximately in the middle between GPT 3.5 and GPT 4 on four measures (MMLU, BIG-Bench-Hard, Natural2Code, DROP), closer to 3.5 on two (MATH, Hellaswag), and closer to 4 on the remaining two (GSM8K, HumanEval). Two one way, two the other way, and four in the middle.

So it's a split almost right down the middle, if anything closer to 4, at least if you assume the benchmarks to be of equal significance.
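
To make the "closer to 3.5 vs. closer to 4" bucketing concrete, here is a toy sketch of the classification I'm describing. The scores are made-up placeholders, not the report's numbers; the real per-benchmark values are in Table 2 of the PDF linked at the top of the thread.

    # Toy illustration of the "which model is Pro closer to?" split.
    # The scores are placeholders; substitute the real numbers from Table 2.
    BENCHMARKS = {
        # name: (gpt35, gemini_pro, gpt4)  <- placeholder values
        "benchmark_a": (60.0, 72.0, 80.0),
        "benchmark_b": (70.0, 72.0, 90.0),
    }

    def classify(gpt35: float, pro: float, gpt4: float, tol: float = 0.15) -> str:
        """Label Pro as closer to 3.5, closer to 4, or roughly in the middle.

        `tol` is the fraction of the 3.5-to-4 gap counted as "the middle".
        """
        gap = gpt4 - gpt35
        if gap <= 0:
            return "no meaningful gap to split"
        position = (pro - gpt35) / gap       # 0.0 = at 3.5, 1.0 = at 4
        if position < 0.5 - tol:
            return "closer to 3.5"
        if position > 0.5 + tol:
            return "closer to 4"
        return "roughly in the middle"

    for name, (g35, pro, g4) in BENCHMARKS.items():
        print(f"{name}: {classify(g35, pro, g4)}")

With equal weighting, that's all this breakdown is: count the buckets and see where the split lands.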


> at least if you assume the benchmarks to be of equal significance.

That is an excellent point. Performance of Pro will definitely depend on the use case, given the variability between 3.5 and 4. It will be interesting to see user reviews on different tasks. But the two-quarter lead time for Ultra means it may as well not have been announced; a lot can happen in 3-6 months.



