
Pro benchmarks are here: https://storage.googleapis.com/deepmind-media/gemini/gemini_...

Sadly, it's 3.5 quality :(



Lol that's why it's hidden in a PDF.

They basically announced GPT 3.5, then. Big whoop; by the time Ultra is out, GPT-5 is probably also out.


Isn't having GPT 3.5-level quality still a pretty big deal? Obviously they are behind, but does anyone else offer that?

3.5 is still highly capable, and Google investing a lot into making it multimodal, combined with potential integration with their other products, makes it quite valuable. Not everyone likes having to switch to ChatGPT for queries.


Yeah, right now the leaderboard is pretty much: GPT4 > GPT 3.5 > Claude > Llama2. If Google just released something (Gemini Pro) on par with GPT 3.5 and will release something (Gemini Ultra) on par with GPT 4 in Q1 of next year while actively working on Gemini V2, they are very much back in the game.


I'd have to disagree a bit: Claude 2 is better than 3.5 in my experience (maybe in benchmarks too; I haven't searched for them specifically), but worse than GPT-4.


> Yeah, right now the leaderboard is pretty much: GPT4 > GPT 3.5 > Claude > Llama2.

Is it though? I mean, free (gratis) public locally-usable models are more than just "Llama2", and Llama2 itself is pretty far down the HuggingFace open model leaderboard. (It's true a lot of the models above it are Llama2 derivatives, but that's not universally true, either.)


> Obviously they are behind, but does anyone else offer that?

Claude by Anthropic is out, offers more, and is being actively used.


I thought there were some open-source models in the 70-120B range that were GPT3.5 quality?


Measuring LLM quality is problematic (and may not even be meaningful in a general sense; the idea that there is a measurable strict ordering of general quality that is applicable to all use cases, or even strongly predictive of utility for particular uses, may be erroneous).

If you trust Winogrande scores (one of the few benchmarks where I could find GPT3.5 and GPT4 ratings [0] that is also on the HuggingFace leaderboard [1]), there are a lot of models between GPT3.5 and GPT4, some of them 34B-parameter models (Yi-34B and its derivatives), and una_cybertron_7b comes close to GPT3.5. A rough way to re-run that kind of check locally is sketched after the links.

[0] https://llm-leaderboard.streamlit.app/

[1] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
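
For anyone who wants to sanity-check a Winogrande number themselves, here's a minimal sketch using EleutherAI's lm-evaluation-harness (the same harness the HuggingFace leaderboard is built on). The model id, few-shot count, and batch size are just illustrative, and the exact layout of the results dict can differ between harness versions.

    # Rough sketch: score an open model on Winogrande with EleutherAI's
    # lm-evaluation-harness (pip install lm-eval). Settings are illustrative;
    # the HF leaderboard runs Winogrande 5-shot.
    import json

    import lm_eval  # assumed: lm-eval >= 0.4, which exposes simple_evaluate

    results = lm_eval.simple_evaluate(
        model="hf",                                          # transformers backend
        model_args="pretrained=mistralai/Mistral-7B-v0.1",   # any HF model id
        tasks=["winogrande"],
        num_fewshot=5,
        batch_size=8,
    )

    # Result keys vary a bit across harness versions; dump everything and
    # read off the winogrande accuracy to compare with the leaderboard entry.
    print(json.dumps(results["results"], indent=2, default=str))

A run like this won't match the leaderboard to the decimal (prompting and harness version matter), but it's usually close enough to sanity-check a ranking.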


It depends on what's being evaluated, but from what I've read, Mistral is also fairly competitive at a much smaller size.

One of the biggest problems right now is that there isn't really a great way to evaluate the performance of models, which (among other issues) results in every major foundation model release claiming to be competitive with the SOTA.


Yup, it's all a performance for the investors


+1. The investors are the customers of this release, not end users.


Table 2 indicates Pro is generally closer to 4 than to 3.5, and that Ultra is on par with 4.


If you think eval numbers mean a model is close to 4, then you clearly haven't been scarred by the legions of open-source models which claim 4-level evals but clearly struggle to actually perform challenging work as soon as you start testing them.

Perhaps Gemini is different and Google has tapped into their own OpenAI-like secret sauce, but I'm not holding my breath


Ehhh, not really; it even loses to 3.5 on 2/8 tests. For me it feels pretty lackluster: I'm using GPT-4 probably close to 100 times a day or more, and this would be a huge downgrade.


Pro is approximately in the middle between GPT 3.5 and GPT 4 on four measures (MMLU, BIG-Bench-Hard, Natural2Code, DROP), closer to 3.5 on two (MATH, Hellaswag), and closer to 4 on the remaining two (GSM8K, HumanEval). Two one way, two the other way, and four in the middle.

So it's a split almost right down the middle, if anything closer to 4, at least if you assume the benchmarks to be of equal significance.
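
To make the "closer to 3.5 vs. closer to 4" bucketing concrete, here is a toy sketch of the classification I'm describing. The scores are made-up placeholders, not the report's numbers; the real per-benchmark values are in Table 2 of the PDF linked at the top of the thread.

    # Toy illustration of the "which model is Pro closer to?" split.
    # The scores are placeholders; substitute the real numbers from Table 2.
    BENCHMARKS = {
        # name: (gpt35, gemini_pro, gpt4)  <- placeholder values
        "benchmark_a": (60.0, 72.0, 80.0),
        "benchmark_b": (70.0, 72.0, 90.0),
    }

    def classify(gpt35: float, pro: float, gpt4: float, tol: float = 0.15) -> str:
        """Label Pro as closer to 3.5, closer to 4, or roughly in the middle.

        `tol` is the fraction of the 3.5-to-4 gap counted as "the middle".
        """
        gap = gpt4 - gpt35
        if gap <= 0:
            return "no meaningful gap to split"
        position = (pro - gpt35) / gap       # 0.0 = at 3.5, 1.0 = at 4
        if position < 0.5 - tol:
            return "closer to 3.5"
        if position > 0.5 + tol:
            return "closer to 4"
        return "roughly in the middle"

    for name, (g35, pro, g4) in BENCHMARKS.items():
        print(f"{name}: {classify(g35, pro, g4)}")

With equal weighting, that's all this breakdown is: count the buckets and see where the split lands.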


> at least if you assume the benchmarks to be of equal significance.

That is an excellent point. Performance of Pro will definitely depend on the use case, given the variability between 3.5 and 4. It will be interesting to see user reviews on different tasks. But the two-quarter lead time for Ultra means it may as well not have been announced; a lot can happen in 3-6 months.



