It really feels like the reason this is being released now and not months ago is that that's how long it took them to figure out the convoluted combination of different evaluation procedures to beat GPT-4 on the various benchmarks.
"Dearest LLM: Given the following raw benchmark metrics, please compose an HTML table that cherry-picks and highlights the most favorable result in each major benchmark category"
Even not having a moat anymore, with their cash they might still be the biggest search provider 10 years from now. IBM still exists and is worth 146B. I wouldn't be surprised if Google still came out ok.
Assuming they use unique data that only they have to make a better LLM, then everyone is going to leech training examples from them, bringing competition asymptotically closer but never quite reaching it. It's hard to copy-protect a model exposed to the public, as OpenAI is finding out.
Many, many tasks can be executed on local GPUs today without paying a dime to OpenAI, there is no moat. AI likes to learn from other AIs. Give me a million hard problems solved step by step with GPT-5 and I can make Mistral much smarter. Everyone knows this dataset is going to leak in a few months.
Why is that misleading? It shows Gemini with CoT is the best known combination of prompt and LLM on MMLU.
They simply compare the prompting strategies that work best with each model. Otherwise it would be just a comparison of their response to specific prompt engineering.
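To make the distinction concrete: the two strategies differ only in the scaffolding wrapped around the question. A minimal sketch (these prompt templates are illustrative, not the ones actually used in the report):

```python
QUESTION = "If a train travels 60 km in 45 minutes, what is its speed in km/h?"

def direct_prompt(question: str) -> str:
    # Plain answer-only prompting: ask for the answer with no reasoning scaffold.
    return f"Question: {question}\nAnswer:"

def cot_prompt(question: str) -> str:
    # Chain-of-thought prompting: instruct the model to reason before answering.
    return (f"Question: {question}\n"
            "Let's think step by step, then state the final answer.")

print(direct_prompt(QUESTION))
print(cot_prompt(QUESTION))
```

Scores reported under one scaffold aren't directly comparable to scores reported under the other, which is the crux of the disagreement here.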
The places where they use the same methodology seem within the error bars of the cherry picked benchmarks they selected. Maybe for some tasks it's roughly comparable to GPT4 (still a major accomplishment for Google to come close to closing the gap for the current generation of models), but this looks like someone had the goal of showing Gemini beating GPT4 in most areas and worked back from there to figure out how to get there.
Yep, at this point I'd rather they hold their announcements until everybody can access it, not just the beautiful people. I'm excited and want to try it right now, and would actually use it for a PoC I have in mind, but in a few months the excitement will be gone.
It's to their detriment, also. Being told Gemini beats GPT-4 while withholding that what I'm trying out is not the model they're talking about would have me think they're full of crap. They'd be better off making it clear that this is not the one that surpasses GPT-4.
It really is. OpenAI has the Apple model of release - when it's announced, the laptop is in your freaking hands 3 days later.
Google announces vaporware that's never going to come out, or something that will be out in 5 months. It's frustrating and very bad for their image in the LLM space.
This might be the best they can do to maintain any hope among nervous investors. That this may actually be the most rational play available to them would be incredibly sad.
I wonder if the "release" was done in spite of dev knowledge that it isn't really ready. Like "screw it, we want to attract eyeballs even though we know it's premature"
Isn't having a GPT 3.5-level model still a pretty big deal? Obviously they are behind, but does anyone else offer that?
3.5 is still highly capable, and Google investing a lot into making it multimodal, combined with potential integration with their other products, makes it quite valuable. Not everyone likes having to switch to ChatGPT for queries.
Yeah, right now the leaderboard is pretty much: GPT4 > GPT 3.5 > Claude > Llama2. If Google just released something (Gemini Pro) on par with GPT 3.5 and will release something (Gemini Ultra) on par with GPT 4 in Q1 of next year while actively working on Gemini V2, they are very much back in the game.
I'd have to disagree a bit - Claude 2 is better than 3.5 in my experience (maybe in benchmarks too, I haven't searched for them specifically), but worse than GPT-4
> Yeah, right now the leaderboard is pretty much: GPT4 > GPT 3.5 > Claude > Llama2.
Is it though? I mean, free (gratis) public locally-usable models are more than just "Llama2", and Llama2 itself is pretty far down the HuggingFace open model leaderboard. (It's true a lot of the models above it are Llama2 derivatives, but that's not universally true, either.)
Measuring LLM quality is problematic (and may not even be meaningful in a general sense: the idea that there is a measurable strict ordering of general quality that is applicable to all use cases, or even strongly predictive of utility for particular uses, may be erroneous).
If you trust Winogrande scores (one of the few where I could find GPT3.5 and GPT4 [0] ratings that is also on the HuggingFace leaderboard [1]), there are a lot of models between GPT3.5 and GPT4 with some of them being 34B parameter models (Yi-34b and its derivatives), and una_cybertron_7b comes close to GPT3.5.
It depends on what's being evaluated, but from what I've read, Mistral is also fairly competitive at a much smaller size.
One of the biggest problems right now is that there isn't really a great way to evaluate the performance of models, which (among other issues) results in every major foundation model release claiming to be competitive with the SOTA.
If you think eval numbers mean a model is close to 4, then you clearly haven't been scarred by the legions of open source models which claim 4-level evals but clearly struggle to actually perform challenging work as soon as you start testing
Perhaps Gemini is different and Google has tapped into their own OpenAI-like secret sauce, but I'm not holding my breath
Ehhh not really, it even loses to 3.5 on 2/8 tests. For me it feels pretty lackluster considering I'm using GPT-4 probably close to 100 times or more a day and it would be a huge downgrade.
Pro is approximately in the middle between GPT 3.5 and GPT 4 on four measures (MMLU, BIG-Bench-Hard, Natural2Code, DROP), closer to 3.5 on two (MATH, Hellaswag), and closer to 4 on the remaining two (GSM8K, HumanEval). Two one way, two the other way, and four in the middle.
So it's a split almost right down the middle, if anything closer to 4, at least if you assume the benchmarks to be of equal significance.
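That bookkeeping can be made explicit. A sketch with placeholder scores (the numbers and band thresholds below are invented for illustration, not the report's actual benchmark results):

```python
def closer_to(pro: float, gpt35: float, gpt4: float) -> str:
    """Classify whether a Pro score sits nearer GPT 3.5, nearer GPT 4,
    or in a middle band between them."""
    # Normalize Pro's position on the 3.5-to-4 interval: 0.0 = at 3.5, 1.0 = at 4.
    position = (pro - gpt35) / (gpt4 - gpt35)
    # The 0.4/0.6 cutoffs for the "middle" band are arbitrary illustrative choices.
    if position < 0.4:
        return "closer to 3.5"
    if position > 0.6:
        return "closer to 4"
    return "middle"

# Placeholder scores: benchmark -> (Pro, GPT 3.5, GPT 4). Illustrative only.
examples = {"bench_a": (60.0, 50.0, 80.0), "bench_b": (75.0, 50.0, 80.0)}
for name, (pro, g35, g4) in examples.items():
    print(name, closer_to(pro, g35, g4))
```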
> at least if you assume the benchmarks to be of equal significance.
That is an excellent point. Performance of Pro will definitely depend on the use case given the variability between 3.5 to 4. It will be interesting to see user reviews on different tasks. But the 2 quarter lead time for Ultra means it may as well not be announced. A lot can happen in 3-6 months.
I hate this "tierification" of products into categories: normal, pro, max, ultra
Apple does this and it's obvious that they do it to use the "decoy effect" when customers want to shop. Why purchase a measly regular iPhone when you can spend a little more and get the Pro version?
But when it comes to AI, this tierification only leads to disappointment—everyone expects the best models from the FAANGO (including OpenAI), no one expects Google or OpenAI to offer shitty models that underperform their flagships when you can literally run Llama 2 and Mistral models that you can actually own.
I don't understand -- these are all literally tied directly to performance.
They're tiers of computing power and memory. More performance costs more money to produce. The "nano" can fit on a phone, while the others can't.
Are you really objecting to the existence of different price/performance tiers...? Do you object to McDonald's selling 3 sizes of soft drink? There's nothing "decoy" about any of this.
> Do you object to McDonald's selling 3 sizes of soft drink?
Yes, actually, for different reasons - McDonald’s charges only a tiny bit more for the largest size of drink than they do for the smallest (which is easy because soft drinks are a few cents’ worth of syrup and water, and the rest is profit). That pushes people toward huge drinks, which means more sugar, more caffeine, and more addiction.
No, it’s not just to use the “decoy effect.” They do this to share development costs across a whole product line. Low volume, expensive products are subsidized by high volume, mass market devices. Without these tiers, they’d be unable to differentiate the products and so lose the margins of the high end products (and their entire reason for existing).
Unless you expect Apple to just sell the high end devices at a loss? Or do you want the high end chips to be sold in the mass market devices and for Apple to just eat the R&D costs?
> They do this to share development costs across a whole product line. Low volume, expensive products are subsidized by high volume, mass market devices
Usually it’s the other way around. Mass market products have thin margins and are subsidized by high end / B2B products because the customers for those products have infinitely deep pockets.
> Or do you want the high end chips to be sold in the mass market devices and for Apple to just eat the R&D costs?
That's literally what Steve Jobs was steadfast about :). One iPhone for everyone. He even insisted on the Plus models carrying no extra features.
> Usually it’s the other way around. Mass market products have thin margins and are subsidized by high end / B2B products because the customers for those products have infinitely deep pockets.
That's usually what I've seen, but the M1 MacBook Air came out first and the M1 Pro and Max came out much later.
That's commonly caused by things like low yields for the highest end devices/binning not allowing them to make the numbers of the high end products they need.
> Large AI models have tight resources requirements. You physically can't use X billion parameters without ~X billion ~bytes of memory.
Well, X billion parameters times the parameter bit size, divided by eight for bytes. For base models, parameters are generally 32-bit (so 4X bytes), though smaller quantizations are possible and widely used for public models, and I would assume as a cost measure for closed hosted models as well.
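A rough sketch of that arithmetic (the 70B parameter count is an arbitrary illustrative size, not any particular released model):

```python
def model_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory needed just to hold the weights (decimal GB)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

# A hypothetical 70B-parameter model at common precisions:
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{model_memory_gb(70, bits):.0f} GB")
```

At 32-bit that's ~280 GB for the weights alone, which is why 16-bit, 8-bit, and 4-bit quantizations matter so much for serving cost.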
Tierification of AI models is not some business strategy, it is a necessary consequence of the reality that AI is massively compute constrained right now. The size of a model is extremely important for inference time and cost. It just doesn't make sense to release one single model when your method will always yield a family of models with increasing size. The customer can choose a model corresponding to their needs.
I think the expensive ones are used when the customer is the user — e.g. ChatGPT Plus (personal) subscription — and the cheap ones when they are not — e.g. customer support service bots.
I'm honestly 100% okay with it as long as it's reasonable and not confusing to customers. (Not saying Apple isn't somewhat; I mean, buying a non-Pro iPhone 15 and not being able to view WebM files feels literally fucking insane, and that's apparently how that works, but that's a rant for a different thread.) In cases like this, presumably the idea isn't actually feature-gating, it's scaling up. AI inference costs compute time, and although I have no idea if the inference occurs on special hardware or not, if it does, I can only presume that scaling up the special hardware to meet demand is challenging and very much not like scaling up e.g. a typical web service.
IMO, Tiers can be useful when they make sense and aren't just for artificial market segmentation.
My guess is they're branding it in this way to obfuscate the number of parameters used, which makes sense because more parameters doesn't necessarily mean a better model. It's kind of like the "number of bits" competition in video game consoles back in the 90s.
I think it depends. It's always worth having a small, fast model for some tasks, one you can run completely offline on a mobile CPU. Maybe not as a chat companion, but for text understanding or indexing all your messages and photos for search, it may be enough.
Technical paper: https://goo.gle/GeminiPaper
Some details:
- 32k context length
- efficient attention mechanisms (e.g., multi-query attention (Shazeer, 2019))
- audio input via Universal Speech Model (USM) (Zhang et al., 2023) features
- no audio output? (Figure 2)
- visual encoding of Gemini models is inspired by our own foundational work on Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a), and PaLI (Chen et al., 2022)
- output images using discrete image tokens (Ramesh et al., 2021; Yu et al., 2022b)
- supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF)
I think this is already more detail than we got from OpenAI about GPT-4, but on the other hand, it's still very little detail.