It really feels like the reason this is being released now and not months ago is that that's how long it took them to figure out the convoluted combination of different evaluation procedures to beat GPT-4 on the various benchmarks.
"Dearest LLM: Given the following raw benchmark metrics, please compose an HTML table that cherry-picks and highlights the most favorable result in each major benchmark category"
Even not having a moat anymore, with their cash they might still be the biggest search provider 10 years from now. IBM still exists and is worth 146B. I wouldn't be surprised if Google still came out ok.
Assuming they use unique data that only they have to make a better LLM, then everyone is going to leech training examples from them, bringing competition asymptotically closer but never quite reaching it. It's hard to copy-protect a model exposed to the public, as OpenAI is finding out.
Many, many tasks can be executed on local GPUs today without paying a dime to OpenAI, there is no moat. AI likes to learn from other AIs. Give me a million hard problems solved step by step with GPT-5 and I can make Mistral much smarter. Everyone knows this dataset is going to leak in a few months.
Why is that misleading? It shows Gemini with CoT is the best known combination of prompt and LLM on MMLU.
They simply compare the prompting strategies that work best with each model. Otherwise it would be just a comparison of their response to specific prompt engineering.
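To make the distinction concrete: the two strategies differ only in the scaffolding wrapped around the question. A minimal sketch (these prompt templates are illustrative, not the ones actually used in the report):

```python
QUESTION = "If a train travels 60 km in 45 minutes, what is its speed in km/h?"

def direct_prompt(question: str) -> str:
    # Plain answer-only prompting: ask for the answer with no reasoning scaffold.
    return f"Question: {question}\nAnswer:"

def cot_prompt(question: str) -> str:
    # Chain-of-thought prompting: instruct the model to reason before answering.
    return (f"Question: {question}\n"
            "Let's think step by step, then state the final answer.")

print(direct_prompt(QUESTION))
print(cot_prompt(QUESTION))
```

Scores reported under one scaffold aren't directly comparable to scores reported under the other, which is the crux of the disagreement here.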
The places where they use the same methodology seem within the error bars of the cherry picked benchmarks they selected. Maybe for some tasks it's roughly comparable to GPT4 (still a major accomplishment for Google to come close to closing the gap for the current generation of models), but this looks like someone had the goal of showing Gemini beating GPT4 in most areas and worked back from there to figure out how to get there.
Yep, at this point I'd rather they hold their announcements until everybody can access it, not just the beautiful people. I'm excited and want to try it right now, and would actually use it for a PoC I have in mind, but in a few months the excitement will be gone.
It's to their detriment, also. Being told Gemini beats GPT-4 while withholding that what I'm trying out is not the model they're talking about would have me think they're full of crap. They'd be better off making it clear that this is not the one that surpasses GPT-4.
It really is. OpenAI has the Apple model of release - when it's announced, the laptop is in your freaking hands 3 days later.
Google announces vaporware that's never going to come out, or something that will be out in 5 months. It's frustrating and very bad for their image in the LLM space.
This might be the best they can do to maintain any hope among nervous investors. That this may actually be the most rational play available to them would be incredibly sad.
I wonder if the "release" was done in spite of dev knowledge that it isn't really ready. Like "screw it, we want to attract eyeballs even though we know it's premature"
Isn't having a GPT 3.5-level model still a pretty big deal? Obviously they are behind, but does anyone else offer that?
3.5 is still highly capable, and Google investing a lot into making it multimodal, combined with potential integration with their other products, makes it quite valuable. Not everyone likes having to switch to ChatGPT for queries.
Yeah, right now the leaderboard is pretty much: GPT4 > GPT 3.5 > Claude > Llama2. If Google just released something (Gemini Pro) on par with GPT 3.5 and will release something (Gemini Ultra) on par with GPT 4 in Q1 of next year while actively working on Gemini V2, they are very much back in the game.
I'd have to disagree a bit - Claude 2 is better than 3.5 in my experience (maybe in benchmarks too, I haven't searched for them specifically), but worse than GPT-4
> Yeah, right now the leaderboard is pretty much: GPT4 > GPT 3.5 > Claude > Llama2.
Is it though? I mean, free (gratis) public locally-usable models are more than just "Llama2", and Llama2 itself is pretty far down the HuggingFace open model leaderboard. (It's true a lot of the models above it are Llama2 derivatives, but that's not universally true, either.)
Measuring LLM quality is problematic (and may not even be meaningful in a general sense: the idea that there is a measurable strict ordering of general quality that is applicable to all use cases, or even strongly predictive of utility for particular uses, may be erroneous).
If you trust Winogrande scores (one of the few where I could find GPT3.5 and GPT4 [0] ratings that is also on the HuggingFace leaderboard [1]), there are a lot of models between GPT3.5 and GPT4 with some of them being 34B parameter models (Yi-34b and its derivatives), and una_cybertron_7b comes close to GPT3.5.
It depends on what's being evaluated, but from what I've read, Mistral is also fairly competitive at a much smaller size.
One of the biggest problems right now is that there isn't really a great way to evaluate the performance of models, which (among other issues) results in every major foundation model release claiming to be competitive with the SOTA.
If you think eval numbers mean a model is close to 4, then you clearly haven't been scarred by the legions of open source models which claim 4-level evals but clearly struggle to actually perform challenging work as soon as you start testing
Perhaps Gemini is different and Google has tapped into their own OpenAI-like secret sauce, but I'm not holding my breath
Ehhh not really, it even loses to 3.5 on 2/8 tests. For me it feels pretty lackluster considering I'm using GPT-4 probably close to 100 times or more a day and it would be a huge downgrade.
Pro is approximately in the middle between GPT 3.5 and GPT 4 on four measures (MMLU, BIG-Bench-Hard, Natural2Code, DROP), closer to 3.5 on two (MATH, Hellaswag), and closer to 4 on the remaining two (GSM8K, HumanEval). Two one way, two the other way, and four in the middle.
So it's a split almost right down the middle, if anything closer to 4, at least if you assume the benchmarks to be of equal significance.
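That bookkeeping can be made explicit. A sketch with placeholder scores (the numbers and band thresholds below are invented for illustration, not the report's actual benchmark results):

```python
def closer_to(pro: float, gpt35: float, gpt4: float) -> str:
    """Classify whether a Pro score sits nearer GPT 3.5, nearer GPT 4,
    or in a middle band between them."""
    # Normalize Pro's position on the 3.5-to-4 interval: 0.0 = at 3.5, 1.0 = at 4.
    position = (pro - gpt35) / (gpt4 - gpt35)
    # The 0.4/0.6 cutoffs for the "middle" band are arbitrary illustrative choices.
    if position < 0.4:
        return "closer to 3.5"
    if position > 0.6:
        return "closer to 4"
    return "middle"

# Placeholder scores: benchmark -> (Pro, GPT 3.5, GPT 4). Illustrative only.
examples = {"bench_a": (60.0, 50.0, 80.0), "bench_b": (75.0, 50.0, 80.0)}
for name, (pro, g35, g4) in examples.items():
    print(name, closer_to(pro, g35, g4))
```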
> at least if you assume the benchmarks to be of equal significance.
That is an excellent point. Performance of Pro will definitely depend on the use case given the variability between 3.5 to 4. It will be interesting to see user reviews on different tasks. But the 2 quarter lead time for Ultra means it may as well not be announced. A lot can happen in 3-6 months.
I hate this "tierification" of products into categories: normal, pro, max, ultra
Apple does this and it's obvious that they do it to use the "decoy effect" when customers want to shop. Why purchase a measly regular iPhone when you can spend a little more and get the Pro version?
But when it comes to AI, this tierification only leads to disappointment—everyone expects the best models from the FAANGO (including OpenAI), no one expects Google or OpenAI to offer shitty models that underperform their flagships when you can literally run Llama 2 and Mistral models that you can actually own.
I don't understand -- these are all literally tied directly to performance.
They're tiers of computing power and memory. More performance costs more money to produce. The "nano" can fit on a phone, while the others can't.
Are you really objecting to the existence of different price/performance tiers...? Do you object to McDonald's selling 3 sizes of soft drink? There's nothing "decoy" about any of this.
> Do you object to McDonald's selling 3 sizes of soft drink?
Yes, actually, for different reasons - McDonald’s charges only a tiny bit more for the largest size of drink than they do for the smallest (which is easy because soft drinks are a few cents’ worth of syrup and water, and the rest is profit). That pushes people toward huge drinks, which means more sugar, more caffeine, and more addiction.
No, it’s not just to use the “decoy effect.” They do this to share development costs across a whole product line. Low volume, expensive products are subsidized by high volume, mass market devices. Without these tiers, they’d be unable to differentiate the products and so lose the margins of the high end products (and their entire reason for existing).
Unless you expect Apple to just sell the high end devices at a loss? Or do you want the high end chips to be sold in the mass market devices and for Apple to just eat the R&D costs?
> They do this to share development costs across a whole product line. Low volume, expensive products are subsidized by high volume, mass market devices
Usually it’s the other way around. Mass market products have thin margins and are subsidized by high end / B2B products because the customers for those products have infinitely deep pockets.
> Or do you want the high end chips to be sold in the mass market devices and for Apple to just eat the R&D costs?
That's literally what Steve Jobs was steadfast about :). One iPhone for everyone. He even insisted on the Plus models carrying no extra features.
> Usually it’s the other way around. Mass market products have thin margins and are subsidized by high end / B2B products because the customers for those products have infinitely deep pockets.
That's usually what I've seen, but the M1 MacBook Air came out first and the M1 Pro and Max came out much later.
That's commonly caused by things like low yields for the highest end devices/binning not allowing them to make the numbers of the high end products they need.
> Large AI models have tight resources requirements. You physically can't use X billion parameters without ~X billion ~bytes of memory.
Well, X billion parameters times the parameter bit size, divided by eight for bytes. For base models, parameters are generally 32-bit (so 4X bytes), though smaller quantizations are possible and widely used for public models, and I would assume as a cost measure for closed hosted models as well.
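A rough sketch of that arithmetic (the 70B parameter count is an arbitrary illustrative size, not any particular released model):

```python
def model_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory needed just to hold the weights (decimal GB)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

# A hypothetical 70B-parameter model at common precisions:
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{model_memory_gb(70, bits):.0f} GB")
```

At 32-bit that's ~280 GB for the weights alone, which is why 16-bit, 8-bit, and 4-bit quantizations matter so much for serving cost.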
Tierification of AI models is not some business strategy, it is a necessary consequence of the reality that AI is massively compute constrained right now. The size of a model is extremely important for inference time and cost. It just doesn't make sense to release one single model when your method will always yield a family of models with increasing size. The customer can choose a model corresponding to their needs.
I think the expensive ones are used when the customer is the user — e.g. ChatGPT Plus (personal) subscription — and the cheap ones when they are not — e.g. customer support service bots.
I'm honestly 100% okay with it as long as it's reasonable and not confusing to customers. (Not saying Apple isn't somewhat; I mean, buying a non-Pro iPhone 15 and not being able to view WebM files feels literally fucking insane, and that's apparently how that works, but that's a rant for a different thread.) In cases like this, presumably the idea isn't actually feature-gating, it's scaling up. AI inference costs compute time, and although I have no idea if the inference occurs on special hardware or not, if it does, I can only presume that scaling up the special hardware to meet demand is challenging and very much not like scaling up e.g. a typical web service.
IMO, Tiers can be useful when they make sense and aren't just for artificial market segmentation.
My guess is they're branding it in this way to obfuscate the number of parameters used, which makes sense because more parameters doesn't necessarily mean a better model. It's kind of like the "number of bits" competition in video game consoles back in the 90s.
I think it depends. It's always worth having a small, fast model for some tasks, one you can run completely offline on a mobile CPU. Maybe not as a chat companion, but for text understanding or indexing all your messages and photos for search, it may be enough.
Technical paper: https://goo.gle/GeminiPaper
Some details:
- 32k context length
- efficient attention mechanisms (e.g., multi-query attention (Shazeer, 2019))
- audio input via Universal Speech Model (USM) (Zhang et al., 2023) features
- no audio output? (Figure 2)
- visual encoding of Gemini models is inspired by our own foundational work on Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a), and PaLI (Chen et al., 2022)
- output images using discrete image tokens (Ramesh et al., 2021; Yu et al., 2022b)
- supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF)
I think this is already more detail than we got from OpenAI about GPT-4, but on the other hand, it's still very little detail.