Greetings from the Gemma team! We just got Gemma 3 out of the oven and are super...

heinrichf · 2025-03-12T10:14:02 1741774442

I'm comparing Gemma3 12 B (https://ollama.com/library/gemma3; running fully on my 3060 12GB) and Mistral Small 3 24B (https://ollama.com/library/mistral-small; 10% offloaded to the CPU).

- Gemma3 12B: ~100 t/s on prompt eval; 15 t/s on eval

- MistralSmall3 24B: ~500 t/s on prompt eval; 10 t/s on eval

Do you know what different in architecture could make the prompt eval (prefill) so much slower on the 2x smaller Gemma3 model?

alekandreev · 2025-03-12T10:33:10 1741775590

Thank you for the report! We are working with the Ollama team directly and will look into it.

remuskaos · 2025-03-13T16:47:56 1741884476

At what context sizes? I've just run the same prompt and query on my RTX3080 with openwebui as frontend.

When I set the context size to 2048 (openwebui's default), the inference is almost twice as fast as when I set it to 4096. I can't set the conext size any higher because my GPU only has 12GB of RAM and ollama crashes for larger context sizes.

Still, I find that thoroughly odd. Using the larger conetext size (4096), the GPU usage is only 50% as seen in nvtop. I have no idea why.

magicalhippo · 2025-03-12T08:31:49 1741768309

Thanks, been using Gemma 2 a lot at home as it still holds up very well and the 9B version runs great on my 2080Ti. Strong prompt adherence coupled with overall capability makes it very useful. Looking forward to trying Gemma 3.

I have some dumb questions though, might as well ask. How do you decide on the model sizes? And how do you train them? Independently or are they related somehow?

alekandreev · 2025-03-12T08:42:07 1741768927

Picking model sizes is not an exact science. We look for sizes that will fit quantized on different categories on devices (e.g., low-end and high-end smartphone, laptops and 16GB GPUs, and bigger GPUs/TPUs). We also want the ratio of model width to depth (number of layers) to be consistently around 90, which we found works best.

The models are trained with distillation from a bigger teacher. We train them independently, but for v3 we have unified the recipes for 4B-27B, to give you more predictably when scaling up and down to different model sizes.

magicalhippo · 2025-03-12T09:08:04 1741770484

Thanks again, very interesting.

One unexpected (to me) use-case appeared not long ago when I found myself without internet but wanting to fix some non-standard Linux configuration issue. As a Windows guy I tend to web search such things, but local LLM to the rescue!

Even smaller models like Gemma 2 9B has enough compressed knowledge that it managed to help me quickly solve my issue.

This got me thinking how such smaller, but very capable models might be a game-changer in communities where internet might not be available or too expensive for continuous use. It's almost like having a portion of the internet in a box, just add electricity.

alekandreev · 2025-03-12T09:52:53 1741773173

Thank you for the feedback! This is why we are so excited to push more and more on small models for both low end and high end smartphones!

bguberfain · 2025-03-12T16:44:57 1741797897

Can you provide more information about this “bigger teacher” model?

miki123211 · 2025-03-12T08:54:14 1741769654

How good is Gemma at structured output generation, JSON schema compliance and tool use? Particularly the smaller versions, particularly in foreign languages?

We will run our internal evals on it for sure, but just wanted to ask whether that's even a use case that the team considered and trained for.

canyon289 · 2025-03-12T09:03:38 1741770218

Hey, I'm from the Gemma team. There's a couple of angles to your question

We do care about prompted instructions, like json schema, and it is something we eval for and encourage you to try. Here's an example from Gemma2 to guide folks looking to do what it sounds like you're interested in.

https://www.youtube.com/watch?v=YxhzozLH1Dk

Multilinguality was a big focus in Gemma3. Give it a try

And for structured output Gemma works well with many structured output libraries, for example the one built into Ollama

https://github.com/ollama/ollama/blob/main/docs/api.md#struc...

In short you should have all the functionality you need!

refulgentis · 2025-03-13T00:33:46 1741826026

The Ollama stuff is the old llama.cpp stuff that constrains output tokens.

It's great, I've used it to get outputs from as small a model as 1B.

But it's a stark difference in quality from, say, Phi-4's native tool-calling.

If Gemma 3 is natively trained on tool-calling, i.e. y'all are benching on say, Berekley Function Calling leaderboard, that'd be great to know out here.

Tangentially, github.com/ochafik is a Googler who landed an excellent overhaul of llama.cpp's tool-calling, might be worth reaching out to (if you're not working with him already!)

eternityforest · 2025-03-14T08:57:16 1741942636

I notice in my (brief and probably user error filled, I'm an embedded dev, not an AI expert) testing, it(and pretty much every other small model) seems to have trouble interpreting numbers expressed as words when filling out a JSON object like:

{"operator": "*", "command": "calculate", "a": 473, "b": 2848}

You might say something like five thousand fifty six, and it will fill in something like 556 or 5560.

Like as if it is just transferring digits one by one, not using the structure to know about the implicit zero.

Which is very interesting since that seems like a mistake I would make too!

It doesn't do it all the time, and I only know about the ollama quantized version, and I mostly only try the 1B models, and I've seen similar issues with other sub-2B models as well.

The other interesting thing is in a chat, almost every model I've tried seems to interpret the numbers correctly, if you say "what's ten million and fifty times eight" it will start with "10,000,050 x 8 is...".

Sometimes they get the math wrong after that, but the number interpretation is right.

I wonder if there's something special about all "intro text" in the chat mode that is actually acting like reasoning, or if the digit separators(that don't exist in JSON) help them figure out what they're doing?

I wonder if it would be better for some applications to include a line of thoughts/summary/intro in the JSON format constraint?

Other than that I've been really enjoying Gemma3!

seektable · 2025-03-12T13:43:22 1741787002

Just tried gemma3:4b for structured output and it fails with a strange error ( ollama is the latest):

Ollama error: POST predict: Post "http://127.0.0.1:49675/completion": read tcp 127.0.0.1:49677->127.0.0.1:49675: wsarecv: An existing connection was forcibly closed by the remote host.

Not sure this is Ollama or gemma3:4b problem. At the same time, gemma3:12b works fine for the same API request (100% identical, only difference is model id).

seektable · 2025-03-12T14:09:28 1741788568

looks like Ollama's issue: https://github.com/ollama/ollama/issues/9686, https://github.com/ollama/ollama/issues/9687

swyx · 2025-03-12T08:43:19 1741768999

will there ever be a Gemma 3 Thinking? how copyable is the Flash Thinking approach to the Gemma series?

alekandreev · 2025-03-12T13:03:00 1741784580

That's a very interesting area, but nothing we can announce today.

mdp2021 · 2025-03-12T08:45:49 1741769149

Thank you!

Question: your model supports 140 languages. Given that you are focusing on compactness and efficiency, would you not have gains in also developing models on a selected limited number of languages (e.g. the topmost (in cultural production) four "western" ones with shared alphabet - or similar set)?

Edit: of course the multilingual capability can be can be welcome. On the other hand, there are evident cases in which efficiency can be paramount. We can wonder about the tradeoff: how much in efficiency is sacrificed by features.

alekandreev · 2025-03-12T09:33:16 1741771996

That's an idea we've thought about. However, we think the open source community has already created a very impressive set of language or region-specific finetunes [1] [2]. Also there is a lot of cultural and nuance context in every language that we don't have the capacity to cover sufficiently. So for v3 we focused on creating the best foundational multilingual model.

[1] https://huggingface.co/aiplanet/buddhi-indic

[2] https://ai.google.dev/gemma/gemmaverse/sealion

jjani · 2025-03-13T03:44:05 1741837445

Just wanted to say that Gemini 1.5-Pro is still the SOTA foundational model for certain languages (including non-Google models), so it's disappointing to have received the email that it will be removed in September - it will cause our product quality to go backwards when we're forced to replace it by a worse model. Unless a better one appears in that time, but we've extensively tested all big models and for the languages in question, none of them perform on the same level.

Happy to elaborate if there's a way to get in touch, in case the team isn't aware of this.

mdp2021 · 2025-03-12T09:51:52 1741773112

And have you measured the trade-off that could come with embracing such a large number of languages and alphabets? It would be interesting to note whether you are sacrificing some response quality, or if such supposed sacrifice is interestingly negligible, or if - even more interestingly - the quality increases with the added proficiency.

alekandreev · 2025-03-12T14:56:30 1741791390

Yes we have measured the tradeoff. We don't see a drop of perplexity in English when introducing multilingual, and there is a slight drop in some English language-specific evals (~1%).

Workaccount2 · 2025-03-12T14:00:51 1741788051

There are enough small model teams competing that I fell confident one of them will try this, and if it just sticking to english gives a large boost, the others will be forced to follow suite.

It would also kind of suck for non-english speakers, because it will just be another feather in the hat of "English eats the world".

mdp2021 · 2025-03-12T15:00:51 1741791651

Some numbers to try and make an idea: if I understand correctly, Gemma3 uses a fixed (in its versions by size) vocabulary 256k entries big; the smallest 1B version has ~300M embedding parameters and ~700M non-embedding parameters; the largest 27B version has ~5x embedding parameters and ~35x non-embedding parameters.

Multilingualism covering 140 languages is quite a big feat. Gemma3 apparently aims to be compact and efficient. The two goals and features put together raise questions. You wonder for example how much does such extensive multilingualism impact the above numbers, on a benchmark of similar results. It may e.g. be a general question to wonder how much multilingualism complicates an embedding space (owing e.g. to omographic collisions), and the question becomes more prominent when you crammed 140 languages in one model.

> non-english speakers

You would produce more specialized models (where it makes sense): Eng; Eng-Fra-Esp-Deu; Man-Can... For a billion weights per model it could probably be financially acceptable.

sidkshatriya · 2025-03-12T14:32:32 1741789952

As per the technical report, every 5 layers you have a global attention layer. The global attention layer during training can have as many as a 128k context length during training (though I understand it is usually 32k).

Q. When you are training with a context length of 128k, is the attention in the global layers dense or sparse ?

If dense, would the attention memory requirement here would be O(n^2) where n is 128k for each global layer ?

alekandreev · 2025-03-12T14:55:07 1741791307

We never train at 128k, only 32k, changing the scaling factor at the end.

We wanted the long context recipe to be friendly for finetuning, and training at 128k is a bit of a pain we don't do it. For inference, we see inference at 128k with the 5/1 is close to RAM usage for a fully-global-layer model at 32k.

Individual attention layers are always dense.

sidkshatriya · 2025-03-12T15:18:45 1741792725

Thanks for your answer ! So in the 32k global layer, every token attends to each of the other 32k tokens ?

[Edit: You answered the question when you said that individual attention layers are always dense.]

moffkalast · 2025-03-12T13:17:17 1741785437

What's the official take on the system prompt? The technical report doesn't mention it, but the official QAT GGUFs include some form of prepending it to the first user message. Has it been trained with any <start_of_turn>system turns with tool calls and such?

alekandreev · 2025-03-12T14:17:12 1741789032

We recommend using <start_of_turn>user for the system prompt as well.

tucnak · 2025-03-12T14:51:27 1741791087

I was under the impression that the purpose of "system" prompt is to encode the instruction boundary explicitly to reduce the risk of injection. Do you enforce some kind of security invariant that we could rely on? For example, does the alignment regiment include adversarial demonstrations so that out-of-order instruction-following (such as contradicting preceding) is penalised?

werediver · 2025-03-13T12:23:44 1741868624

Is speculative decoding possible across 1/4/12/27 B Gemma 3 variants?

LM Studio doesn't allow that (yet), but maybe the s/w requires some adjustments to support speculative decoding with Gemma 3.

pinglin · 2025-03-24T04:34:49 1742790889

It's reported working but not with LM Studio: https://www.reddit.com/r/LocalLLaMA/comments/1j9reim/comment...

Herring · 2025-03-12T16:57:39 1741798659

Excellent work. What optimizer did you use? I assume AdamW? I didn't see it listed.

saagarjha · 2025-03-12T15:58:44 1741795124

Google is using Greenhouse for ATS now?

nothrowaways · 2025-03-12T19:58:35 1741809515

Is this what powers Gemini?