In Ollama, Gemma:9b works fine, but 27b seems to be producing a lot of nonsense ...

thot_experiment · on June 28, 2024

Had a chance to do some testing, and it seems quite good on one-shot tasks with a small context window, but as you approach context saturation it starts to go way off the rails. Maybe this is an implementation issue? I'm using Q6_K quants of both sizes in ollama. I'll report back if I figure it out.

A larger context window really helps on RAG tasks; it's frustrating that a lot of the foundational models have such small windows.

jmorgan · on June 28, 2024

Sorry about this – working on fixing the issue with hitting the context limit. Gemma 2 supports an 8192-token context window, which you can select by passing the `num_ctx` parameter in the API, or in `ollama run` with `/set parameter num_ctx 8192`.
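For anyone going the API route, here is a minimal sketch of passing `num_ctx` through the `options` field of a request to `/api/generate`. It assumes a local Ollama server on the default port 11434 and the `gemma2:27b` model tag; the helper name `build_generate_request` is made up for illustration.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="gemma2:27b", num_ctx=8192):
    """Build a POST request that asks Ollama to use a num_ctx-token window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object instead of a stream
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Only works if the server is up and the model has been pulled.
    req = build_generate_request("Translate 'good morning' to French.")
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
```

Setting `num_ctx` per request like this only affects that request; the `/set parameter` form does the same thing for an interactive `ollama run` session.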

thot_experiment · on June 28, 2024

Thanks! If you have a moment, can you give me a quick explainer on what happens when you hit the context limit in ollama? I had assumed that ollama would just truncate the context to whatever is set in the model, but I guess this isn't the case?

jmorgan · on June 28, 2024

Currently when the context limit is hit, there's a halving of the context window (or a "context shift") to allow inference to continue – this is helpful for smaller (e.g. 1-2k) context windows.
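To make the halving concrete, here is a toy sketch of the idea in Python. It is not Ollama's or llama.cpp's actual code: the function name `context_shift` and the `num_keep` parameter (a count of leading tokens to preserve, such as a system prompt) are assumptions for illustration, and a real implementation works on the KV cache rather than a plain list.

```python
def context_shift(tokens, num_ctx, num_keep=0):
    """Drop the oldest half of the shiftable tokens once the window is full.

    tokens   -- token ids currently in the context, oldest first
    num_ctx  -- size of the context window
    num_keep -- hypothetical count of leading tokens that are never dropped
    """
    tokens = list(tokens)
    if len(tokens) < num_ctx:
        return tokens  # still room, nothing to do

    kept = tokens[:num_keep]      # always preserved
    rest = tokens[num_keep:]      # candidates for eviction
    discard = len(rest) // 2      # evict the oldest half
    return kept + rest[discard:]


if __name__ == "__main__":
    window = list(range(8))
    print(context_shift(window, num_ctx=8))
    print(context_shift(window, num_ctx=8, num_keep=2))
```

The model then continues generating from a context that is suddenly missing a large chunk of its earlier tokens, which may be part of why some models go off the rails right at the limit.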

However, not all models (especially newer ones) respond well to this, which makes sense. We're working on changing the behavior in Ollama's API to be more similar to OpenAI, Anthropic and similar APIs so that when the context limit is hit, the API returns a "limit" finish/done reason. Hope this is helpful!

brandall10 · on June 27, 2024

27b is working fine for me, hosted on ollama w/ continue.dev in VSCode.

bugglebeetle · on June 27, 2024

The tokenizer in llama.cpp probably needs fixing then or it has some other bug.

0x7cfe · on July 1, 2024

Definitely. I tried the gemma2:27B model with prompts like "translate the following sentence to language X" and it failed to even understand the task, spitting out completely irrelevant things like math formulas.

OTOH, the smaller model did it perfectly.