In Ollama, Gemma:9b works fine, but 27b seems to be producing a lot of nonsense for me. Asking for a bit of python or JavaScript code rapidly devolves into producing code-like gobbledegook, extending for hundreds of lines.
Had a chance to do some testing and it seems quite good on oneshot tasks with a small context window but as you approach context saturation it starts to go way off the rails. Maybe this is an implementation issue? I'm using Q6_K quants of both sizes in ollama. I'll report back if I figure it out.
A larger context window really helps on RAG tasks, it's frustrating that a lot of the foundational models have such small windows.
Sorry about this – working on fixing the issue with hitting the context limit. Gemma 2 supports a 8192 context limit – which can be selected if you provide the `num_ctx` parameter in the API or via `ollama run` with `/set parameter num_ctx 8192`
Thanks! If you have a moment can you give me a quick explainer on what happens when you hit the context limit in ollama? I had assumed that ollama would just trunc the context to whatever is set in the model, but I guess this isn't the case?
Currently when the context limit is hit, there's a halving of the context window (or a "context shift") to allow inference to continue – this is helpful for smaller (e.g. 1-2k) context windows.
However, not all models (especially newer ones) respond well to this, which makes sense. We're working on changing the behavior in Ollama's API to be more similar to OpenAI, Anthropic and similar APIs so that when the context limit is hit, the API returns a "limit" finish/done reason. Hope this is helpful!
Definitely. I tried gemma2:27B model with phrases like "translate the following sentence to language X" and it even failed to understand the task and spat out completely irrelevant things, like math formulas.