
I'm not really timing it, as I just use these models via Open WebUI, nvim, and a few things I've made like a Discord bot, with everything going through Ollama.

But for comparison, it is generating tokens about 1.5 times as fast as gemma 3 27B QAT or mistral-small 2506 q4. Prompt processing of the context, however, seems to run at about a quarter of the speed of those models.

To put "excellent" in more concrete terms: once the context is processed, I can't really notice any difference in speed between gpt-oss-120b and Claude Opus 4 via the API.
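
If you do want actual numbers instead of a feel, Ollama's /api/generate response reports token counts and durations, so a short script can print tokens/sec for both prompt processing and generation. A minimal sketch, assuming Ollama on its default port and a placeholder model tag you'd swap for one you've pulled:

    # Rough tokens/sec measurement against a local Ollama server.
    import json
    import urllib.request

    payload = {
        "model": "gpt-oss:120b",  # placeholder tag, use whatever you have pulled
        "prompt": "Explain KV caching in two sentences.",
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)

    # Durations in the response are reported in nanoseconds.
    prompt_tps = result["prompt_eval_count"] / (result["prompt_eval_duration"] / 1e9)
    gen_tps = result["eval_count"] / (result["eval_duration"] / 1e9)
    print(f"prompt processing: {prompt_tps:.1f} tok/s")
    print(f"generation:        {gen_tps:.1f} tok/s")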



I've found threads online suggesting that running gpt-oss-20b on Ollama is slow for some reason. I'm running the 20b model via LM Studio on a 2021 M1 and I'm consistently getting around 50-60 T/s.


Pro tip: disable Open WebUI's title generation feature, or point it at a different model running on another system.

After every chat turn, Open WebUI sends the whole conversation to llama.cpp again, wrapped in a prompt to generate the title, and this wipes out the KV cache, forcing the entire context to be reprocessed on your next message.

This will get rid of the long prompt processing times if you're having long back-and-forth chats with it.
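
A rough sketch of how you could observe this against a llama.cpp server (llama-server on its default port 8080, single slot). The timings field names and cache_prompt behaviour vary by version, so treat those details as assumptions to check against your build:

    # Shows prefix/KV cache reuse, and how a differently-prefixed
    # title-generation request forces the chat context to be reprocessed.
    import json
    import urllib.request

    def completion(prompt):
        payload = {"prompt": prompt, "n_predict": 16, "cache_prompt": True}
        req = urllib.request.Request(
            "http://localhost:8080/completion",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    chat = "User: ...\nAssistant: ...\n" * 400  # stand-in for a long chat history

    a = completion(chat + "User: next question\n")                    # cold: full prompt processed
    b = completion(chat + "User: another question\n")                 # shared prefix: mostly cached
    c = completion("Generate a short title for this chat:\n" + chat)  # different prefix: cache miss
    d = completion(chat + "User: follow-up question\n")               # slow again: title request replaced the slot's cache

    for name, r in [("cold", a), ("same prefix", b), ("title prompt", c), ("after title", d)]:
        t = r.get("timings", {})
        print(f"{name:12s} prompt_n={t.get('prompt_n')} prompt_ms={t.get('prompt_ms')}")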



