I'm a little confused how these models run/fit onto VRAM. I have 32gb system RAM...

zozbot234 · 2025-08-11T18:02:10 1754935330

It seems you'll have to offload more and more layers to system RAM as your maximum context size increases. llama.cpp has an option to set the number of layers that should be computed on the GPU, whereas ollama tries to tune this automatically. Ideally though, it would be nice if the system ram/vram split could simply be readjusted dynamically as the context grows throughout the session. After all, some sessions may not even reach maximum size so trying to allow for a higher maximum ends up leaving valuable VRAM space unused during shorter sessions.

leach · 2025-08-11T21:17:08 1754947028

Ah I see interesting, I'll have to play around with this more. I switched from Nvidia to AMD and have found AMD support to still be rolling out for these new cards. I could only get LM studio working so far but I'd like to try out more front ends.

Not a major setback because for long context I'd just use GPT or claude, but it would be cool to have 128k context locally on my machine. When I get a new CPU I'll upgrade RAM to 64, my GPU is more than capable of what I need for a while and a 5090 or 4090 is the next step up in VRAM but I don't want to shell out 2k for a card.