I'm a little confused about how these models fit into VRAM. I have 32 GB of system RAM and 16 GB of VRAM. I can fit the 20b model entirely in VRAM, but then I can't increase the context window past 8k tokens or so. Trying to max out the context size makes it run out of VRAM. Can't it use my system RAM as a backup, though?
Yet I see other people with fewer resources, like 10 GB of VRAM and 32 GB of system RAM, fitting the 120b model onto their hardware.
Perhaps it's because ROCm isn't really supported by ollama for the RDNA4 architecture yet? I believe I'm currently running on Vulkan, and it seems to use my CPU more than my GPU at the moment. Maybe I should just ask it all this.
I'm not complaining too much because it's still amazing I can run these models. I just like pushing the hardware to its limit.
It seems you'll have to offload more and more layers to system RAM as your maximum context size increases. llama.cpp has an option to set the number of layers that should be computed on the GPU, whereas ollama tries to tune this automatically. Ideally, though, the system RAM/VRAM split would simply be readjusted dynamically as the context grows throughout the session. After all, some sessions may never reach the maximum size, so allowing for a higher maximum ends up leaving valuable VRAM unused during shorter sessions.
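To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is not how llama.cpp or ollama actually decide the split. It assumes plain full attention with an fp16 KV cache, and that each layer's share of the cache sits next to that layer's weights. The function names and all the model numbers (24 layers, 8 KV heads, head size 64, 512 MiB of weights per layer) are made up for illustration and do not describe any real model.

```python
MIB = 2**20
GIB = 2**30


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size: one key and one value tensor per layer,
    each holding n_tokens * n_kv_heads * head_dim elements."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def layers_on_gpu(vram_bytes, per_layer_weight_bytes, n_layers, n_tokens,
                  n_kv_heads, head_dim, bytes_per_elem=2, overhead_bytes=0):
    """Estimate how many layers fit in VRAM if every GPU layer needs its
    weights plus its own slice of the KV cache (simplifying assumption)."""
    per_layer_kv = kv_cache_bytes(n_tokens, 1, n_kv_heads, head_dim, bytes_per_elem)
    budget = vram_bytes - overhead_bytes
    if budget <= 0:
        return 0
    return int(min(n_layers, budget // (per_layer_weight_bytes + per_layer_kv)))


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to keep the arithmetic readable.
    shape = dict(n_kv_heads=8, head_dim=64)
    for ctx in (8192, 131072):
        cache = kv_cache_bytes(ctx, n_layers=24, **shape)
        fit = layers_on_gpu(16 * GIB, 512 * MIB, 24, ctx, **shape)
        print(f"context {ctx}: KV cache {cache / GIB:.3f} GiB, GPU layers {fit}/24")
```

With these made-up numbers the cache grows from 384 MiB at 8k tokens to 6 GiB at 128k, and the number of layers that fit in 16 GiB drops from all 24 to 21, so the rest would have to run from system RAM. Real runtimes also need room for compute buffers and may quantize the cache, so treat the output as an order-of-magnitude guide only.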
Ah, I see, interesting. I'll have to play around with this more. I switched from Nvidia to AMD and have found that AMD support is still rolling out for these new cards. I could only get LM Studio working so far, but I'd like to try out more front ends.
Not a major setback, because for long context I'd just use GPT or Claude, but it would be cool to have 128k context locally on my machine. When I get a new CPU I'll upgrade my RAM to 64 GB. My GPU is more than capable of what I need for a while, and a 5090 or 4090 is the next step up in VRAM, but I don't want to shell out 2k for a card.