
Pretty rough.

In LM Studio, trying to load the model throws an "insufficient system resources" error.

I disabled this error, set the context length to 1024, and was able to get 0.24 tokens per second. For comparison, the 32B distill model gets about 20 tokens per second.

It also became incredibly flaky, using up all available RAM and crashing the whole system a few times.

While the M4 Max 128GB handles the 32B model well, it seems to choke on this one. Here's hoping someone works on something in between (or works out what the ideal settings are, because nothing I fiddled with helped much).



There's a terminal command to increase the maximum VRAM macOS can use; you could try that, since you're probably going over the limit and the system is falling back to treating the memory as regular system RAM. (I ran into this problem a couple of times using ollama.)
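
If it helps, I believe the knob on recent macOS versions is the iogpu.wired_limit_mb sysctl (older releases reportedly used debug.iogpu.wired_limit). The value below is just an example that allows roughly 96 GB of the 128 GB to be wired for the GPU:

    # raise the GPU wired-memory limit to ~96 GB (value in MB); resets on reboot
    sudo sysctl iogpu.wired_limit_mb=98304

    # check the current value
    sysctl iogpu.wired_limit_mb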


I had the same problem (I compiled llama.cpp myself). I switched it to run entirely on the CPU, I think (number of layers offloaded to the GPU set to 0), and it went up to 1.8 tokens per second. I think it can go up much more.
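
For anyone wanting to reproduce that: with a recent llama.cpp build, something along these lines should keep every layer on the CPU (the binary name and flags can drift between versions, and the model path is just a placeholder):

    # -ngl 0 offloads zero layers to the GPU; -c sets the context length; -t the CPU thread count
    ./llama-cli -m ./DeepSeek-R1-Q4_K_M.gguf -ngl 0 -c 1024 -t 12 -p "Hello"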


Maybe vLLM is better at MoE inference (you can also set the number of experts to use).

In theory, half of the model fits in RAM, so it should be GPU-limited if the memory management is smart.
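
If anyone wants to try that route, the usual entry point is vllm serve; keep in mind vLLM mainly targets Linux with CUDA/ROCm GPUs and its CPU/Apple Silicon support is experimental, so treat this as a rough sketch (the model name and context length below are just examples):

    # serve the model over an OpenAI-compatible API; --max-model-len caps the context window
    vllm serve deepseek-ai/DeepSeek-R1 --max-model-len 4096 --trust-remote-code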



