I love the original DeepSeek model, but the distilled versions are usually too dumb.
Apart from being dumber, they also don't know as much as R1. I can see how fine-tuning can improve reasoning capability (by showing examples of good CoT), but there's no reason it would improve factual knowledge (relative to the Qwen or Llama base model the fine-tune started from).
In LM Studio, trying to load the model throws an "insufficient system resources" error.
I disabled that check, set the context length to 1024, and got 0.24 tokens per second. For comparison, the 32B distill gets about 20 tokens per second.
It also became incredibly flaky, using up all the available RAM and crashing the whole system a few times.
The M4 Max with 128GB handles the 32B well, but it chokes on this one. Here's hoping someone releases something in between (or figures out the ideal settings, because nothing I fiddled with helped much).
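If anyone else tries this, it helps to confirm whether it's actually swapping rather than computing. On macOS I watch memory while the model loads with something like the following (the interval is just an example, and the exact output fields vary by macOS version):

    # print VM statistics every second; watch the swap-related counters climb
    vm_stat 1
    # show how much swap is currently in use
    sysctl vm.swapusage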
There's a terminal command to increase the maximum VRAM macOS will let the GPU use; it's worth trying, since you're probably going over the default limit and the system is falling back to treating the model as regular system RAM. (I ran into this problem a couple of times using ollama.)
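For anyone looking for it: I believe the command being referred to is the iogpu wired-limit sysctl (Apple Silicon, recent macOS). The exact key name and the value below are from memory, so double-check before relying on them:

    # let the GPU wire up to ~120 GB of the 128 GB unified memory (value is in MB)
    sudo sysctl iogpu.wired_limit_mb=122880
    # note: this resets to the default on reboot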
I had the same experience (compiled llama.cpp myself). I switched it to all-CPU, I think (number of layers on GPU set to 0), and it went up to 1.8 tokens per second. I think it can go much higher.
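In case it helps anyone reproduce the CPU-only run: with llama.cpp it's just a matter of setting the GPU layer count to zero. The model filename below is a placeholder, and older builds name the binary main instead of llama-cli:

    # offload 0 layers to the GPU (-ngl 0) and keep a small context (-c 1024)
    ./llama-cli -m deepseek-r1-Q4_K_M.gguf -ngl 0 -c 1024 -p "Hello"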
I'm excited to try my own queries on it.