I love the original DeepSeek model, but the distilled versions are usually too dumb.
Apart from being dumber, they also don't know as much as R1. I can see how fine-tuning can improve reasoning capability (by showing examples of good CoT), but there's no reason it would improve factual knowledge (relative to the Qwen or Llama base model the fine-tune started from).
In LM Studio, trying to load the model throws an "insufficient system resources" error.
I disabled that check, set the context length to 1024, and got 0.24 tokens per second. For comparison, the 32B distill gets about 20 tokens per second.
It also became incredibly flaky, using up all the available RAM and crashing the whole system a few times.
The M4 Max with 128GB handles the 32B well, but it chokes on this one. Here's hoping someone releases something in between (or figures out the ideal settings, because nothing I fiddled with helped much).
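If anyone else tries this, it helps to confirm whether it's actually swapping rather than computing. On macOS I watch memory while the model loads with something like the following (the interval is just an example, and the exact output fields vary by macOS version):

    # print VM statistics every second; watch the swap-related counters climb
    vm_stat 1
    # show how much swap is currently in use
    sysctl vm.swapusage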
There's a terminal command to increase the maximum VRAM macOS will let the GPU use; it's worth trying, since you're probably going over the default limit and the system is falling back to treating the model as regular system RAM. (I ran into this problem a couple of times using ollama.)
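For anyone looking for it: I believe the command being referred to is the iogpu wired-limit sysctl (Apple Silicon, recent macOS). The exact key name and the value below are from memory, so double-check before relying on them:

    # let the GPU wire up to ~120 GB of the 128 GB unified memory (value is in MB)
    sudo sysctl iogpu.wired_limit_mb=122880
    # note: this resets to the default on reboot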
I had the same experience (compiled llama.cpp myself). I switched it to all-CPU, I think (number of layers on GPU set to 0), and it went up to 1.8 tokens per second. I think it can go much higher.
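In case it helps anyone reproduce the CPU-only run: with llama.cpp it's just a matter of setting the GPU layer count to zero. The model filename below is a placeholder, and older builds name the binary main instead of llama-cli:

    # offload 0 layers to the GPU (-ngl 0) and keep a small context (-c 1024)
    ./llama-cli -m deepseek-r1-Q4_K_M.gguf -ngl 0 -c 1024 -p "Hello"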
I'm excited to try my own queries on it.