
Pretty rough.

In LM Studio, trying to load the model throws an "insufficient system resources" error.

I disabled this error, set the context length to 1024, and was able to get 0.24 tokens per second. For comparison, the 32B distill model gets about 20 tokens per second.

It also became incredibly flaky, using up all available RAM and crashing the whole system a few times.

While the M4 Max 128GB handles the 32B model well, it seems to choke on this one. Here's hoping someone works on something in between (or works out what the ideal settings are, because nothing I fiddled with helped much).



There's a terminal command to increase the maximum VRAM macOS can use; you could try that, since you're probably going over the limit and the system is falling back to treating the memory as regular system RAM. (I ran into this problem a couple of times using ollama.)
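
If it helps, I believe the knob on recent macOS versions is the iogpu.wired_limit_mb sysctl (older releases reportedly used debug.iogpu.wired_limit). The value below is just an example that allows roughly 96 GB of the 128 GB to be wired for the GPU:

    # raise the GPU wired-memory limit to ~96 GB (value in MB); resets on reboot
    sudo sysctl iogpu.wired_limit_mb=98304

    # check the current value
    sysctl iogpu.wired_limit_mb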


I had the same problem (I compiled llama.cpp myself). I switched it to run entirely on the CPU, I think (number of layers offloaded to the GPU set to 0), and it went up to 1.8 tokens per second. I think it can go up much more.
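
For anyone wanting to reproduce that: with a recent llama.cpp build, something along these lines should keep every layer on the CPU (the binary name and flags can drift between versions, and the model path is just a placeholder):

    # -ngl 0 offloads zero layers to the GPU; -c sets the context length; -t the CPU thread count
    ./llama-cli -m ./DeepSeek-R1-Q4_K_M.gguf -ngl 0 -c 1024 -t 12 -p "Hello"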


Maybe vLLM is better at MoE inference (you can also set the number of experts to use).

In theory, half of the model fits in RAM, so it should be GPU-limited if the memory management is smart.
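
If anyone wants to try that route, the usual entry point is vllm serve; keep in mind vLLM mainly targets Linux with CUDA/ROCm GPUs and its CPU/Apple Silicon support is experimental, so treat this as a rough sketch (the model name and context length below are just examples):

    # serve the model over an OpenAI-compatible API; --max-model-len caps the context window
    vllm serve deepseek-ai/DeepSeek-R1 --max-model-len 4096 --trust-remote-code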



