
Looks like someone has got DBRX running on an M2 Ultra already: https://x.com/awnihannun/status/1773024954667184196?s=20
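
For anyone wanting to try the same thing, a minimal sketch using the mlx-lm Python API (the repo id below is a placeholder, not a confirmed checkpoint name):

    # Minimal sketch, assuming `pip install mlx-lm` on an Apple Silicon Mac
    # and some quantized DBRX conversion available on the Hugging Face hub.
    from mlx_lm import load, generate

    # Placeholder repo id -- substitute whatever converted checkpoint you use.
    model, tokenizer = load("mlx-community/dbrx-base-4bit")
    text = generate(model, tokenizer,
                    prompt="Write a function that reverses a string.",
                    max_tokens=200, verbose=True)
    print(text)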


I find calling 500 tokens 'running' a stretch.

Cool to play with for a few tests, but I can't imagine using it for anything.


I can run a certain 120B on my M3 Max with 128GB of memory. However, I found that while it “fits” at Q5, it was extremely slow. The story was different with Q4, though, which ran just fine at ~3.5-4 t/s.

Now this model is ~134B, right? It could be bog slow, but on the other hand it's a MoE, so there's a chance it could give satisfactory results.


From the article, it should have the speed of a ~36B model.


And it appears to fit in ~80 GB of RAM via quantisation.
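
Back-of-envelope with the published figures (132B total parameters, ~36B active per token), which is roughly where both numbers come from:

    # Rough estimate only; real usage also includes the KV cache, activations,
    # and per-block quantization overhead, which pushes ~66 GB toward ~80 GB.
    total_params  = 132e9   # total parameters
    active_params = 36e9    # parameters active per token (MoE routing)
    bits_per_weight = 4     # e.g. a 4-bit quant

    weights_gb = total_params * bits_per_weight / 8 / 1e9
    print(f"~{weights_gb:.0f} GB for the weights alone")  # ~66 GB

    # Per-token compute scales with the ~36B active parameters,
    # which is why it should feel closer to a dense ~36B model in speed.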


So that would be runnable on an MBP with an M2 Max, but the context window must be quite small; I don't really find anything under about 4096 tokens that useful.


Can't wait to try this on my MacBook. I'm also just amazed at how wasteful Grok appears to be!


That's a tricky number. Does it run on an 80GB GPU, does it auto-shave some parameters to fit in 79.99GB like any artificially "intelligent" piece of code would do, or does it give up like an unintelligent piece of code?


Are you aware of how Macs present memory? Their 'unified' memory approach means you could run an 80GB model on a 128GB machine.

There's no concept of 'dedicated GPU memory' as on conventional amd64 machines.
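
If it helps, a minimal sketch for checking the size of that shared pool from Python on macOS (note the OS caps how much of it the GPU may wire by default, so the usable amount for a model is somewhat below the full figure):

    # Minimal sketch: report total unified memory on macOS.
    # `hw.memsize` is physical RAM; CPU and GPU share this pool on Apple Silicon.
    import subprocess

    mem_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]).strip())
    print(f"Unified memory: {mem_bytes / 1e9:.0f} GB")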


What?

Are you asking if the framework automatically quantizes/prunes the model on the fly?

Or are you suggesting the LLM itself should realize it's too big to run, and prune/quantize itself? Your references to "intelligent" almost lead me to the conclusion that you think the LLM should prune itself. Not only is this a chicken-and-egg problem, but LLMs are statistical models; they aren't inherently self-bootstrapping.


I realize that, but I do think it's doable to bootstrap it on a cluster and have it teach itself to self-prune, and I'm surprised nobody is actively working on this.

I hate software that complains (about dependencies, resources) when you try to run it, and I think that should be one of the first use cases for LLMs: getting to L5 autonomous software installation and execution.


Make your dreams a reality!


The worst is software that doesn't complain but fails silently.


The LLM itself should realize it's too big and only put the important parts on the GPU. If you're asking questions about literature, there's no need to have all the params on the GPU; just tell it to put only the ones for literature on there.
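
For what it's worth, that's roughly what MoE routing already does at the compute level, just per token rather than per topic. A toy sketch of the idea (not DBRX's actual implementation, and current frameworks don't page experts on and off the GPU for you like this):

    # Toy MoE routing sketch: only the top-k experts chosen by the router
    # run for a given token, so most parameters sit idle at any moment.
    # DBRX reportedly uses 16 experts with 4 active per token.
    import numpy as np

    num_experts, k, hidden = 16, 4, 8
    rng = np.random.default_rng(0)

    experts = [rng.standard_normal((hidden, hidden)) for _ in range(num_experts)]
    router  = rng.standard_normal((hidden, num_experts))

    def moe_forward(x):
        scores = x @ router                   # router score per expert
        top_k = np.argsort(scores)[-k:]       # indices of the k best experts
        weights = np.exp(scores[top_k])
        weights /= weights.sum()              # softmax over the selected experts
        # Only these k expert matrices need to be in fast memory right now.
        return sum(w * (x @ experts[i]) for w, i in zip(weights, top_k))

    print(moe_forward(rng.standard_normal(hidden)).shape)  # (8,)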


That's great, but it did not really write the program that the human asked for. :)


That's because it's the base model, not the instruct-tuned one.



