Looks like someone has got DBRX running on an M2 Ultra already: https://x.com/aw...

resource_waste · on March 27, 2024

I find 500 tokens considered 'running' a stretch.

Cool to play with for a few tests, but I can't imagine using it for anything.

irusensei · on March 28, 2024

I can run a certain 120b on my M3 max with 128GB memory. However I found that while it “fits” Q5 was extremely slow. The story was different with Q4 though which ran just fine around ~3.5-4 t/s.

Now this model is ~134B right? It could be bog slow but on the other hand its a MoE so there might be a chance it could have satisfactory results.

marci · on March 28, 2024

From the article, should have the speed of a ~36b.

Mandelmus · on March 27, 2024

And it appears to be at ~80 GB of RAM via quantisation.

smcleod · on March 27, 2024

So that would be runnable on a MBP with a M2 Max, but the context window must be quite small, I don’t really find anything under about 4096 that useful

a_wild_dandan · on March 28, 2024

Can't wait to try this on my MacBook. I'm also just amazed at how wasteful Grok appears to be!

dheera · on March 27, 2024

That's a tricky number. Does it run on an 80GB GPU, does it auto-shave some parameters to fit in 79.99GB like any articifially "intelligent" piece of code would do, or does it give up like an unintelligent piece of code?

Jedd · on March 28, 2024

Are you aware how Macs present memory? Their 'unified' memory approach means you could run an 80GB model on a 128GB machine.

There's no concept of 'dedicated GPU memory' as per conventional amd64 arch machines.

declaredapple · on March 27, 2024

What?

Are you asking if the framework automatically quantizes/prunes the model on the fly?

Or are you suggesting the LLM itself should realize it's too big to run, and prune/quantize itself? Your references to "intelligent" almost leads me to the conclusion that you think the LLM should prune itself. Not only is this a chicken and egg problem, but LLMs are statistical models, they aren't inherently self bootstraping.

dheera · on March 27, 2024

I realize that, but I do think it's doable to bootstrap it on a cluster and teach itself to self-prune, and surprised nobody is actively working on this.

I hate software that complains (about dependencies, resources) when you try to run it and I think that should be one of the first use cases for LLMs to get L5 autonomous software installation and execution.

Red_Leaves_Flyy · on March 27, 2024

Make your dreams a reality!

lobocinza · on April 3, 2024

Worst is software that doesn't complain but fails silently.

2099miles · on March 28, 2024

The LLM itself should realize it’s too big and only put the important parts on the gpu. If you’re asking questions about literature there’s no need to have all the params on the gpu, just tell it to put only the ones for literature on there.

madiator · on March 27, 2024

That's great, but it did not really write the program that the human asked it to do. :)

SparkyMcUnicorn · on March 27, 2024

That's because it's the base model, not the instruct tuned one.