
In general, how do you run these big models on cloud hardware? Do you cut them up layer-wise and run slices of layers on individual A100/H100s?



My understanding is with MoE (Mixture of Experts), you can and should shard it horizontally. The whole model is 600GB, but only 37GB is active during the evaluation of any single output token.

So you can load a different active subset of the MoE into each 80GB GPU, sharding it across something like 32 different GPUs (or can you get away with fewer? Wouldn't be surprised if they can run inference on 8x H800 GPUs). Some parameters are common, others are independent. Queries can be dynamically routed between GPUs, potentially bouncing between GPUs as often as once per output token, depending on which experts they need to activate.

Though, I suspect it's normal to stick on one MoE subset for several output tokens.

This has the secondary benefit that, as long as the routing distribution is roughly uniform, queries should be roughly load balanced across all GPUs.
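
To make that concrete, here's a minimal sketch with made-up numbers and a deliberately naive contiguous placement (real serving stacks place and balance experts much more carefully):

    # Toy sketch of expert-parallel placement (illustrative, not DeepSeek's
    # real config). 256 experts over 32 GPUs = 8 experts per GPU; a token
    # that activates 8 experts can touch almost as many distinct GPUs.
    NUM_EXPERTS = 256
    NUM_GPUS = 32
    EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS   # 8

    def gpu_for_expert(expert_id: int) -> int:
        # Contiguous placement: experts 0-7 on GPU 0, 8-15 on GPU 1, ...
        return expert_id // EXPERTS_PER_GPU

    def gpus_touched(active_expert_ids):
        # Which GPUs one token's activated experts live on.
        return sorted({gpu_for_expert(e) for e in active_expert_ids})

    # Example: the router picked these 8 experts for one token.
    print(gpus_touched([3, 17, 42, 99, 100, 181, 200, 255]))
    # -> [0, 2, 5, 12, 22, 25, 31]  (7 distinct GPUs for this one token)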


Each MoE layer has its own router, and it activates 8 (out of 256) experts at a time. There's no reason to expect all of them to stay on the same GPU, so you're pretty much guaranteed to have to do all-to-all communication between the GPUs in your cluster after every layer for every token.
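
Rough sketch of that per-layer dispatch, with made-up placement and shapes, just to show why the (token, expert) pairs have to be scattered across GPUs at every MoE layer:

    # Every MoE layer routes each token to its own top-8 of 256 experts,
    # and those experts can live on different GPUs, so the (token, expert)
    # pairs get scattered all-to-all before the expert FFNs and gathered
    # back afterwards - once per MoE layer.
    import numpy as np

    NUM_EXPERTS, TOP_K, NUM_GPUS = 256, 8, 32
    EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS

    def route(router_logits):                    # (tokens, 256)
        # Per-token top-8 expert ids, for this layer only.
        return np.argpartition(router_logits, -TOP_K, axis=-1)[:, -TOP_K:]

    def dispatch_plan(expert_ids):               # (tokens, 8)
        # Group (token, expert) pairs by the GPU owning each expert --
        # this is the payload of the all-to-all at every MoE layer.
        plan = {g: [] for g in range(NUM_GPUS)}
        for tok, ids in enumerate(expert_ids):
            for e in ids:
                plan[int(e) // EXPERTS_PER_GPU].append((tok, int(e)))
        return plan

    logits = np.random.randn(4, NUM_EXPERTS)     # 4 tokens, one layer
    plan = dispatch_plan(route(logits))
    print({g: p for g, p in plan.items() if p})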


Interesting.

I had assumed the performance advantage for MoE came from minimising traffic between GPUs. But if it's per-layer routing, then it's going to massively increase inter-GPU traffic compared to vertical slicing.

I guess that means the performance advantage actually comes when batching thousands of queries? The MoE routing would mean that on each MoE layer, each GPU shard gets a batch of tokens that all hit roughly the same subset of experts (and read the same weights from memory). The batches then get reshuffled between MoE layers to re-optimise.
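
Something like this toy bucketing is the intuition (shapes and routing are fabricated, and a real kernel fuses all of this on-GPU): group the batch by expert so each expert's weights are streamed from memory once per layer, no matter how many tokens hit it.

    import numpy as np
    from collections import defaultdict

    def moe_layer(x, expert_weights, expert_ids):
        # x: (tokens, d); expert_weights: dict expert_id -> (d, d);
        # expert_ids: (tokens, top_k) routing decision for this layer.
        out = np.zeros_like(x)
        buckets = defaultdict(list)
        for tok, ids in enumerate(expert_ids):
            for e in ids:
                buckets[int(e)].append(tok)
        for e, toks in buckets.items():          # one weight read per expert
            out[toks] += x[toks] @ expert_weights[e]
        return out

    d = 16
    weights = {e: np.random.randn(d, d) for e in range(256)}
    x = np.random.randn(1000, d)                              # big batch
    ids = np.argpartition(np.random.randn(1000, 256), -8, -1)[:, -8:]
    y = moe_layer(x, weights, ids)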

It's kind of like GPU raytracing where you get large performance gains by running coherency sorting on rays and batching similar rays together.


The performance advantage comes from doing 1/32 of the floating point operations compared to a dense layer with the same number of parameters.


The performance comes mostly from needing only a fraction of the memory bandwidth, as LLMs are mostly memory-bandwidth constrained. Compute matters too, but usually far less than memory.
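
Back-of-envelope with the commonly cited DeepSeek-V3 figures (671B total params, ~37B activated per token), assuming FP8 so roughly 1 byte per weight. Note the 1/32 above is just the routed-expert ratio; the whole-model ratio lands nearer 1/18 once attention, the dense layers and the shared experts are counted.

    # Order-of-magnitude only; FP8 (1 byte/param) is an assumption.
    total_params  = 671e9
    active_params = 37e9
    bytes_per_param = 1

    # Weight traffic per decoded token: a dense model of the same size
    # would stream ~671 GB; the MoE only touches ~37 GB.
    dense_bytes = total_params * bytes_per_param
    moe_bytes   = active_params * bytes_per_param

    # Matmul FLOPs scale the same way (~2 FLOPs per active parameter).
    dense_flops = 2 * total_params              # ~1.3 TFLOP per token
    moe_flops   = 2 * active_params             # ~0.07 TFLOP per token

    print(dense_bytes / moe_bytes)              # ~18x less weight traffic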


There are a few ways - the most basic is per-layer sharding. DeepSeek uses 3 dense layers, so those can stay on GPU0 (along with the embedding layer). There are 58 MoE layers (256 routed experts, 8 activated, plus 1 shared expert per layer). GPU1 would house layers 3 to 9, and so on.
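
A tiny sketch of that layer-to-GPU map (illustrative split - the exact number of layers per GPU would be tuned to the memory budget):

    import math

    NUM_LAYERS = 61                  # 3 dense + 58 MoE layers
    NUM_GPUS = 9                     # made-up pipeline depth
    MOE_PER_GPU = math.ceil(58 / (NUM_GPUS - 1))    # 8 MoE layers per GPU

    def layer_to_gpu(layer_idx: int) -> int:
        if layer_idx < 3:
            return 0                 # embedding + the 3 dense layers on GPU0
        return 1 + (layer_idx - 3) // MOE_PER_GPU

    print([layer_to_gpu(i) for i in range(NUM_LAYERS)])
    # GPUs 1-7 hold 8 MoE layers each, GPU8 the remaining 2.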

Then with pipeline parallelism, when new requests come in, we simply stick them in a queue across GPUs 0, 1, 2, ..., 8: Request A is at GPU 2, Request B at GPU 1, Request C at GPU 0, and so on.
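
Toy picture of that queue (purely illustrative - real servers pipeline micro-batches of tokens rather than whole requests):

    # Each step, the last stage finishes and a new request enters stage 0,
    # so every GPU works on a different request.
    from collections import deque

    NUM_STAGES = 9                             # GPUs 0..8
    pipeline = deque([None] * NUM_STAGES)      # slot i = what GPU i is serving
    incoming = deque(["A", "B", "C", "D"])

    for step in range(4):
        pipeline.pop()                         # GPU 8 finishes its request
        pipeline.appendleft(incoming.popleft() if incoming else None)
        print(step, list(pipeline))
    # After step 2: GPU0 = C, GPU1 = B, GPU2 = A, as described above.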

The other option is tensor parallelism, where we split the weights evenly. You could combine pipeline and tensor parallelism as well!
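
Minimal numpy sketch of the tensor-parallel split (column-wise here, with made-up shapes): each GPU holds a slice of the weight matrix, computes a partial output independently, and the shards get gathered back together.

    import numpy as np

    NUM_GPUS = 4
    d_in, d_out = 1024, 4096
    W = np.random.randn(d_in, d_out).astype(np.float32)
    x = np.random.randn(2, d_in).astype(np.float32)   # batch of 2 tokens

    shards = np.split(W, NUM_GPUS, axis=1)     # each "GPU" gets d_out/4 columns
    partial = [x @ w for w in shards]          # computed independently per GPU
    y = np.concatenate(partial, axis=1)        # the gather / all-gather step

    assert np.allclose(y, x @ W, atol=1e-3)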


You could do that, and add pipelining to improve speed.


Was wondering the same, but for HPC clusters :)




