
In general, how do you run these big models on cloud hardware? Do you cut them up layer-wise and run slices of layers on individual A100/H100s?



My understanding is with MoE (Mixture of Experts), you can and should shard it horizontally. The whole model is 600GB, but only 37GB is active during the evaluation of any single output token.

So you can load a different active subset of the MoE into each 80GB GPU, sharding it across something like 32 different GPUs (or can you get away with fewer? Wouldn't be surprised if they can run inference on 8x H800 GPUs). Some parameters are common, others are independent. Queries can be dynamically routed between GPUs, potentially bouncing between GPUs as often as once per output token, depending on which experts they need to activate.

Though, I suspect it's normal to stick on one MoE subset for several output tokens.

This has the secondary benefit that, as long as the routing distribution is roughly uniform, queries should be roughly load balanced across all GPUs.
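
To make that concrete, here's a minimal sketch with made-up numbers and a deliberately naive contiguous placement (real serving stacks place and balance experts much more carefully):

    # Toy sketch of expert-parallel placement (illustrative, not DeepSeek's
    # real config). 256 experts over 32 GPUs = 8 experts per GPU; a token
    # that activates 8 experts can touch almost as many distinct GPUs.
    NUM_EXPERTS = 256
    NUM_GPUS = 32
    EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS   # 8

    def gpu_for_expert(expert_id: int) -> int:
        # Contiguous placement: experts 0-7 on GPU 0, 8-15 on GPU 1, ...
        return expert_id // EXPERTS_PER_GPU

    def gpus_touched(active_expert_ids):
        # Which GPUs one token's activated experts live on.
        return sorted({gpu_for_expert(e) for e in active_expert_ids})

    # Example: the router picked these 8 experts for one token.
    print(gpus_touched([3, 17, 42, 99, 100, 181, 200, 255]))
    # -> [0, 2, 5, 12, 22, 25, 31]  (7 distinct GPUs for this one token)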


Each MoE layer has its own router, and it activates 8 (out of 256) experts at a time. There's no reason to expect all of them to stay on the same GPU, so you're pretty much guaranteed to have to do all-to-all communication between the GPUs in your cluster after every layer for every token.
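
Rough sketch of that per-layer dispatch, with made-up placement and shapes, just to show why the (token, expert) pairs have to be scattered across GPUs at every MoE layer:

    # Every MoE layer routes each token to its own top-8 of 256 experts,
    # and those experts can live on different GPUs, so the (token, expert)
    # pairs get scattered all-to-all before the expert FFNs and gathered
    # back afterwards - once per MoE layer.
    import numpy as np

    NUM_EXPERTS, TOP_K, NUM_GPUS = 256, 8, 32
    EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS

    def route(router_logits):                    # (tokens, 256)
        # Per-token top-8 expert ids, for this layer only.
        return np.argpartition(router_logits, -TOP_K, axis=-1)[:, -TOP_K:]

    def dispatch_plan(expert_ids):               # (tokens, 8)
        # Group (token, expert) pairs by the GPU owning each expert --
        # this is the payload of the all-to-all at every MoE layer.
        plan = {g: [] for g in range(NUM_GPUS)}
        for tok, ids in enumerate(expert_ids):
            for e in ids:
                plan[int(e) // EXPERTS_PER_GPU].append((tok, int(e)))
        return plan

    logits = np.random.randn(4, NUM_EXPERTS)     # 4 tokens, one layer
    plan = dispatch_plan(route(logits))
    print({g: p for g, p in plan.items() if p})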


Interesting.

I had assumed the performance advantage for MoE came from minimising traffic between GPUs. But if it's per-layer routing, then it's going to massively increase inter-GPU traffic compared to vertical slicing.

I guess that means the performance advantage actually comes when batching thousands of queries? The MoE routing would mean that on each MoE layer, each GPU shard gets a batch of tokens that all hit roughly the same subset of experts (and read the same weights from memory). The batches then get reshuffled between MoE layers to re-optimise.
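
Something like this toy bucketing is the intuition (shapes and routing are fabricated, and a real kernel fuses all of this on-GPU): group the batch by expert so each expert's weights are streamed from memory once per layer, no matter how many tokens hit it.

    import numpy as np
    from collections import defaultdict

    def moe_layer(x, expert_weights, expert_ids):
        # x: (tokens, d); expert_weights: dict expert_id -> (d, d);
        # expert_ids: (tokens, top_k) routing decision for this layer.
        out = np.zeros_like(x)
        buckets = defaultdict(list)
        for tok, ids in enumerate(expert_ids):
            for e in ids:
                buckets[int(e)].append(tok)
        for e, toks in buckets.items():          # one weight read per expert
            out[toks] += x[toks] @ expert_weights[e]
        return out

    d = 16
    weights = {e: np.random.randn(d, d) for e in range(256)}
    x = np.random.randn(1000, d)                              # big batch
    ids = np.argpartition(np.random.randn(1000, 256), -8, -1)[:, -8:]
    y = moe_layer(x, weights, ids)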

It's kind of like GPU raytracing where you get large performance gains by running coherency sorting on rays and batching similar rays together.


The performance advantage comes from doing 1/32 of the floating point operations compared to a dense layer with the same number of parameters.


The performance comes mostly from needing only a fraction of the memory bandwidth, as LLMs are mostly memory-bandwidth constrained. Compute matters too, but usually far less than memory.
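
Back-of-envelope with the commonly cited DeepSeek-V3 figures (671B total params, ~37B activated per token), assuming FP8 so roughly 1 byte per weight. Note the 1/32 above is just the routed-expert ratio; the whole-model ratio lands nearer 1/18 once attention, the dense layers and the shared experts are counted.

    # Order-of-magnitude only; FP8 (1 byte/param) is an assumption.
    total_params  = 671e9
    active_params = 37e9
    bytes_per_param = 1

    # Weight traffic per decoded token: a dense model of the same size
    # would stream ~671 GB; the MoE only touches ~37 GB.
    dense_bytes = total_params * bytes_per_param
    moe_bytes   = active_params * bytes_per_param

    # Matmul FLOPs scale the same way (~2 FLOPs per active parameter).
    dense_flops = 2 * total_params              # ~1.3 TFLOP per token
    moe_flops   = 2 * active_params             # ~0.07 TFLOP per token

    print(dense_bytes / moe_bytes)              # ~18x less weight traffic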


There are a few ways - the most basic is per-layer sharding. DeepSeek uses 3 dense layers, so those can stay on GPU0 (along with the embedding layer). There are 58 MoE layers (256 routed experts, 8 activated, plus 1 shared expert per layer). GPU1 would house layers 3 to 9, and so on.
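
A tiny sketch of that layer-to-GPU map (illustrative split - the exact number of layers per GPU would be tuned to the memory budget):

    import math

    NUM_LAYERS = 61                  # 3 dense + 58 MoE layers
    NUM_GPUS = 9                     # made-up pipeline depth
    MOE_PER_GPU = math.ceil(58 / (NUM_GPUS - 1))    # 8 MoE layers per GPU

    def layer_to_gpu(layer_idx: int) -> int:
        if layer_idx < 3:
            return 0                 # embedding + the 3 dense layers on GPU0
        return 1 + (layer_idx - 3) // MOE_PER_GPU

    print([layer_to_gpu(i) for i in range(NUM_LAYERS)])
    # GPUs 1-7 hold 8 MoE layers each, GPU8 the remaining 2.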

Then with pipeline parallelism, when new requests come in, we simply stick them in a queue across GPUs 0, 1, 2, ..., 8: Request A is at GPU 2, Request B at GPU 1, Request C at GPU 0, and so on.
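
Toy picture of that queue (purely illustrative - real servers pipeline micro-batches of tokens rather than whole requests):

    # Each step, the last stage finishes and a new request enters stage 0,
    # so every GPU works on a different request.
    from collections import deque

    NUM_STAGES = 9                             # GPUs 0..8
    pipeline = deque([None] * NUM_STAGES)      # slot i = what GPU i is serving
    incoming = deque(["A", "B", "C", "D"])

    for step in range(4):
        pipeline.pop()                         # GPU 8 finishes its request
        pipeline.appendleft(incoming.popleft() if incoming else None)
        print(step, list(pipeline))
    # After step 2: GPU0 = C, GPU1 = B, GPU2 = A, as described above.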

The other option is tensor parallelism, where we split the weights evenly. You could combine pipeline and tensor parallelism as well!
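
Minimal numpy sketch of the tensor-parallel split (column-wise here, with made-up shapes): each GPU holds a slice of the weight matrix, computes a partial output independently, and the shards get gathered back together.

    import numpy as np

    NUM_GPUS = 4
    d_in, d_out = 1024, 4096
    W = np.random.randn(d_in, d_out).astype(np.float32)
    x = np.random.randn(2, d_in).astype(np.float32)   # batch of 2 tokens

    shards = np.split(W, NUM_GPUS, axis=1)     # each "GPU" gets d_out/4 columns
    partial = [x @ w for w in shards]          # computed independently per GPU
    y = np.concatenate(partial, axis=1)        # the gather / all-gather step

    assert np.allclose(y, x @ W, atol=1e-3)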


You could do that, and add pipelining to improve speed.


Was wondering the same, but for HPC clusters :)




