> You can imagine, with 44GB of SRAM, being able to load the weights of layer N+1 into SRAM while computing layer N, thereby entirely negating the penalty of going to HBM (same idea as FSDP).
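For what it's worth, that overlap is easy to sketch with CUDA streams. A minimal PyTorch sketch, assuming each layer is just a pinned host weight matrix (`layers_cpu` and the bare matmul standing in for a layer are illustrative, not anyone's real pipeline):

```python
import torch

# Compute layer N on the default stream while a side stream copies layer
# N+1's weights host->device. For the copies to actually run async, the
# host tensors must be in pinned memory.
copy_stream = torch.cuda.Stream()

def forward_streamed(x, layers_cpu):
    buf = layers_cpu[0].to("cuda", non_blocking=True)  # stage layer 0
    for n in range(len(layers_cpu)):
        nxt = None
        if n + 1 < len(layers_cpu):
            with torch.cuda.stream(copy_stream):       # prefetch layer N+1
                nxt = layers_cpu[n + 1].to("cuda", non_blocking=True)
        x = x @ buf                                    # compute layer N
        torch.cuda.current_stream().wait_stream(copy_stream)
        buf = nxt                                      # swap in prefetched weights
    return x
```

The prefetch only hides the copy if it finishes before the compute does, which is exactly the bandwidth question below.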
You would have to have an insanely fast bus to prevent I/O stalls with this. With a 235B fp16 model you'd be streaming 470 GB of weights every graph execution; to do that at 1000 tok/s, you'd need a bus that can deliver a sustained ~470 TB/s. Even with a 32-wide MoE model, where only about 1/32 of the weights are active per token, that's still about 15 TB/s of bandwidth you'd need from the HBM to avoid stalls at 1000 tok/s.
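Here's that arithmetic spelled out (the 1/32 active fraction is the assumption above, not a published figure):

```python
# Back-of-envelope check of those figures: dense fp16, then 1/32-active MoE.
params = 235e9            # parameter count
bytes_per_param = 2       # fp16
tok_per_s = 1000          # target decode rate

dense_bw = params * bytes_per_param * tok_per_s  # bytes/s if every weight is read per token
print(f"dense: {dense_bw / 1e12:.0f} TB/s")      # -> dense: 470 TB/s

active_fraction = 1 / 32  # assumed "32-wide" MoE: 1/32 of weights touched per token
print(f"moe:   {dense_bw * active_fraction / 1e12:.0f} TB/s")  # -> moe: 15 TB/s
```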
So it would seem that either this isn't fp16, or it really is running entirely out of SRAM.
Of course, Cerebras doesn't use a dense representation, so these memory numbers could be way off, and maybe it's an SRAM+DRAM combo.