It seems to me that even if you pass in a long context on every prompt, that context is still tiny compared to the execution time on the processor/GPU/tensorcore/etc.
Let's say I load up a model of 12 GB on my 12 GB VRAM GPU. I pass in a prompt with 1 MB of context which causes a response of 500 KB after 1s. That's still only 1.5 MB of IO transferred in 1s, which kept the GPU busy for 1s. Increasing the prompt is going to increase the duration to a response accordingly.
Unless the GPU is not fully utilised on each prompt-response cycle, I feel that the GPU is still the bottleneck here, not the bus performance.
1 MB of context memory can maybe hold 10 tokens, depending on your model.
For reference, Llama 3.1 8B takes about 4 KiB per token per layer. At 32 layers that is 128 KiB per token, or 8 tokens per MiB of KV cache (context). If your context holds 8,000 tokens including responses, then you need around 1 GB.
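Back-of-envelope version of that, assuming fp16 KV cache and Llama-3-8B-ish shapes (32 layers, 8 KV heads, head dim 128 are the assumptions here):

    # KV-cache sizing sketch: fp16, Llama-3-8B-like shapes (assumed)
    n_layers, n_kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
    kv_per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # K+V: 4096 B = 4 KiB
    kv_per_token = kv_per_token_per_layer * n_layers                     # 128 KiB per token
    context_tokens = 8000
    print(kv_per_token * context_tokens / 2**30, "GiB")                  # ~0.98 GiB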
>Unless the GPU is not fully utilised on each prompt-response cycle, I feel that the GPU is still the bottleneck here, not the bus performance.
Matrix-vector multiplication implies a single floating-point multiply and add (2 FLOPs) per parameter per token. Your GPU can do way more FLOPs than that without using tensor cores at all. In fact, this workload bores your GPU to death.
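A rough roofline-style sketch of why, using illustrative (assumed) numbers for an 8B model on a consumer GPU:

    # Decode is memory-bound: time to do the math vs. time to fetch the weights
    params = 8e9                                 # assumed 8B-parameter model
    flops_per_token = 2 * params                 # one multiply + one add per parameter
    gpu_flops = 80e12                            # ~80 TFLOP/s fp16, assumed consumer GPU
    mem_bandwidth = 1e12                         # ~1 TB/s VRAM bandwidth, assumed
    bytes_per_token = params * 2                 # fp16 weights, each read once per token
    t_compute = flops_per_token / gpu_flops      # ~0.2 ms
    t_memory = bytes_per_token / mem_bandwidth   # ~16 ms
    print(t_memory / t_compute)                  # ~80x: the ALUs mostly sit idle waiting on memory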
> I feel that the GPU is still the bottleneck here, not the bus performance.
PCIe bus performance is basically irrelevant.
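To put a number on it, a sketch assuming PCIe 4.0 x16 (~32 GB/s) and the 1 MB prompt from the example above:

    # Shipping a 1 MB prompt over PCIe 4.0 x16 (~32 GB/s, assumed)
    pcie_bandwidth = 32e9                   # bytes/s
    prompt_bytes = 1e6
    print(prompt_bytes / pcie_bandwidth)    # ~3e-5 s: tens of microseconds vs. a ~1 s response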
> Token generation is completely on the card using the memory on the card, without any bus IO at all, no?
Right. But the GPU can't instantaneously access data in VRAM; it has to be copied from VRAM into GPU registers first. For every generated token, essentially the entire contents of VRAM (weights plus KV cache) have to be copied to the GPU to be computed on. It's a memory-bandwidth-bound process.
Right now there's roughly a 6x difference in memory bandwidth between low-end and high-end consumer cards (e.g., 4060 Ti vs 5090). Moving up to a B200 more than doubles that performance again.
GPU memory bandwidth is the limiting factor, not PCIe bandwidth.
Memory bandwidth is critical because every parameter has to be fetched from memory to do the computation, and there is only a small amount of computation per parameter, so memory tends to be the bottleneck.
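So a reasonable upper bound on decode speed is just bandwidth divided by bytes read per token. A sketch using approximate published bandwidth figures and the 12 GB model from the example above:

    # Upper bound on decode speed: memory bandwidth / bytes read per token
    model_bytes = 12 * 2**30          # the 12 GB model from the example, all read every token
    cards = {                         # approximate published memory bandwidths
        "RTX 4060 Ti": 288e9,
        "RTX 5090": 1792e9,
        "B200": 8e12,
    }
    for name, bw in cards.items():
        print(f"{name}: ~{bw / model_bytes:.0f} tokens/s ceiling")   # ~22, ~139, ~621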