It seems to me that even if you pass in a long context on every prompt, that context is still tiny compared to the execution time on the processor/GPU/tensorcore/etc.
Let's say I load up a model of 12 GB on my 12 GB VRAM GPU. I pass in a prompt with 1 MB of context which causes a response of 500 KB after 1s. That's still only 1.5 MB of IO transferred in 1s, which kept the GPU busy for 1s. Increasing the prompt is going to increase the duration to a response accordingly.
Unless the GPU is not fully utilised on each prompt-response cycle, I feel that the GPU is still the bottleneck here, not the bus performance.
1 MB of context memory can maybe hold 10 tokens, depending on your model.
For reference, Llama 3.1 8B takes about 4 KiB per token per layer. At 32 layers that is 128 KiB per token, or 8 tokens per MiB of KV cache (context). If your context holds 8,000 tokens including responses, then you need around 1 GB.
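Back-of-envelope version of that, assuming fp16 KV cache and Llama-3-8B-ish shapes (32 layers, 8 KV heads, head dim 128 are the assumptions here):

    # KV-cache sizing sketch: fp16, Llama-3-8B-like shapes (assumed)
    n_layers, n_kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
    kv_per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # K+V: 4096 B = 4 KiB
    kv_per_token = kv_per_token_per_layer * n_layers                     # 128 KiB per token
    context_tokens = 8000
    print(kv_per_token * context_tokens / 2**30, "GiB")                  # ~0.98 GiB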
>Unless the GPU is not fully utilised on each prompt-response cycle, I feel that the GPU is still the bottleneck here, not the bus performance.
Matrix-vector multiplication implies a single floating-point multiply and add (2 FLOPs) per parameter per token. Your GPU can do way more FLOPs than that without using tensor cores at all. In fact, this workload bores your GPU to death.
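A rough roofline-style sketch of why, using illustrative (assumed) numbers for an 8B model on a consumer GPU:

    # Decode is memory-bound: time to do the math vs. time to fetch the weights
    params = 8e9                                 # assumed 8B-parameter model
    flops_per_token = 2 * params                 # one multiply + one add per parameter
    gpu_flops = 80e12                            # ~80 TFLOP/s fp16, assumed consumer GPU
    mem_bandwidth = 1e12                         # ~1 TB/s VRAM bandwidth, assumed
    bytes_per_token = params * 2                 # fp16 weights, each read once per token
    t_compute = flops_per_token / gpu_flops      # ~0.2 ms
    t_memory = bytes_per_token / mem_bandwidth   # ~16 ms
    print(t_memory / t_compute)                  # ~80x: the ALUs mostly sit idle waiting on memory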
> I feel that the GPU is still the bottleneck here, not the bus performance.
PCIe bus performance is basically irrelevant.
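To put a number on it, a sketch assuming PCIe 4.0 x16 (~32 GB/s) and the 1 MB prompt from the example above:

    # Shipping a 1 MB prompt over PCIe 4.0 x16 (~32 GB/s, assumed)
    pcie_bandwidth = 32e9                   # bytes/s
    prompt_bytes = 1e6
    print(prompt_bytes / pcie_bandwidth)    # ~3e-5 s: tens of microseconds vs. a ~1 s response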
> Token generation is completely on the card using the memory on the card, without any bus IO at all, no?
Right. But the GPU can't instantaneously access data in VRAM; it has to be copied from VRAM into GPU registers first. For every generated token, essentially the entire contents of VRAM (weights plus KV cache) have to be copied to the GPU to be computed on. It's a memory-bandwidth-bound process.
Right now there's roughly a 6x difference in memory bandwidth between low-end and high-end consumer cards (e.g., 4060 Ti vs 5090). Moving up to a B200 more than doubles that performance again.
GPU memory bandwidth is the limiting factor, not PCIe bandwidth.
Memory bandwidth is critical because every parameter has to be fetched from memory to do the computation, and there is only a small amount of computation per parameter, so memory tends to be the bottleneck.
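So a reasonable upper bound on decode speed is just bandwidth divided by bytes read per token. A sketch using approximate published bandwidth figures and the 12 GB model from the example above:

    # Upper bound on decode speed: memory bandwidth / bytes read per token
    model_bytes = 12 * 2**30          # the 12 GB model from the example, all read every token
    cards = {                         # approximate published memory bandwidths
        "RTX 4060 Ti": 288e9,
        "RTX 5090": 1792e9,
        "B200": 8e12,
    }
    for name, bw in cards.items():
        print(f"{name}: ~{bw / model_bytes:.0f} tokens/s ceiling")   # ~22, ~139, ~621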