I don't think P2P is very relevant for inference. It's important for training. Inference can just be sharded across GPUs without sharing memory between them directly.
It can make a difference when using tensor parallelism to serve small batch sizes. Not as big a difference as in training, since we don't need to update all the weights, but still a noticeable one. Current inference engines have several all-reduce steps that are implemented with NCCL.
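Roughly like this; a minimal sketch of the kind of NCCL all-reduce step meant here, using torch.distributed (names are illustrative, and it assumes one rank per GPU with the process group already set up):

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has been called, one rank per GPU.

def row_parallel_linear(x_shard: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    """One tensor-parallel matmul: each rank holds a slice of the weight
    along the input dimension and computes a partial product."""
    partial = x_shard @ w_shard
    # The partial sums are combined with an NCCL all-reduce. With P2P
    # (NVLink / PCIe peer access), NCCL moves data GPU-to-GPU directly;
    # without it, the traffic has to bounce through host memory.
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    return partial
```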
Also, the paged KV cache is usually spread across GPUs.
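For a sense of what that means, a toy sketch of the paging idea (block tables mapping sequences to fixed-size cache blocks); all names and sizes here are made up for illustration:

```python
import torch

BLOCK_TOKENS = 16  # tokens per cache block (illustrative)

class PagedKVCache:
    """Toy paged KV cache for one GPU. With tensor parallelism each GPU
    holds a pool like this for its own shard of the attention heads, so
    the cache for a single sequence ends up spread across GPUs."""

    def __init__(self, num_blocks: int, num_heads: int, head_dim: int, device: str):
        # Physical pool of cache blocks: K and V, block-major layout.
        self.pool = torch.empty(2, num_blocks, BLOCK_TOKENS, num_heads, head_dim,
                                dtype=torch.float16, device=device)
        self.free = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks

    def grow(self, seq_id: int) -> int:
        """Allocate one more physical block to a sequence, on demand."""
        block = self.free.pop()
        self.block_tables.setdefault(seq_id, []).append(block)
        return block
```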
Batching during inference massively helps arithmetic intensity, and the batch sizes this calls for tend to exceed the memory capacity of a single GPU.
Hence the desire for training-like cluster processing, so that each time a weight is fetched from memory it can be used for every inference stream that needs it.
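Back-of-envelope version of that claim: for a decode-time matmul against a DxD weight, every weight byte fetched is reused once per stream in the batch, so arithmetic intensity grows linearly with batch size (numbers illustrative, counting only weight traffic):

```python
D = 8192       # hidden size (illustrative)
W_BYTES = 2    # fp16 weights

def arithmetic_intensity(batch: int) -> float:
    """FLOPs per byte of weight traffic for a [batch, D] @ [D, D] matmul.
    At small batch, weight reads dominate memory traffic (this ignores
    activation and KV-cache traffic for simplicity)."""
    flops = 2 * batch * D * D       # one multiply-add per weight element per stream
    weight_bytes = W_BYTES * D * D  # each weight fetched once, reused `batch` times
    return flops / weight_bytes

for b in (1, 8, 128):
    print(f"batch={b:4d}  ~{arithmetic_intensity(b):6.1f} FLOPs/byte")
# batch=1 gives ~1 FLOP/byte (hopelessly memory-bound);
# batch=128 gives ~128 FLOPs/byte (approaching compute-bound territory).
```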
It's just that you typically can't fit 100+ inference streams of context on one GPU, hence the desire to shard along dimensions that are less wasteful of memory bandwidth than entire inference streams.