I don't think P2P is very relevant for inference. It's important for training. Inference can just be sharded across GPUs without sharing memory between them directly.
It can make a difference when using tensor parallelism to serve small batch sizes. Not as big a difference as in training, since we don't need to update all the weights, but still a noticeable one. Current inference engines have several all-reduce steps that are implemented with NCCL.
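Roughly like this; a minimal sketch of the kind of NCCL all-reduce step meant here, using torch.distributed (names are illustrative, and it assumes one rank per GPU with the process group already set up):

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has been called, one rank per GPU.

def row_parallel_linear(x_shard: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    """One tensor-parallel matmul: each rank holds a slice of the weight
    along the input dimension and computes a partial product."""
    partial = x_shard @ w_shard
    # The partial sums are combined with an NCCL all-reduce. With P2P
    # (NVLink / PCIe peer access), NCCL moves data GPU-to-GPU directly;
    # without it, the traffic has to bounce through host memory.
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    return partial
```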
Also, the paged KV cache is usually spread across GPUs.
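For a sense of what that means, a toy sketch of the paging idea (block tables mapping sequences to fixed-size cache blocks); all names and sizes here are made up for illustration:

```python
import torch

BLOCK_TOKENS = 16  # tokens per cache block (illustrative)

class PagedKVCache:
    """Toy paged KV cache for one GPU. With tensor parallelism each GPU
    holds a pool like this for its own shard of the attention heads, so
    the cache for a single sequence ends up spread across GPUs."""

    def __init__(self, num_blocks: int, num_heads: int, head_dim: int, device: str):
        # Physical pool of cache blocks: K and V, block-major layout.
        self.pool = torch.empty(2, num_blocks, BLOCK_TOKENS, num_heads, head_dim,
                                dtype=torch.float16, device=device)
        self.free = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks

    def grow(self, seq_id: int) -> int:
        """Allocate one more physical block to a sequence, on demand."""
        block = self.free.pop()
        self.block_tables.setdefault(seq_id, []).append(block)
        return block
```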
Batching during inference massively helps arithmetic intensity, and the batch sizes this calls for tend to exceed the memory capacity of a single GPU.
Hence the desire for training-like cluster processing, so that each time a weight is fetched from memory it can be used for every inference stream that needs it.
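Back-of-envelope version of that claim: for a decode-time matmul against a DxD weight, every weight byte fetched is reused once per stream in the batch, so arithmetic intensity grows linearly with batch size (numbers illustrative, counting only weight traffic):

```python
D = 8192       # hidden size (illustrative)
W_BYTES = 2    # fp16 weights

def arithmetic_intensity(batch: int) -> float:
    """FLOPs per byte of weight traffic for a [batch, D] @ [D, D] matmul.
    At small batch, weight reads dominate memory traffic (this ignores
    activation and KV-cache traffic for simplicity)."""
    flops = 2 * batch * D * D       # one multiply-add per weight element per stream
    weight_bytes = W_BYTES * D * D  # each weight fetched once, reused `batch` times
    return flops / weight_bytes

for b in (1, 8, 128):
    print(f"batch={b:4d}  ~{arithmetic_intensity(b):6.1f} FLOPs/byte")
# batch=1 gives ~1 FLOP/byte (hopelessly memory-bound);
# batch=128 gives ~128 FLOPs/byte (approaching compute-bound territory).
```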
It's just that you typically can't fit 100+ inference streams of context on one GPU, hence the desire to shard along dimensions that are less wasteful of memory bandwidth than entire inference streams.