There are some interesting hacks you can do, like replicating the K/V heads by some factor so that they become evenly divisible by whatever number of GPUs you have. Obviously there is a memory cost there, but it does work.
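A rough sketch of what that replication could look like in PyTorch, assuming the K/V projection weight is stored as `[num_kv_heads * head_dim, hidden_size]`; the helper name and layout here are assumptions for illustration, not any library's actual code. Note the KV cache grows by the same factor, which is where the memory cost comes in.

```python
import torch

def replicate_kv_heads(w_kv: torch.Tensor, num_kv_heads: int, head_dim: int,
                       tp_size: int) -> torch.Tensor:
    """Duplicate each K/V head until the head count divides evenly by tp_size.

    w_kv: K or V projection weight, shape [num_kv_heads * head_dim, hidden_size]
          (assumed layout for this sketch).
    """
    if num_kv_heads % tp_size == 0:
        return w_kv  # already evenly divisible, nothing to do

    # smallest replication factor that makes the head count divisible
    factor = 1
    while (num_kv_heads * factor) % tp_size != 0:
        factor += 1

    heads = w_kv.view(num_kv_heads, head_dim, -1)     # split rows into heads
    heads = heads.repeat_interleave(factor, dim=0)    # copy each head `factor` times
    return heads.reshape(num_kv_heads * factor * head_dim, -1)
```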
How could you go about testing this on, say, Llama 3 70B with two 4090s? vLLM supports tensor parallelism, so would the expectation be that inference is faster with P2P enabled? And how would you update the NVIDIA driver? Thanks, any thoughts appreciated.
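For the vLLM side, a minimal sketch of a two-GPU tensor-parallel run. The model path is a placeholder: Llama 3 70B at fp16 needs far more than 2x24 GB, so this assumes a 4-bit quantized (e.g. AWQ) checkpoint, and even then it may be tight.

```python
from vllm import LLM, SamplingParams

# Assumed setup: two RTX 4090s (2x24 GB) and a 4-bit quantized checkpoint,
# since the full-precision 70B weights will not fit on this hardware.
llm = LLM(
    model="path/to/llama-3-70b-awq",   # hypothetical quantized checkpoint
    quantization="awq",
    tensor_parallel_size=2,            # shard the model across both GPUs
    gpu_memory_utilization=0.95,
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```

To see whether P2P is actually in effect, you can check `nvidia-smi topo -m` or `torch.cuda.can_device_access_peer(0, 1)` before and after swapping in the patched driver, then run the same vLLM benchmark both ways for a rough comparison.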