There are some interesting hacks you can do, like replicating the K/V heads by some factor so that they become evenly divisible by whatever number of GPUs you have. Obviously there is a memory cost there, but it does work.
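A rough sketch of what that replication could look like in PyTorch, assuming the K/V projection weight is stored as `[num_kv_heads * head_dim, hidden_size]`; the helper name and layout here are assumptions for illustration, not any library's actual code. Note the KV cache grows by the same factor, which is where the memory cost comes in.

```python
import torch

def replicate_kv_heads(w_kv: torch.Tensor, num_kv_heads: int, head_dim: int,
                       tp_size: int) -> torch.Tensor:
    """Duplicate each K/V head until the head count divides evenly by tp_size.

    w_kv: K or V projection weight, shape [num_kv_heads * head_dim, hidden_size]
          (assumed layout for this sketch).
    """
    if num_kv_heads % tp_size == 0:
        return w_kv  # already evenly divisible, nothing to do

    # smallest replication factor that makes the head count divisible
    factor = 1
    while (num_kv_heads * factor) % tp_size != 0:
        factor += 1

    heads = w_kv.view(num_kv_heads, head_dim, -1)     # split rows into heads
    heads = heads.repeat_interleave(factor, dim=0)    # copy each head `factor` times
    return heads.reshape(num_kv_heads * factor * head_dim, -1)
```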
How could you go about testing this on, say, Llama 3 70B with two 4090s? vLLM supports tensor parallelism, so would the expectation be that inference is faster with P2P enabled? And how would you update the NVIDIA driver? Thanks, any thoughts appreciated.
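For the vLLM side, a minimal sketch of a two-GPU tensor-parallel run. The model path is a placeholder: Llama 3 70B at fp16 needs far more than 2x24 GB, so this assumes a 4-bit quantized (e.g. AWQ) checkpoint, and even then it may be tight.

```python
from vllm import LLM, SamplingParams

# Assumed setup: two RTX 4090s (2x24 GB) and a 4-bit quantized checkpoint,
# since the full-precision 70B weights will not fit on this hardware.
llm = LLM(
    model="path/to/llama-3-70b-awq",   # hypothetical quantized checkpoint
    quantization="awq",
    tensor_parallel_size=2,            # shard the model across both GPUs
    gpu_memory_utilization=0.95,
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```

To see whether P2P is actually in effect, you can check `nvidia-smi topo -m` or `torch.cuda.can_device_access_peer(0, 1)` before and after swapping in the patched driver, then run the same vLLM benchmark both ways for a rough comparison.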