
There are some interesting hacks you can do, like replicating the K/V heads by some factor so that the head count becomes evenly divisible by whatever number of GPUs you have. Obviously there is a memory cost there, but it does work.
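A minimal sketch of what that replication could look like, assuming a grouped-query-attention layout where the K or V projection weight is [num_kv_heads * head_dim, hidden_size] (function and variable names here are illustrative, not from any particular framework):

    import torch

    def replicate_kv_heads(w_kv, num_kv_heads, head_dim, tp_size):
        # w_kv: [num_kv_heads * head_dim, hidden_size] K or V projection weight.
        # Find the smallest replication factor r such that num_kv_heads * r
        # divides evenly across the tensor-parallel ranks.
        r = 1
        while (num_kv_heads * r) % tp_size != 0:
            r += 1
        hidden = w_kv.shape[1]
        # Split into per-head blocks, repeat each head r times, flatten back.
        w = w_kv.view(num_kv_heads, head_dim, hidden)
        w = w.repeat_interleave(r, dim=0)
        return w.reshape(num_kv_heads * r * head_dim, hidden), r

    # Example: 8 KV heads on 3 GPUs -> replicate x3 to get 24 heads, 8 per GPU.
    w = torch.randn(8 * 128, 4096)
    w_rep, factor = replicate_kv_heads(w, num_kv_heads=8, head_dim=128, tp_size=3)
    print(factor, w_rep.shape)  # 3, torch.Size([3072, 4096])

The memory cost mentioned above shows up directly here: both the projection weights and the KV cache grow by the replication factor.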


How could you go about testing this on, say, Llama 3 70B with two 4090s? vLLM supports tensor parallelism, so would the expectation be that inference is faster with P2P enabled? And how would you update the NVIDIA driver? Thanks, any thoughts appreciated.
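For the vLLM side, splitting a model across two GPUs is a one-line setting. A minimal sketch using vLLM's Python API (tensor_parallel_size is vLLM's actual parameter; the model name and quantization choice below are just illustrative):

    from vllm import LLM, SamplingParams

    # Shard the model across both 4090s with tensor parallelism.
    llm = LLM(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        tensor_parallel_size=2,
        # 70B in fp16 is ~140 GB of weights, which will not fit in 2x24 GB,
        # so in practice you would need a quantized variant (e.g. AWQ/GPTQ).
        quantization="awq",
    )

    params = SamplingParams(temperature=0.7, max_tokens=128)
    out = llm.generate(["Explain tensor parallelism in one paragraph."], params)
    print(out[0].outputs[0].text)

With tensor parallelism, every forward pass does all-reduces between the two GPUs, and that inter-GPU traffic is exactly what P2P over PCIe is meant to speed up, so that is where you would expect any difference to show.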



