
(Disclaimer: I am one of the authors of the project) Thank you for the thoughtful and insightful comment. I really love the depth of your first paragraph. You highlighted a concern in this space that is often overlooked, and I am glad you raised it. We spent a significant amount of time dealing with the cost of dynamic GPU memory operations.

One useful observation is that in steady state, LLM inference does little on the host beyond continuous kernel launches or CUDA graph replay to keep the GPU busy, which leaves headroom on the CPU side. You are absolutely right that CUDA and HIP virtual memory operations are expensive and involve heavy driver work, but because most of that cost is paid on the host, they introduce only small stalls in the GPU pipeline. They are also relatively infrequent compared to kernel launches in practice, so we offload them to a background thread to keep them off the critical path. The APIs are not cheap in general, but they happen to fit LLM inference surprisingly well.
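To make the background-thread idea concrete, here is a minimal sketch (not our actual code) of a worker that services mapping requests with the CUDA driver's virtual memory APIs while the serving thread keeps launching kernels. The VmmWorker and MapRequest names are made up for illustration; the sketch assumes the virtual address range was reserved up front with cuMemAddressReserve, and error handling plus the synchronization that guarantees a region is mapped before kernels touch it are omitted.

    // Sketch: move expensive host-side VMM driver calls off the critical path.
    #include <cuda.h>
    #include <thread>
    #include <mutex>
    #include <condition_variable>
    #include <queue>

    struct MapRequest { CUdeviceptr va; size_t bytes; };  // extend mapping at va

    class VmmWorker {
     public:
      VmmWorker(CUcontext ctx, int device)
          : ctx_(ctx), device_(device), thread_([this] { Run(); }) {}

      // Called from the serving thread: just a queue push, no driver work here.
      void EnqueueMap(CUdeviceptr va, size_t bytes) {
        { std::lock_guard<std::mutex> g(mu_); queue_.push({va, bytes}); }
        cv_.notify_one();
      }

     private:
      void Run() {
        cuCtxSetCurrent(ctx_);  // worker shares the serving thread's CUDA context
        for (;;) {              // shutdown handling omitted for brevity
          MapRequest req;
          { std::unique_lock<std::mutex> lk(mu_);
            cv_.wait(lk, [this] { return !queue_.empty(); });
            req = queue_.front(); queue_.pop(); }

          // Expensive host-side driver calls happen here, off the critical path.
          CUmemAllocationProp prop = {};
          prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
          prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
          prop.location.id = device_;

          CUmemGenericAllocationHandle handle;
          cuMemCreate(&handle, req.bytes, &prop, 0);   // back with physical memory
          cuMemMap(req.va, req.bytes, 0, handle, 0);   // map into reserved VA range

          CUmemAccessDesc access = {};
          access.location = prop.location;
          access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
          cuMemSetAccess(req.va, req.bytes, &access, 1);  // enable access for kernels
        }
      }

      CUcontext ctx_;
      int device_;
      std::mutex mu_;
      std::condition_variable cv_;
      std::queue<MapRequest> queue_;
      std::thread thread_;
    };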

On your second point, I think I follow your idea, but please correct me if I misunderstood. Virtual memory does open the door to paging and offloading, which is also important for LLM systems, and we are actively working on this direction in kvcached. Your defragmentation point also reminds me of classic techniques such as compaction and garbage collection. They could certainly help, though the trade-off between the benefit and the added complexity would need more careful evaluation.
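For readers following along, here is a rough illustration of why virtual memory makes paging attractive, purely as a sketch and not what kvcached ships today: because the virtual reservation outlives the physical backing, a cold KV block can be spilled to host memory and its physical pages released without invalidating any pointers or offsets into the cache. The function name and host buffer are hypothetical, and error handling and synchronization are omitted.

    #include <cuda.h>

    // Illustration: evict one KV block while keeping its VA reservation alive.
    void EvictBlockToHost(CUdeviceptr va, size_t bytes,
                          CUmemGenericAllocationHandle handle, void* host_buf) {
      cuMemcpyDtoH(host_buf, va, bytes);   // save contents to host memory
      cuMemUnmap(va, bytes);               // unmap, but keep the VA reservation
      cuMemRelease(handle);                // return physical memory to the driver
      // Later: cuMemCreate + cuMemMap + cuMemSetAccess at the same va, then
      // cuMemcpyHtoD(va, host_buf, bytes) to bring the block back in place.
    }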

Thank you again for the thoughtful analysis. It was a pleasure to read. I would be happy to continue the discussion.




