No, we don't do anything about it at the moment. In theory we could run the judge several times with different orderings.
We could measure order bias really easily though; we just need to look at the average score by rollout position across many runs. I'll add that to my list of experiments!
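Roughly, the measurement would look something like the sketch below; this assumes scores get logged per rollout along with the position it was shown in, and the column names are made up:

```python
import pandas as pd

# Hypothetical judge log: one row per rollout per judging run, recording
# the position the rollout was presented in and the score it received.
df = pd.DataFrame({
    "run_id":   [0, 0, 0, 1, 1, 1],
    "position": [0, 1, 2, 0, 1, 2],   # presentation order within the judge prompt
    "score":    [7.5, 6.8, 6.1, 7.9, 7.0, 6.4],
})

# With shuffled orderings and enough runs, an order-agnostic judge should give
# a flat mean score across positions; a consistent trend here is the order bias.
bias = df.groupby("position")["score"].agg(["mean", "count", "sem"])
print(bias)
```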
Looks cool! With vLLM v1, prefix caching is enabled by default and seems quite performant. Is the advantage of LMCache the fact that you can offload to CPU and disk as well? How much are throughput/latency affected if you need to pull a large KV cache from disk/CPU instead of GPU RAM?
Also, how realistic would it be to share the KV cache across vLLM nodes within a data center? It would be really nice to be able to freely distribute requests to a pool of vLLM workers without worrying about prefix-aware routing, but maybe that isn't the right approach because moving the KV cache around would be too slow?
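For concreteness, the prefix-aware routing I mean can be as dumb as hashing a fixed-length token prefix to pick a worker, so requests that share a prompt prefix land on the node that already holds that KV cache. Everything here (worker URLs, prefix length) is hypothetical:

```python
import hashlib

WORKERS = [  # hypothetical pool of vLLM endpoints
    "http://vllm-0:8000",
    "http://vllm-1:8000",
    "http://vllm-2:8000",
]
PREFIX_TOKENS = 256  # how much of the prompt to treat as the shared prefix

def route(prompt_token_ids: list[int]) -> str:
    """Send requests with the same leading tokens to the same worker,
    so that worker's local prefix cache gets reused."""
    prefix = str(prompt_token_ids[:PREFIX_TOKENS]).encode("utf-8")
    bucket = int(hashlib.sha256(prefix).hexdigest(), 16) % len(WORKERS)
    return WORKERS[bucket]
```

The alternative (shipping KV blocks between nodes over the datacenter network) trades that routing constraint for inter-node bandwidth, which is exactly the cost I'm wondering about.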
I was curious about this so I had o3 do a bit of research. Turns out 300 L40s have more compute than any supercomputer before 2013 (and arguably before 2016, depending on how you count reduced-precision FLOPs).
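Back-of-envelope version, with approximate public specs (and the usual caveat that FP8 tensor-core FLOPs vs. FP64 Linpack numbers aren't really comparable, which is the "depending on how you count" part):

```python
# All figures are rough, ballpark numbers.
L40_FP32_TFLOPS = 90.5   # approx. dense FP32 per L40
L40_FP8_TFLOPS = 362.0   # approx. dense FP8 tensor-core throughput per L40

cluster_fp32_pflops = 300 * L40_FP32_TFLOPS / 1000   # ~27 PFLOPS
cluster_fp8_pflops = 300 * L40_FP8_TFLOPS / 1000     # ~109 PFLOPS

TITAN_2012_RMAX_PFLOPS = 17.6        # approx. Linpack of the top machine until mid-2013
TAIHULIGHT_2016_RMAX_PFLOPS = 93.0   # approx. Linpack of the top machine 2016-2018

print(cluster_fp32_pflops > TITAN_2012_RMAX_PFLOPS)        # beats anything pre-2013 on FP32
print(cluster_fp8_pflops > TAIHULIGHT_2016_RMAX_PFLOPS)    # only "wins" if FP8 FLOPs count
```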
The real answer is that nobody trusts their automated evals enough to be confident that any given automatically-trained release actually improves performance, even if eval scores go up. So for now everyone batches up updates and vibe-checks them before rolling them out.
It seems like the speedups here are most useful for small models, since on larger models a smaller fraction of the total time would be spent swapping between kernels? Would be interesting to see at least theoretical results for LLMs in the 14-70B parameter range, which is what most folks deploy in practice.
And of course the effect on throughput at larger batch sizes, which they allude to at the end.
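Toy illustration of why the win should shrink with model size (numbers are made up, not from the post: a few microseconds of launch/sync gap per kernel, and per-token compute that grows roughly with parameter count):

```python
# Illustrative only. Simplification: kernel count is held fixed, even though
# deeper models launch more kernels, which would partially offset the effect.
LAUNCH_GAP_US = 5.0          # assumed launch/synchronization gap per kernel
KERNELS_PER_FORWARD = 500    # assumed kernel launches per decoded token

def overhead_fraction(forward_compute_us: float) -> float:
    gap = LAUNCH_GAP_US * KERNELS_PER_FORWARD
    return gap / (gap + forward_compute_us)

# Hypothetical per-token decode compute times at batch size 1:
for name, compute_us in [("1B", 3_000), ("14B", 40_000), ("70B", 200_000)]:
    print(f"{name}: ~{overhead_fraction(compute_us):.0%} of time lost to kernel gaps")
```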
This could also give a nice speedup for MoE models with 7B-70B total parameters but roughly 10x fewer active params, e.g. https://huggingface.co/Qwen/Qwen3-30B-A3B, assuming the expert router can be effectively scheduled within the monolithic mega-kernel.