Hacker News | kcorbitt's comments

> And lately, the sweet spot has been moving upwards every 6-8 weeks with the model release cycle.

Is it?


Dang, hadn't seen that. Namespace collision strikes again.


yeah unfortunately for you this is one of the well-known long-context benchmarks. too late tho, soldier on.


I really like RLPR for when you have a known-good answer to compare to as well!


No, we don't do anything about ordering currently. Theoretically we could judge several times with different orderings.

We could measure order bias really easily though; we just need to look at the average score by rollout position across many runs. I'll add that to my list of experiments!
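
For concreteness, here's roughly the measurement I have in mind (a minimal sketch with synthetic data; the column names and numbers are just placeholders):

    import numpy as np
    import pandas as pd

    # Hypothetical judge results: one score per rollout, tagged with the
    # position that rollout occupied in the judge's prompt.
    rng = np.random.default_rng(0)
    n_runs, n_rollouts = 200, 8
    df = pd.DataFrame({
        "run": np.repeat(np.arange(n_runs), n_rollouts),
        "position": np.tile(np.arange(n_rollouts), n_runs),
        "score": rng.uniform(0, 1, n_runs * n_rollouts),
    })

    # Average score by prompt position across all runs. A flat curve suggests
    # little order bias; a consistent slope suggests the judge favors earlier
    # (or later) rollouts.
    print(df.groupby("position")["score"].mean())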


Thanks! If there are any topics that you'd find particularly interesting, let me know and I can try to find time. :)


Looks cool! With vLLM v1, prefix caching is enabled by default and seems quite performant. Is the advantage of LMCache the fact that you can offload to CPU and disk as well? How much is throughput/latency affected if you need to pull a large KV cache from disk/CPU instead of GPU RAM?

Also, how realistic would it be to share the KV cache across vllm nodes within a data center? It would be really nice to be able to freely distribute requests to a pool of vLLM workers without worrying about prefix-aware routing, but maybe that isn't the right approach because moving the KV cache around would be too slow?
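
For context, here's roughly what GPU-local prefix caching looks like in plain vLLM today, which is my baseline for comparison (a minimal sketch; the model name is just a placeholder). If I understand correctly, LMCache's value-add would be tiering those cached blocks to CPU/disk and sharing them beyond a single worker:

    from vllm import LLM, SamplingParams

    # Prefix caching is on by default in vLLM v1; shown explicitly for clarity.
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_prefix_caching=True)

    shared_prefix = "You are a helpful assistant. <long shared system prompt> "
    params = SamplingParams(max_tokens=64)

    # The second request can reuse the KV blocks computed for the shared prefix,
    # but only if it lands on the same worker -- hence the prefix-aware routing
    # question above.
    outputs = llm.generate(
        [shared_prefix + "Question A?", shared_prefix + "Question B?"],
        params,
    )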


This is exactly what llm-d is


I was curious about this so I had o3 do a bit of research. Turns out 300 L40s have more compute than any supercomputer before 2013 (and arguably before 2016, depending on how you count reduced-precision FLOPs).

https://chatgpt.com/share/685dea79-26ec-8002-bd62-7ed83aedf4...
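
The back-of-envelope behind that (spec numbers are approximate and from memory, so treat them as assumptions rather than exact figures):

    # ~90 TFLOPS FP32 per L40 (approximate), 300 of them:
    cluster_pflops = 300 * 90 / 1000
    print(cluster_pflops)             # ~27 PFLOPS

    # Titan, the top TOP500 system at the end of 2012, measured ~17.6 PFLOPS
    # (Rmax); counting reduced-precision tensor throughput widens the gap further.
    print(cluster_pflops > 17.6)      # True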


The real answer is that nobody trusts their automated evals enough to be confident that any given automatically-trained release actually improves performance, even if eval scores go up. So for now everyone batches up updates and vibe-checks them before rolling them out.


It seems like the speedups here are most useful for small models, since on larger models a smaller fraction of the total time would be spent swapping between kernels? Would be interesting to see at least theoretical results for LLMs in the 14-70B parameter range, which is what most folks deploy in practice.

And of course the effect on throughput at larger batch sizes, which they allude to at the end.

Overall a very interesting result!


This could also give a nice speedup for MoE models w/ total 7B-70B parameters but O(10x) fewer active params, e.g. https://huggingface.co/Qwen/Qwen3-30B-A3B, assuming the expert router can be effectively scheduled within the monolithic mega-kernel.


They are reducing the forward pass time from, say, 1.5ms to 1ms. On a bigger model you would likely reduce from 15ms to 14.2ms or something like that, since the overhead being eliminated is roughly fixed while the compute itself grows with model size.
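
A toy illustration of why the relative gain shrinks, assuming the saved launch/sync overhead is roughly constant per forward pass (the millisecond figures are made up for illustration):

    # Assume the megakernel eliminates ~0.5 ms of fixed per-pass overhead.
    overhead_ms = 0.5
    for compute_ms in (1.0, 14.2):    # small vs. large model forward pass
        before, after = compute_ms + overhead_ms, compute_ms
        print(f"{before:.1f} ms -> {after:.1f} ms ({before / after:.2f}x speedup)")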

