
You can amortize memory loading with continuous batching at large batch sizes; a back-of-envelope sketch is below. I imagine more compute would also help with the problem for certain workloads like speculative decoding.
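A toy roofline sketch of the amortization (all hardware and model numbers are my own assumptions, not measurements): each decode step streams the full weights from HBM once regardless of batch size, so in the memory-bound regime aggregate throughput scales almost linearly with the batch.

    # Toy roofline for memory-bound decode. Assumed numbers: a 7B-param
    # fp16 model (~14 GB of weights), ~1 TB/s of HBM bandwidth, and
    # ~150 TFLOP/s of usable fp16 compute.
    WEIGHT_BYTES = 14e9       # 7B params * 2 bytes (fp16)
    MEM_BW = 1e12             # bytes/s
    COMPUTE = 150e12          # FLOP/s
    FLOPS_PER_TOKEN = 1.4e10  # ~2 FLOPs per param per decoded token

    def step_time(batch):
        # Weights are read once per step no matter the batch size;
        # compute grows with the batch. The slower of the two wins.
        t_mem = WEIGHT_BYTES / MEM_BW
        t_compute = batch * FLOPS_PER_TOKEN / COMPUTE
        return max(t_mem, t_compute)

    for b in (1, 8, 64, 256):
        t = step_time(b)
        print(f"batch={b:4d}  step={t * 1e3:5.1f} ms  {b / t:8.0f} tok/s")

With these made-up numbers the step time sits at the ~14 ms bandwidth floor until roughly batch 150, so going from batch 1 to 64 costs nothing per step while multiplying aggregate throughput by 64.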


Batching helps throughput, and anyone running inference in production will be batching.

But it's not free: it still comes at the cost of per-stream latency.
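To put a toy number on that tradeoff (same assumed hardware figures as the sketch upthread, nothing measured): once the batch pushes a decode step past the memory-bandwidth floor, every stream's inter-token latency grows with the batch even as aggregate throughput flattens out.

    # Same assumed figures as above: 14 GB of fp16 weights, 1 TB/s HBM,
    # 150 TFLOP/s compute, ~1.4e10 FLOPs per decoded token.
    WEIGHT_BYTES, MEM_BW = 14e9, 1e12
    COMPUTE, FLOPS_PER_TOKEN = 150e12, 1.4e10

    for batch in (1, 64, 256, 1024):
        step = max(WEIGHT_BYTES / MEM_BW, batch * FLOPS_PER_TOKEN / COMPUTE)
        # Each stream gets one token per step, so its inter-token
        # latency is the full step time.
        print(f"batch={batch:5d}  inter-token latency={step * 1e3:6.1f} ms"
              f"  aggregate={batch / step:8.0f} tok/s")

At batch 1024 per-stream latency is roughly 7x worse than at batch 1 while aggregate throughput has long since saturated.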

Speculative decoding seems less effective in practice than in theory.
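One hypothetical way to see why it underdelivers (the expected-tokens formula is the standard result from the speculative decoding analysis; the acceptance rates and draft cost are assumptions I picked, not measurements): the speedup collapses quickly as the per-token acceptance rate drops, and real workloads often sit below the rates quoted in papers.

    # Toy speedup model for speculative decoding. `p` is the assumed
    # per-token probability the target model accepts a drafted token,
    # `k` the number of drafted tokens per round, and `draft_cost` the
    # draft model's cost relative to one target forward pass.
    def expected_speedup(p, k, draft_cost=0.1):
        # Expected tokens emitted per round: the accepted prefix of the
        # k drafts plus the one token the target always contributes,
        # i.e. (1 - p**(k+1)) / (1 - p).
        tokens = (1 - p ** (k + 1)) / (1 - p)
        # Cost per round, in target-forward-pass units: k draft steps
        # plus one verification pass over all drafted positions.
        return tokens / (k * draft_cost + 1.0)

    for p in (0.9, 0.7, 0.5):
        row = "  ".join(f"k={k}: {expected_speedup(p, k):.2f}x" for k in (2, 4, 8))
        print(f"p={p}:  {row}")

With an optimistic p=0.9 and k=4 this gives about 2.9x; at p=0.5 it is barely above break-even, and that is before accounting for the draft model's memory footprint.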

