
You can amortize memory loading with continuous batching at large batch sizes; a back-of-envelope sketch is below. I imagine more compute would also help with the problem for certain workloads like speculative decoding.
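A toy roofline sketch of the amortization (all hardware and model numbers are my own assumptions, not measurements): each decode step streams the full weights from HBM once regardless of batch size, so in the memory-bound regime aggregate throughput scales almost linearly with the batch.

    # Toy roofline for memory-bound decode. Assumed numbers: a 7B-param
    # fp16 model (~14 GB of weights), ~1 TB/s of HBM bandwidth, and
    # ~150 TFLOP/s of usable fp16 compute.
    WEIGHT_BYTES = 14e9       # 7B params * 2 bytes (fp16)
    MEM_BW = 1e12             # bytes/s
    COMPUTE = 150e12          # FLOP/s
    FLOPS_PER_TOKEN = 1.4e10  # ~2 FLOPs per param per decoded token

    def step_time(batch):
        # Weights are read once per step no matter the batch size;
        # compute grows with the batch. The slower of the two wins.
        t_mem = WEIGHT_BYTES / MEM_BW
        t_compute = batch * FLOPS_PER_TOKEN / COMPUTE
        return max(t_mem, t_compute)

    for b in (1, 8, 64, 256):
        t = step_time(b)
        print(f"batch={b:4d}  step={t * 1e3:5.1f} ms  {b / t:8.0f} tok/s")

With these made-up numbers the step time sits at the ~14 ms bandwidth floor until roughly batch 150, so going from batch 1 to 64 costs nothing per step while multiplying aggregate throughput by 64.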


Batching helps throughput, and anyone running inference in production will be batching.

But it's not free: it still comes at the cost of per-stream latency.
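To put a toy number on that tradeoff (same assumed hardware figures as the sketch upthread, nothing measured): once the batch pushes a decode step past the memory-bandwidth floor, every stream's inter-token latency grows with the batch even as aggregate throughput flattens out.

    # Same assumed figures as above: 14 GB of fp16 weights, 1 TB/s HBM,
    # 150 TFLOP/s compute, ~1.4e10 FLOPs per decoded token.
    WEIGHT_BYTES, MEM_BW = 14e9, 1e12
    COMPUTE, FLOPS_PER_TOKEN = 150e12, 1.4e10

    for batch in (1, 64, 256, 1024):
        step = max(WEIGHT_BYTES / MEM_BW, batch * FLOPS_PER_TOKEN / COMPUTE)
        # Each stream gets one token per step, so its inter-token
        # latency is the full step time.
        print(f"batch={batch:5d}  inter-token latency={step * 1e3:6.1f} ms"
              f"  aggregate={batch / step:8.0f} tok/s")

At batch 1024 per-stream latency is roughly 7x worse than at batch 1 while aggregate throughput has long since saturated.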

Speculative decoding seems less effective in practice than in theory.
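One hypothetical way to see why it underdelivers (the expected-tokens formula is the standard result from the speculative decoding analysis; the acceptance rates and draft cost are assumptions I picked, not measurements): the speedup collapses quickly as the per-token acceptance rate drops, and real workloads often sit below the rates quoted in papers.

    # Toy speedup model for speculative decoding. `p` is the assumed
    # per-token probability the target model accepts a drafted token,
    # `k` the number of drafted tokens per round, and `draft_cost` the
    # draft model's cost relative to one target forward pass.
    def expected_speedup(p, k, draft_cost=0.1):
        # Expected tokens emitted per round: the accepted prefix of the
        # k drafts plus the one token the target always contributes,
        # i.e. (1 - p**(k+1)) / (1 - p).
        tokens = (1 - p ** (k + 1)) / (1 - p)
        # Cost per round, in target-forward-pass units: k draft steps
        # plus one verification pass over all drafted positions.
        return tokens / (k * draft_cost + 1.0)

    for p in (0.9, 0.7, 0.5):
        row = "  ".join(f"k={k}: {expected_speedup(p, k):.2f}x" for k in (2, 4, 8))
        print(f"p={p}:  {row}")

With an optimistic p=0.9 and k=4 this gives about 2.9x; at p=0.5 it is barely above break-even, and that is before accounting for the draft model's memory footprint.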

