People use batching on servers to optimize throughput for the server as a whole, not for a single session.
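As a rough illustration (this is just a toy Python sketch with a hypothetical MAX_BATCH cap, not the serving code from the article), the idea is to pull several waiting sessions off a queue and serve them all with one model forward pass per decode step:

    from queue import Queue, Empty

    MAX_BATCH = 8  # hypothetical cap; real servers tune this for latency vs. throughput

    def decode_step(session_ids):
        """Stand-in for one model forward pass producing the next token for each session."""
        return {sid: f"<next token for session {sid}>" for sid in session_ids}

    def drain_batch(pending, max_batch=MAX_BATCH):
        """Pull up to max_batch waiting sessions so one GPU pass serves them all."""
        batch = []
        while len(batch) < max_batch:
            try:
                batch.append(pending.get_nowait())
            except Empty:
                break
        return batch

    pending = Queue()
    for sid in range(12):
        pending.put(sid)

    print(decode_step(drain_batch(pending)))  # first pass serves 8 sessions at once
    print(decode_step(drain_batch(pending)))  # next pass serves the remaining 4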

See the “throughput (tokens/s) versus concurrency” graph in that article: https://www.predera.com/blog/mistral-7b-performance-analysis...

There are other interesting graphs there; they also measured latency. They found a very strong dependency between batch size and latency, both for the first token (i.e. pre-fill) and for the time between subsequent tokens. Note how batch size = 40 delivers the best throughput in tokens/second for the server, but the first output token takes almost 4 seconds to generate, probably too slow for an interactive chat.

BTW, I used the developer tools in the browser to measure latency for the free ChatGPT 3.5, and got about 900 milliseconds until the first token. OpenAI probably balances throughput versus latency very carefully, because their user base is large and that balance directly affects their costs.
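If you would rather not eyeball the network tab, here is a rough Python equivalent of that measurement; the endpoint and payload are placeholders, not OpenAI's actual API:

    import time
    import requests  # third-party; pip install requests

    def time_to_first_chunk(url, payload):
        """Time from sending the request to receiving the first streamed chunk."""
        start = time.monotonic()
        with requests.post(url, json=payload, stream=True, timeout=60) as resp:
            for chunk in resp.iter_content(chunk_size=None):
                if chunk:
                    return time.monotonic() - start
        return None

    # Hypothetical streaming endpoint; substitute whatever chat backend you're probing.
    # print(time_to_first_chunk("https://example.com/v1/chat/stream", {"prompt": "hi"}))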



The chart you pointed out is very interesting, but it largely supports my point.

The blue line is easiest to read, so let’s look at how tokens/sec for a single user session scales as the batch size increases. It starts out at about 100 tokens/s for 5 users = 20 tokens/s/user. At the next point, it is about 19 t/s/u. Beyond this point, we start losing some ground, but even at the final data point, it is still over 11 t/s/u.

Per-session throughput drops by less than 2x even with the most unreasonably large batch size. (Unreasonable, because the time to first token is unacceptable for an interactive chat, as you pointed out.)
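To spell out that arithmetic, a tiny Python sketch using only the figures quoted above (per-session throughput is just the server throughput divided by the concurrency):

    def per_session(server_tokens_per_sec, concurrency):
        """Per-session throughput is the server throughput split across the batch."""
        return server_tokens_per_sec / concurrency

    first_point = per_session(100, 5)  # ~20 tok/s per session at batch size 5
    final_point = 11                   # "still over 11 t/s/u" at batch size 40
    print(f"drop factor: {first_point / final_point:.2f}x")  # ~1.82x, i.e. less than 2x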

But with a batch size that is balanced appropriately, the throughput for a single user session is effectively unchanged whether the service is batching at N=3 or N=10. (Or presumably N=1, but the chart doesn’t include that.) The time to first token is also a reasonable ~1 second, similar to what you measured for OpenAI.

So, with the right balance, batching increases the total throughput of the server without affecting the throughput or latency of any individual session very much. It does have some impact, of course, but from an end-user standpoint, model size and quantization seem to have a much larger impact than batching.



