What does snappier even mean in this context? The latency from connecting to a server over most network connections isn’t really noticeable when talking about text generation. If the server with a beefy datacenter-class GPU were running the same Mistral you can run on your phone, it would be spitting out hundreds of tokens per second. Most responses would appear on your screen before you blink.
There is no expectation that phones will ever be comparable in performance for LLMs.
Mistral runs at a decent clip on phones, but we’re talking like 11 tokens per second, not hundreds of tokens per second.
Server-based models tend to be only slightly faster than Mistral running on my phone because the servers are usually running much larger, much more accurate/useful models, which currently can’t fit onto phones.
Running models locally is not motivated by performance, except if you’re in places without reliable internet.
These datacenter-targeted GPUs can only output that many tokens per second for large batches. Those tokens are shared between hundreds or even thousands of users concurrently accessing the same server.
That’s why, even though these GPUs deliver very high throughput in tokens/second, responses do not appear instantly and individual users observe non-trivial latency.
Another interesting consequence: running these ML models with batch size = 1 (as you do on end-user computers or phones) is practically guaranteed to bottleneck on memory. Compute performance and tensor cores are irrelevant for that use case; the only number which matters is memory bandwidth.
For example, I’ve tested my Mistral implementation on a desktop with an Nvidia 1080 Ti versus a laptop with the Radeon Vega 7 inside a Ryzen 5 5600U. The performance difference between them is close to 10x, and it comes down to memory: 484 GB/second for the GDDR5X in the desktop versus 50 GB/second for dual-channel DDR4-3200 in the laptop. That’s despite the theoretical compute performance differing only by a factor of 6.6: 10.6 versus 1.6 TFlops.
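To put rough numbers on that bandwidth bound, here’s a back-of-the-envelope sketch in Python (the weight sizes are approximate, and KV-cache/activation traffic is ignored, so real speeds come in somewhat lower):

    def max_tokens_per_second(bandwidth_gb_s, weight_gb):
        # Each generated token streams the full weight set from memory once,
        # so decode speed is capped at bandwidth / model size.
        return bandwidth_gb_s / weight_gb

    WEIGHTS_FP16_GB = 13.0  # Mistral 7B in 16-bit, roughly
    WEIGHTS_Q4_GB = 4.0     # 4-bit quantized, my rough estimate

    for name, bw_gb_s in [("1080 Ti, GDDR5X", 484.0),
                          ("5600U, dual-channel DDR4-3200", 50.0)]:
        print(name,
              round(max_tokens_per_second(bw_gb_s, WEIGHTS_FP16_GB)), "tok/s fp16,",
              round(max_tokens_per_second(bw_gb_s, WEIGHTS_Q4_GB)), "tok/s 4-bit")

The bandwidth ratio works out to about 9.7x regardless of the quantization level, which matches the ~10x gap I measured.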
> These datacenter-targeted GPUs can only output that many tokens per second for large batches.
No… my RTX 3090 can output 130 tokens per second with Mistral on batch size 1. A more powerful GPU (with faster memory) should easily be able to crack 200 tokens per second at batch size 1 with Mistral.
At larger batch sizes, the token rate would be enormous.
Microsoft’s high-performing Phi-2 model breaks 200 tokens per second at batch size 1 on my RTX 3090. TinyLlama-1.1B hits 350 tokens per second, though its usefulness may be questionable.
We’re just used to datacenter GPUs being used for much larger models, which are much slower, and cannot fit on today’s phones.
I wonder, are you using a quantized version of Mistral? The Nvidia 3090 has 936 GB/second of memory bandwidth, so 130 tokens/second = 7.2 GB per token. In the original 16-bit format, the model takes about 13 GB.
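A quick sanity check under the same assumption that decoding is bound by streaming the weights (the ~4 GB figure for a 4-bit quant is my rough estimate):

    BW_3090_GB_S = 936.0
    print(BW_3090_GB_S / 130)   # ~7.2 GB of memory traffic per generated token
    print(BW_3090_GB_S / 13.0)  # ~72 tok/s ceiling for the 16-bit weights
    print(BW_3090_GB_S / 4.0)   # ~234 tok/s ceiling for a ~4 GB 4-bit quant

130 tokens/second sits well above the 16-bit ceiling, which is why I suspect a quantized model.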
Anyway, while these datacenter servers could deliver these speeds for a single session, in practice they don’t, because large batches result in much higher combined throughput.
> I wonder, are you using a quantized version of Mistral?
Yes, we’re comparing phone performance versus datacenter GPUs. That is the discussion point I was responding to originally. That person appeared to be asking when phones are going to be faster than datacenters at running these models. Phones are not running un-quantized 7B models. I was using the 4-bit quantized models, which are close to what phones would be able to run, and a very good balance of accuracy vs speed.
> Anyway, while these datacenter servers could deliver these speeds for a single session, in practice they don’t, because large batches result in much higher combined throughput.
I don’t agree… batching will increase latency slightly, but it shouldn’t affect throughput for a single session much if it is done correctly. I admit it probably will have some effect, of course. The point of batching is to make use of the unused compute resources, balancing compute vs memory bandwidth better. You should still be running through the layers as fast as memory bandwidth allows, not stalling on compute by making the batch size too large. Right?
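To illustrate, here’s a toy roofline-style model of batched decoding (my own simplification with rough 3090-ish numbers; KV-cache traffic, dequantization and kernel overheads are ignored, so look at the shape of the curve rather than the absolute values):

    PARAMS = 7.3e9                 # Mistral-7B-ish parameter count
    WEIGHT_BYTES = PARAMS * 0.5    # ~4-bit weights
    BANDWIDTH = 936e9              # bytes/s, RTX 3090
    COMPUTE = 71e12                # fp16 tensor FLOP/s, rough

    for batch in (1, 4, 16, 64, 256):
        t_mem = WEIGHT_BYTES / BANDWIDTH           # stream weights once per step
        t_compute = batch * 2 * PARAMS / COMPUTE   # ~2*params FLOPs per sequence
        step = max(t_mem, t_compute)               # whichever bound dominates
        print(f"batch {batch:3d}: {batch / step:6.0f} tok/s total, "
              f"{1 / step:4.0f} tok/s per session")

In this model, per-session throughput stays flat until the batch is large enough to hit the compute bound (around batch size 19 with these numbers), and only then starts to drop, while the server’s combined throughput keeps climbing. That’s the compute vs memory bandwidth balance I’m describing.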
We don’t see these speeds because datacenter GPUs are running much larger models, as I have said repeatedly. Even GPT-3.5 Turbo is huge by comparison, since it is believed to be 20B parameters. It would run at about a third of the speed of Mistral. But, GPT-4 is where things get really useful, and no one knows (publicly) just how huge that is. It is definitely a lot slower than GPT-3.5, which in turn is a lot slower than Mistral.
There are other interesting graphs there; they also measured latency. They found a very strong dependency between batch size and latency, both for the first token (i.e. pre-fill) and for the time between subsequent tokens. Note how batch size = 40 delivers the best throughput in tokens/second for the server, yet the first output token takes almost 4 seconds to generate, which is probably too slow for an interactive chat.
BTW, I used the browser’s developer tools to measure latency for the free ChatGPT 3.5, and got about 900 milliseconds until the first token. OpenAI probably balanced throughput versus latency very carefully because their user base is large, and that balance directly affects their costs.
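If anyone wants to reproduce that measurement outside the browser, something like the following works against a streaming chat endpoint (the URL and payload are placeholders; I personally just watched the network tab in the developer tools):

    import time
    import requests

    t0 = time.monotonic()
    with requests.post("https://example.com/v1/chat/completions",  # placeholder endpoint
                       json={"model": "some-model", "stream": True,
                             "messages": [{"role": "user", "content": "Hi"}]},
                       stream=True, timeout=60) as resp:
        # Time until the first streamed chunk arrives approximates time to first token.
        for chunk in resp.iter_content(chunk_size=None):
            if chunk:
                print(f"first chunk after {time.monotonic() - t0:.3f} seconds")
                break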
The chart you pointed out is very interesting, but it largely supports my point.
The blue line is easiest to read, so let’s look at how the tokens/sec for a single user session scales as the batch size increases. It starts out at about 100 tokens/s for 5 users = 20 tokens/s/user. At the next point, it is about 19 t/s/u. Beyond that, we start losing some ground, but even at the final data point it is still over 11 t/s/u.
The throughput is affected by less than 2x even with the most unreasonably large batch size. (Unreasonable, because the time to first token is unacceptable for an interactive chat, as you pointed out.)
But, with a batch size that is balanced appropriately, the throughput for a single user session is effectively unchanged whether the service is batching at N=3 or N=10. (Or presumably N=1, but the chart doesn’t include that.) The time to first token is also a reasonable 1 second delay, which is similar to what OpenAI is providing in your testing.
So, with the right batching balance, batching increases the total throughput of the server, but does not affect the throughput or latency for any individual session very much. It does have some impact, of course. Model size and quantization seem to have a much larger impact than batching, from an end user standpoint.