Benchmarks show even an old Nvidia RTX 3090 is enough to serve LLMs to thousands

Cordiali · on Aug 24, 2024

> "even an old Nvidia RTX 3090"

Saying that like it's mediocre... Maybe I'll have to benchmark my old 1050, see what it can do!

metadat · on Aug 23, 2024

How does 12 tokens/sec equate to satisfactorily serving thousands of end customers?

I did enjoy the headline, and The Register was formerly an incredibly good news outlet, prior to the founder passing away last week.

atherton33 · on Aug 24, 2024

From the article, it's not 12 total, but 12 per user for 100 concurrent generations.
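As a rough sanity check on that reading, aggregate throughput is the per-user rate multiplied by the number of concurrent streams. The sketch below assumes every stream sustains the same rate, which the thread does not confirm; the function name is made up for illustration.

```python
def aggregate_tokens_per_sec(per_user_rate, concurrent_users):
    """Total tokens/sec across all streams, assuming each stream runs at the same rate."""
    return per_user_rate * concurrent_users


# Figures quoted in the comment above: 12 tokens/sec per user, 100 concurrent generations.
total = aggregate_tokens_per_sec(12, 100)
print(f"aggregate throughput: {total} tokens/sec")
```

So the headline figure corresponds to roughly 1,200 tokens/sec across the card, not 12.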

metadat · on Aug 24, 2024

Thank you atherton friend.

washadjeffmad · on Aug 24, 2024

Capacity doesn't equate to quality, but I could easily see an 8B finetune with exl2 at low context working for short, simple customer interactions, akin to oversubscribing a 1Gbit uplink for 100 customers at 50Mbps.
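The uplink analogy can be put in numbers. This is a minimal sketch of the oversubscription ratio implied by the figures in the comment; the helper name is hypothetical.

```python
def oversubscription_ratio(customers, sold_mbps, uplink_mbps):
    """Ratio of total bandwidth sold to bandwidth actually available."""
    return customers * sold_mbps / uplink_mbps


# Figures from the comment: 100 customers at 50 Mbps each on a 1 Gbit (1000 Mbps) uplink.
ratio = oversubscription_ratio(100, 50, 1000)
print(f"oversubscription: {ratio}x")
```

A ratio of 5 means the link only holds up if, on average, no more than one customer in five is saturating their connection at the same moment.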

stevenhuang · on Aug 24, 2024

This is wildly misleading as the benchmarks make use of batching. It will entirely fall apart in real workloads where each prompt is different. If you're doing batch processing with a fixed prompt, the results will be more applicable.

PeterStuer · on Aug 24, 2024

It depends. For batching to be viable, the prompts have to share some similarity of context or intention, which is quite often the case in specific applications (as opposed to, say, general chat).

fooblaster · on Aug 24, 2024

Guess what Nvidia won't let you deploy in a data center!

iAkashPaul · on Aug 24, 2024

Pretty sure this was never in question for batched requests; sg-lang/lmdeploy/tensorRT-LLM will reach nearly twice the reported speeds with INT8 (fp16 A100 benched here: https://github.com/sgl-project/sglang?tab=readme-ov-file#ben...)

Havoc · on Aug 26, 2024

Bought a 3090 because they are good value for this, but this logic is frankly a little ridiculous:

> Since only a small fraction of users are likely to be making requests at any given moment

So what if 5 out of the thousands happen to coincide?

kristoo · on Aug 26, 2024

The benchmark result is for 100 concurrent users.
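One way to frame the "what if requests coincide" question is to estimate how likely it is that more than 100 users are generating at once. The sketch below models each user as independently active with some fixed probability, so concurrent demand is binomial. The user count and activity rate are hypothetical placeholders, not figures from the article, and the independence assumption is a simplification.

```python
import math


def binomial_tail(n, p, threshold):
    """Return P(X > threshold) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are computed in log space so large n does not overflow.
    """
    log_p = math.log(p)
    log_q = math.log1p(-p)
    total = 0.0
    for k in range(threshold + 1, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * log_p + (n - k) * log_q
        )
        total += math.exp(log_pmf)
    return total


# Hypothetical numbers: 5000 users, each generating 1% of the time.
# The threshold of 100 is the concurrency level the benchmark is reported for.
overflow = binomial_tail(n=5000, p=0.01, threshold=100)
print(f"P(more than 100 concurrent) = {overflow:.3e}")
```

Under these assumptions the expected concurrency is 50, and exceeding 100 is very unlikely. Real traffic tends to be bursty and correlated (time of day, outages elsewhere), so an estimate like this is optimistic.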