Benchmarks show even an old Nvidia RTX 3090 is enough to serve LLMs to thousands (theregister.com)
45 points by mfiguiere on Aug 23, 2024 | 11 comments


> "even an old Nvidia RTX 3090"

Saying that like it's mediocre... Maybe I'll have to benchmark my old 1050, see what it can do!


How does 12 tokens/sec equate to satisfactorily serving thousands of end customers?

I did enjoy the headline, and The Register was formerly an incredibly good news outlet, prior to the founder passing away last week.


From the article, it's not 12 total, but 12 per user for 100 concurrent generations.
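To spell out the arithmetic, here is a quick sketch. The 12 tokens/sec and 100 concurrent generations are the figures quoted above; the function name is made up for illustration, and it assumes every stream runs at the same rate.

```python
def aggregate_tokens_per_sec(per_user_tps, concurrent_users):
    """Total generation throughput, assuming every stream runs at per_user_tps."""
    return per_user_tps * concurrent_users


# Figures quoted in the thread: 12 tokens/sec per user, 100 concurrent generations.
total = aggregate_tokens_per_sec(12, 100)
print(total)  # 1200
```

So the card is producing on the order of 1,200 tokens/sec in aggregate, not 12.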


Thank you atherton friend.


Capacity doesn't equate to quality, but I could easily see an 8B finetune with exl2 at low context working for short, simple customer interactions, akin to oversubscribing a 1Gbit uplink for 100 customers at 50Mbps.
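For anyone who wants the uplink analogy as a number, a minimal sketch follows. The function name is hypothetical; the inputs are the 100 customers, 50Mbps plans, and 1Gbit uplink from the comment above.

```python
def oversubscription_ratio(customers, plan_mbps, uplink_mbps):
    """Bandwidth sold to customers divided by the bandwidth actually available."""
    return customers * plan_mbps / uplink_mbps


# 100 customers on 50 Mbps plans sharing a 1 Gbit (1000 Mbps) uplink.
ratio = oversubscription_ratio(100, 50, 1000)
print(ratio)  # 5.0
```

A 5:1 ratio only works because customers rarely saturate their plans at the same moment, which is the same bet the article makes about LLM users.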


This is wildly misleading as the benchmarks make use of batching. It will entirely fall apart in real workloads where each prompt is different. If you're doing batch processing with a fixed prompt, the results will be more applicable.


It depends. For batching to be viable, the prompts have to share some similarity of context or intention, which is quite often the case in specific applications (as opposed to, say, general chat).


Guess what Nvidia won't let you deploy in a data center!


Pretty sure this was never in question for batched requests. SGLang/LMDeploy/TensorRT-LLM will reach nearly twice the reported speeds with INT8 (fp16 A100 benchmarks here: https://github.com/sgl-project/sglang?tab=readme-ov-file#ben...)


Bought a 3090 because they are good value for this, but this logic is frankly a little ridiculous:

> Since only a small fraction of users are likely to be making requests at any given moment

So what if 5 out of the thousands happen to coincide?


The benchmark result is for 100 concurrent users.
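A rough sketch of how 100 concurrent slots could stretch to "thousands" of users, under the article's premise that only a small fraction are active at once. The 5% active share is purely an assumed number for illustration, not something from the article or the benchmark, and the function name is hypothetical.

```python
def users_supported(concurrent_slots, active_percent):
    """Total user base whose simultaneously active share fits in concurrent_slots."""
    return concurrent_slots * 100 / active_percent


# 100 concurrent generations (benchmark figure); 5% active at once (assumed).
print(users_supported(100, 5))  # 2000.0
```

If the active share spikes above the assumed figure, requests queue or per-user speed drops, so the headline number depends heavily on that assumption.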



