
How slow? Depending on the task, I fear it could be too slow to be useful.

I believe there is some research on how to distribute large models across multiple GPUs, which could make the cost less lumpy.



You can get a decent approximation for LLM performance in tokens/second by dividing the model size in GB by the system's memory bandwidth. That's assuming it's well-optimized and memory rather than compute bound, but those are often both true or pretty close.

And "depending on the task" is the point. There are systems that would be uselessly slow for real-time interaction, but if your concern is processing confidential data you don't want to upload to a third party, you can just let it run and come back whenever it finishes. And releasing the model allows people to do the latter even if the machines necessary to do the former are still prohibitively expensive.

Also, hardware gets cheaper over time and it's useful to have the model out there so it's well-optimized and stable by the time fast hardware becomes affordable instead of waiting for the hardware and only then getting to work on the code.


Why would increasing memory bandwidth reduce performance? You said: "You can get a decent approximation for LLM performance in tokens/second by dividing the model size in GB by the system's memory bandwidth."


Yeah, the sentence is backwards: you divide the system's memory bandwidth by the model size.
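The corrected rule of thumb can be sketched in a few lines. This is a back-of-the-envelope estimate, not a benchmark: it assumes memory-bound decoding where every generated token requires streaming the full set of weights once, and the hardware numbers below are illustrative assumptions, not measurements.

```python
def tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed for a memory-bound LLM:
    tokens/s ~= memory bandwidth (GB/s) / model size (GB)."""
    return bandwidth_gb_s / model_size_gb

# A ~70 GB model (e.g. 70B params at 8-bit) on ~100 GB/s dual-channel DDR5:
print(tokens_per_second(70, 100))   # ~1.4 tokens/s

# The same model on a GPU with ~1000 GB/s HBM:
print(tokens_per_second(70, 1000))  # ~14.3 tokens/s
```

This also shows why the backwards version fails the sanity check upthread: written as size/bandwidth, more bandwidth would lower the number, whereas bandwidth/size correctly scales speed with bandwidth.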




