
How slow? Depending on the task, I fear it could be too slow to be useful.

I believe there is some research on how to distribute large models across multiple GPUs, which could make the cost less lumpy.



You can get a decent approximation for LLM performance in tokens/second by dividing the model size in GB by the system's memory bandwidth. That's assuming it's well-optimized and memory rather than compute bound, but those are often both true or pretty close.

And "depending on the task" is the point. There are systems that would be uselessly slow for real-time interaction, but if your concern is processing confidential data you don't want to upload to a third party, you can just let it run and come back whenever it finishes. And releasing the model allows people to do the latter even if the machines necessary to do the former are still prohibitively expensive.

Also, hardware gets cheaper over time and it's useful to have the model out there so it's well-optimized and stable by the time fast hardware becomes affordable instead of waiting for the hardware and only then getting to work on the code.


Why would increasing memory bandwidth reduce performance? You said: "You can get a decent approximation for LLM performance in tokens/second by dividing the model size in GB by the system's memory bandwidth."


Yeah, the sentence is backwards: you divide the system's memory bandwidth by the model size.
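The corrected rule of thumb can be sketched in a few lines. This is a back-of-the-envelope estimate, not a benchmark: it assumes memory-bound decoding where every generated token requires streaming the full set of weights once, and the hardware numbers below are illustrative assumptions, not measurements.

```python
def tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed for a memory-bound LLM:
    tokens/s ~= memory bandwidth (GB/s) / model size (GB)."""
    return bandwidth_gb_s / model_size_gb

# A ~70 GB model (e.g. 70B params at 8-bit) on ~100 GB/s dual-channel DDR5:
print(tokens_per_second(70, 100))   # ~1.4 tokens/s

# The same model on a GPU with ~1000 GB/s HBM:
print(tokens_per_second(70, 1000))  # ~14.3 tokens/s
```

This also shows why the backwards version fails the sanity check upthread: written as size/bandwidth, more bandwidth would lower the number, whereas bandwidth/size correctly scales speed with bandwidth.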




