You can get a decent approximation for LLM performance in tokens/second by dividing the model size in GB by the system's memory bandwidth. That assumes the implementation is well optimized and memory-bound rather than compute-bound, but both are often true, or close to it.
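For a concrete back-of-the-envelope version of that estimate, here is a minimal Python sketch. The model size and bandwidth figures are made-up illustrative numbers; the division gives seconds per token, and the reciprocal gives tokens per second.

    # Rough estimate for memory-bandwidth-bound token generation.
    # The numbers below are illustrative assumptions, not measurements.
    model_size_gb = 70 * 2      # e.g. a 70B-parameter model at ~2 bytes per weight (fp16)
    bandwidth_gb_s = 800        # e.g. a GPU with ~800 GB/s of memory bandwidth

    # Every weight has to be read from memory once per generated token,
    # so the memory traffic per token is roughly the model size.
    seconds_per_token = model_size_gb / bandwidth_gb_s
    tokens_per_second = 1 / seconds_per_token

    print(f"~{seconds_per_token:.3f} s/token, ~{tokens_per_second:.1f} tokens/s")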
And "depending on the task" is the point. There are systems that would be uselessly slow for real-time interaction but if your concern is to have it process confidential data you don't want to upload to a third party you can just let it run and come back whenever it finishes. And releasing the model allows people to do the latter even if machines necessary to do the former are still prohibitively expensive.
Also, hardware gets cheaper over time, and it's useful to have the model out there so the software is well optimized and stable by the time fast hardware becomes affordable, instead of waiting for the hardware and only then getting to work on the code.
Why would increasing memory bandwidth reduce performance? You said "You can get a decent approximation for LLM performance in tokens/second by dividing the model size in GB by the system's memory bandwidth"
I believe there is some research on how to distribute large models across multiple GPUs, which could make the cost less lumpy.
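For illustration, a minimal sketch of the simplest form of that idea: naive layer-wise sharding across two CUDA devices in PyTorch, with the layer sizes and device names as assumptions. Real multi-GPU schemes (tensor or pipeline parallelism, as in frameworks like Megatron-LM or DeepSpeed) are considerably more involved, but the basic point is that no single GPU has to hold all the weights.

    import torch
    import torch.nn as nn

    class ShardedMLP(nn.Module):
        """Toy model whose layers are split across two GPUs."""
        def __init__(self):
            super().__init__()
            # First half of the weights lives on GPU 0, second half on GPU 1,
            # so neither device needs to hold the whole model.
            self.part1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
            self.part2 = nn.Linear(4096, 4096).to("cuda:1")

        def forward(self, x):
            x = self.part1(x.to("cuda:0"))
            x = self.part2(x.to("cuda:1"))  # activations hop between devices
            return x

    if torch.cuda.device_count() >= 2:
        model = ShardedMLP()
        out = model(torch.randn(8, 4096))
        print(out.shape, out.device)  # torch.Size([8, 4096]) cuda:1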