That's what happens when consumer demand rapidly shifts, and businesses start panic-buying and panic-cancelling. As far as I recall, actual chip fab output didn't really change that much.
I asked ChatGPT about this. It says the root cause was demand collapse at the start of COVID. Fabs stopped producing the many low-end chips required for modern cars and retooled/pivoted to higher-end chips. When the auto manufacturers came back knocking after COVID, the fabs didn't want/need their business of low-end chips.
Don’t confuse inference (API usage) with the consumer plan products. When people say inference is profitable, they are referring to the cost to serve a token via the API. The consumer products are absolutely a question mark on profitability, and as we see with most of the business and enterprise plans, they are giving way to pure on-demand use (API pricing) full time.
Profitability doesn't imply infinite ability to scale. Of course they will want to prioritize their most profitable customers when they hit capacity issues.
Those are subscription plans. They tweaked the limits/periods included in the subscription. Having higher limits for subscription plans didn't give them any more revenue.
They do it because their demand is higher than the compute they have available. Their GPUs must be melting during peak hours, so they're encouraging people to move their workloads to off-peak hours where possible.
Assuming an 80GB H100 running inference on an MoE model that nearly fills the 80GB of VRAM, you're going to see around 10k tokens/second fully batched and saturated. An example here might be Mixtral 8x7B.
That's about 36 million tokens/hour. Mixtral 8x7B on OpenRouter is priced at $0.54/M input tokens and $0.54/M output tokens, so you're looking at potentially a $38.88/hour return on that H100 GPU. This is probably the best-case scenario.
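A back-of-the-envelope sketch of that arithmetic (assumptions: the 10k tokens/second figure above, OpenRouter's $0.54/M pricing, and, to reproduce the $38.88 figure, billed input volume roughly equal to output volume; none of these are measured numbers):

    # Back-of-the-envelope H100 inference revenue estimate.
    # Assumptions (from the comment above, not measurements): 10k tokens/second
    # fully batched, $0.54 per million tokens for both input and output on
    # OpenRouter, and billed input volume roughly equal to output volume
    # (that last assumption is what gets you from $19.44 to ~$38.88).
    TOKENS_PER_SECOND = 10_000
    PRICE_PER_M_TOKENS = 0.54  # USD, same price for input and output

    output_tokens_per_hour = TOKENS_PER_SECOND * 3600                   # 36,000,000
    output_revenue = output_tokens_per_hour / 1e6 * PRICE_PER_M_TOKENS  # $19.44

    # If billed input tokens roughly match output tokens, revenue about doubles.
    total_revenue = 2 * output_revenue                                  # ~$38.88/hour

    print(f"{output_tokens_per_hour:,} output tokens/hour")
    print(f"${output_revenue:.2f}/hour from output tokens alone")
    print(f"~${total_revenue:.2f}/hour if input volume matches output")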
In reality, inference providers will use multiple GPUs together to run bigger, smarter models for a higher price.
$3.99 at 8x instances, with a minimum two-week commitment. Good luck getting a 70% average usage during that time. Useful when you're running a training run and can properly gauge demand, not so great when you're offering an API.
It says the numbers are theoretically possible. Requiring 66% usage to break even, when 100% usage will piss off customers by forcing them into a queue, means it's a balancing act.
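A minimal sketch of that balancing act; the cost input below is hypothetical (the thread doesn't state the exact cost figure behind the 66%), the point is just the formula:

    # Break-even utilization: the fraction of the time the hardware must be
    # serving paid tokens just to cover its own hourly cost. The cost figure
    # below is hypothetical, chosen only to illustrate how a ~66% break-even
    # falls out of the formula; it is not a number from this thread.
    def break_even_utilization(cost_per_hour: float, max_revenue_per_hour: float) -> float:
        return cost_per_hour / max_revenue_per_hour

    # Hypothetical: an all-in cost of ~$26/hour against the ~$38.88/hour
    # best-case revenue from the earlier estimate gives roughly two-thirds.
    util = break_even_utilization(cost_per_hour=26.0, max_revenue_per_hour=38.88)
    print(f"Break-even utilization: {util:.0%}")  # ~67%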
“Technically correct. The best kind of correct”. So inference may technically be _capable_ of being profitable, but I have questions about them being profitable in _practice_.