
How is it that cloud LLMs can be so much cheaper? Especially given that local compute, RAM, and storage are often orders of magnitude cheaper than cloud.

Is it possible that this is an AI bubble subsidy where we are actually getting it below cost?

Of course, for conventional compute the cloud markup is ludicrous, so maybe this is just cloud economies of scale with a much smaller markup.



My guess is two things:

1. Economies of scale. Cloud providers run clusters of tens of thousands of GPUs. I think they are able to run inference much more efficiently than you could on a small cluster built just for your own needs.

2. As you mentioned, they are selling at a loss. OpenAI is hugely unprofitable, and they reportedly lose money on every query.


The purchase price for an H100 is dramatically lower when you buy a few thousand at a time.


I think batch processing of many requests is cheaper. As each layer of the model is loaded into cache, you can push many prompts through it. Running it locally, you don't have that benefit.
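
To make that concrete, here's a minimal sketch of batched generation with Hugging Face transformers (the model name and prompts are placeholders, not anything from the article): one generate() call decodes every prompt in lockstep, so each pass over the weights serves the whole batch.

    # Sketch of why batching helps: the weights are read from memory once
    # per decoding step and shared across every prompt in the batch.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "your-local-model"  # placeholder: any causal LM checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token   # needed to pad a batch
    tokenizer.padding_side = "left"             # pad on the left for decoder-only models
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )

    prompts = ["Summarize document A ...", "Summarize document B ...", "Summarize document C ..."]
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

    # One generate() call decodes all prompts together, so each pass over
    # the weights produces one new token for every prompt in the batch.
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))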


> Especially given that local compute, RAM, and storage are often orders of magnitude cheaper than cloud

He uses old, much less efficient GPUs.

He also did not choose where he lives based on electricity prices, unlike the cloud providers.


It's cheaper because you are unlikely to run your local AI at top capacity 24/7, so you are paying for unused capacity.


The calculation shows it's cheaper even if you run the local AI 24/7.


They are specifically referring to usage of APIs where you just pay by the token, not by compute. In this case, you aren’t paying for capacity at all, just usage.


It is shared between users, so it's better utilized and optimized.


"Sharing between users" doesn't make it cheaper. It makes it more expensive due to the inherent inefficiencies of switching user contexts. (Unless your sales people are doing some underhanded market segmentation trickery, of course.)


No, batched inference can work very well. Depending on architecture, you can get 100x or even more tokens out of the system if you feed it multiple requests in parallel.
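
Rough back-of-envelope for where a number like 100x comes from; the weight size and bandwidth below are assumptions, not measurements. Decoding is mostly memory-bandwidth-bound, so the per-step cost is set by reading the weights, and a bigger batch amortizes that read.

    # Back-of-envelope with made-up but plausible numbers: time per decoding
    # step is roughly (bytes of weights read) / (memory bandwidth), largely
    # independent of batch size until you hit compute or KV-cache limits.
    weights_gb = 140            # e.g. a ~70B model at fp16 (assumption)
    bandwidth_gb_s = 3000       # aggregate HBM bandwidth (assumption)

    step_time_s = weights_gb / bandwidth_gb_s   # ~0.047 s per decoding step

    for batch in (1, 32, 256):
        tokens_per_s = batch / step_time_s
        print(f"batch={batch:4d}: ~{tokens_per_s:,.0f} tokens/s")
    # batch=1 gives ~21 tok/s; batch=256 gives ~5,500 tok/s from the same
    # weight traffic, which is where claims like "100x" come from.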


Couldn't you do this locally just the same?

Of course that doesn't map well to an individual chatting with a chat bot. It does map well to something like "hey, laptop, summarize these 10,000 documents."


Yes, and people do that. Some get thousands of tokens per second that way with affordable setups (e.g. 4x 3090). I was addressing the GP, who said there are no economies of scale in having multiple users.
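
For concreteness, this is the kind of offline batch job people run locally, sketched with vLLM; the model name, sampling settings, and 4-way tensor parallelism are my assumptions, not anyone's actual rig.

    from vllm import LLM, SamplingParams

    # Hypothetical local setup: 4 GPUs via tensor parallelism; the model
    # name is a placeholder.
    llm = LLM(model="your-favorite-local-model", tensor_parallel_size=4)
    params = SamplingParams(temperature=0.2, max_tokens=256)

    prompts = [f"Summarize document {i}: ..." for i in range(10_000)]

    # vLLM batches and schedules these itself (continuous batching), which
    # is how a box of consumer GPUs can reach thousands of aggregate tokens/s.
    for out in llm.generate(prompts, params):
        print(out.outputs[0].text[:80])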


Isn't that just because they can get massive discounts on hardware by buying in bulk (for lack of a proper term), plus absorb losses?


All that, but also because they have those GPUs with crazy amounts of RAM and crazy bandwidth? So the TPS is that much higher, but in terms of power, I guess those boards run in the same power ballpark as consumer GPUs?
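
Back-of-envelope using approximate published specs (ballpark figures, happy to be corrected): since decoding is mostly bandwidth-bound, bandwidth per watt is a crude proxy for tokens per joule.

    # Approximate published specs; treat the numbers as ballpark, not exact.
    gpus = {
        "H100 SXM (HBM3)": (3350, 700),   # (GB/s, board power in W)
        "RTX 4090":        (1008, 450),
        "RTX 3090":        (936,  350),
    }
    for name, (bw, watts) in gpus.items():
        print(f"{name:18s} ~{bw / watts:4.1f} GB/s per watt")
    # The datacenter part draws similar-ish power per board but moves
    # roughly twice the bytes per joule, on top of a much larger VRAM pool.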



