Hacker News

I’m trying to figure out the cost-predictability angle here. It seems like they still have a cost per input/output token, so how is it any different? Also, should I assume one GPU instance will scale automatically as traffic goes up?

LLM pricing is pretty intense if you’re using anything beyond an 8B model, at least that’s what I’m noticing on OpenRouter. 3-4 calls can approach $1 with bigger models, and certainly with frontier ones.



Serverless setups (like Cerebrium) charge per second that the model is running; it's not token-based.


You're still paying more than the GPU typically costs on an hourly basis to take advantage of their per-second billing. And if you don't have enough utilization to saturate an hourly rental, your users are going to be constantly running into cold starts, which tend to be brutal for larger models.
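To make the tradeoff concrete, here's a rough break-even sketch: below some utilization fraction, per-second serverless billing beats an always-on hourly rental, and above it the rental wins. The prices are made-up placeholders, not any provider's actual rates.

```python
# Break-even utilization between per-second serverless billing and an
# hourly GPU rental. All prices are hypothetical placeholders.
SERVERLESS_PER_SEC = 0.0012   # $/s, billed only while the model is running
HOURLY_RENTAL = 2.00          # $/h, billed whether busy or idle

def hourly_cost_serverless(busy_fraction: float) -> float:
    """Cost of one wall-clock hour if the GPU is busy `busy_fraction` of it."""
    return SERVERLESS_PER_SEC * 3600 * busy_fraction

# Serverless is cheaper only when busy_fraction is below this threshold:
break_even = HOURLY_RENTAL / (SERVERLESS_PER_SEC * 3600)
print(f"serverless wins below {break_even:.0%} utilization")
```

With these placeholder numbers the crossover lands around half utilization; the point is that per-second billing only pays off at low, bursty traffic, which is exactly the regime where cold starts bite.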

Their A100 80GB goes for more than what I pay to rent H100s. If you really want to save money, hunting down the cheapest hourly rentals possible is the only way you have any hope of beating the major providers.

I think people vastly underestimate how much companies like OpenAI can do with inference efficiency between large nodes, large batch sizes, and hyper optimized inference stacks.


I'll echo one of my original concerns: how is this supposed to scale? Am I responsible for that?


How is what supposed to scale?

If you mean the serverless GPU offering, typically you set a cap for how many requests a single instance is meant to serve. Past that cap they'll spin up more instances.
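The cap-based autoscaling described above boils down to simple arithmetic: warm instances scale with in-flight requests divided by the per-instance cap. A minimal sketch (the function name and behavior are illustrative, not any provider's actual API):

```python
import math

def instances_needed(in_flight_requests: int, per_instance_cap: int) -> int:
    """How many instances a cap-based autoscaler would keep warm."""
    return max(1, math.ceil(in_flight_requests / per_instance_cap))

print(instances_needed(45, 20))  # 3 instances for 45 concurrent requests
```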

But if you mean rentals, scaling is on you. With LLM inference there's a regime where the model responses will slow down on a per-user basis while overall throughput goes up, but eventually you'll run out of headroom and need more servers.

Another reason why generally speaking it's hard to compete with major providers on cost effectiveness.


> Past that cap they'll spin up more instances.

Thank you, this is what I wanted to know.

> typically you set a cap for how many requests a single instance is meant to serve

If this is on us, then we'd have to make sure whatever caps we set beat API providers. I don't know how easy that cap is to figure out.
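The comparison itself is straightforward once you've measured throughput at a given cap: hourly GPU price divided by aggregate tokens per hour gives you a cost per million tokens to stack against the API rate. A sketch with placeholder numbers (none of these are real prices):

```python
# Hypothetical cost-per-token comparison: self-hosting vs. an API provider.
GPU_HOURLY = 2.00        # $/h rental price (placeholder)
TOKENS_PER_SEC = 900.0   # aggregate throughput at your chosen cap (measured)
API_PER_MTOK = 0.60      # $/1M output tokens from an API provider (placeholder)

self_host_per_mtok = GPU_HOURLY / (TOKENS_PER_SEC * 3600) * 1_000_000
print(f"self-hosted: ${self_host_per_mtok:.2f}/Mtok vs API ${API_PER_MTOK:.2f}/Mtok")
```

The hard part, as the thread notes, isn't this division; it's getting a trustworthy TOKENS_PER_SEC number at a cap your users will actually tolerate.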


If you're making the effort-cost tradeoff like this, you typically choose a model, test a few inference stacks with prompts of representative length for your use case, then benchmark.

To benchmark, you identify the maximum time-to-first-token your users will accept and the minimum tokens per second they'll tolerate, then test how many concurrent requests you can handle before you exceed either limit.
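That search loop can be sketched as: step concurrency up until either budget breaks. Here `measure` is a stand-in for a real load test against your endpoint; for illustration it's replaced by a toy model where per-stream speed degrades as concurrency grows.

```python
# Capacity search: largest concurrency that still meets both latency budgets.
TTFT_BUDGET_S = 1.0       # max acceptable time-to-first-token
MIN_TOKS_PER_SEC = 20.0   # min acceptable per-stream generation speed

def measure(concurrency: int) -> tuple[float, float]:
    """Placeholder for a real benchmark run; returns (ttft_s, toks_per_sec).

    Toy model: TTFT grows and per-stream speed shrinks with concurrency.
    """
    ttft = 0.2 + 0.05 * concurrency
    toks = 120.0 / (1 + 0.1 * concurrency)
    return ttft, toks

def max_concurrency() -> int:
    ok = 0
    for c in range(1, 256):
        ttft, toks = measure(c)
        if ttft > TTFT_BUDGET_S or toks < MIN_TOKS_PER_SEC:
            break
        ok = c
    return ok

print(max_concurrency())
```

Swap the toy `measure` for real timed requests (streamed, with representative prompt lengths) and the number this returns is the per-instance cap discussed upthread.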

In my case, the only reason the pricing is somewhat competitive for self-hosting is that I'm aggressively seeking cheap rentals, have a use case that requires very long prompts with few cache hits, and have used extensive (and expensive) post-training to deploy smaller models than I'd otherwise need.


Ah you’re right, misread the OpenAI/Cerebrium pricing config variables.



