I don't for a minute believe Deepseek v3 was built with a $6M rental.
Their paper (arXiv:2412.19437) explains they used 2048 H800s. A computer cluster based on 2048 GPUs would have cost around $400M about two years ago when they built it. (Give or take, feel free to post corrections.)
The point is they got it done cheaper than OpenAI/Google/Meta/... etc.
But not cheaply.
I believe the markets are overreacting. Time to buy (tinfa).
They pointed out that the cost calculation assumes those GPUs were rented at $2/hr. They are not factoring in the prior cost of buying those H800s, because they didn't buy them just to build R1. They are not factoring in the cost to build v2 or v2.5; the cost is to build V3. The cost to build R1-Zero and R1 on top of V3 seems far cheaper, and they didn't mention it. They are not factoring in the cost to build out their datacenter, or salaries. Just the training cost. They made it clear: if you could rent equivalent GPUs at $2/hr, it would cost you about $6 million.
"Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data."
V3 was released a bit over a month ago. V3 is not what took the world by storm, though; R1 is. But the price everyone is talking about is the price for V3.
If this weren't an attempt to sell a false equivalency, at least one story would have details on the equivalent rental cost of compute used to train closed source frontier models from OpenAI, Anthropic, Mistral... Lack of clarity makes it a story.
>>Just the training cost. They made it clear. If you could rent equivalent GPUs at $2/hr, it would cost you about $6million.
This is still quite impressive, given most people are more likely to rent cloud infrastructure from AWS or Azure than to build their own datacenter. So the math checks out.
I don't think compute capacity built already will go waste, likely more and bigger things will get built in the coming years so most of it will be used for that purpose.
You’re confusing the metric for reality. The point is to compare the cost of training in terms of node hours with a given configuration. That’s how you get apples to apples. Of course it doesn’t cover building the cluster, housing the machine, the cleaning staff’s pension, or whatever.
The math they gave was 2,788,000 H800 GPU hours[1], with a rental price of $2/GPU-hour[1], which works out to $5.6M. If they did that on a cluster of 2048 H800s, then they could re-train the model every ~1400 hours (~2 months).
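That arithmetic can be checked directly (the GPU-hour count and rental rate are from the V3 paper; the 2048-GPU cluster size is from the same report):

```python
# Back-of-the-envelope check of the figures quoted above.
gpu_hours = 2_788_000      # total H800 GPU-hours, per the V3 paper
rental_rate = 2.0          # USD per GPU-hour, per the V3 paper
cluster_gpus = 2048        # cluster size

training_cost = gpu_hours * rental_rate
wall_clock_hours = gpu_hours / cluster_gpus

print(f"Training cost: ${training_cost / 1e6:.3f}M")      # $5.576M
print(f"Wall clock: {wall_clock_hours:.0f} hours "
      f"(~{wall_clock_hours / 24 / 30:.1f} months)")      # ~1361 hours, ~2 months
```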
If they paid $70,000 per GPU[2] plus $5000 per 4-GPU compute node (random guess), then the hardware would have cost about $150M to build. If you add in network hardware and other data-centery-things, I could see it reaching into the $200M range. IMO $400M might be a bit of a stretch but not too wildly off base.
To reach parity with the rental price, they would have needed to re-train 70 times (i.e. over 12 years). They obviously did not do that, so I agree it's a bit unfair to cost this based on $2/hr GPU rentals. Why did they buy instead of rent? Probably because it's not actually that cheap to get 2048 concurrent high-performance connected GPUs for 60 days. Or maybe just because they had cash for capex.
Something like an H100 is definitely a feat of engineering, though.
Nothing prevents Cooler Master from releasing a line of equally performant GPUs and, while at it, even cheaper ones. But when we measure reality, after the wave function of hype and good intentions collapses... oh yeah, turns out only Nvidia is making those chips, whoops...
Looking around a bit, the price was ~$70k USD _in China_ around the time they were released in 2023; cheaper bulk deals were a thing, too.
Note that these are the China prices, with a high markup due to export controls etc.
The price of an H800 80GB in the US today is more like ~$32k USD.
But to use H800 clusters well you also need the fastest possible interconnects, enough motherboards, enough fast storage, cooling, a building, interruption-free power, etc. So the cost of building an H800-focused datacenter is much, much higher than multiplying GPU cost by GPU count.
You can’t buy the GPUs individually, and even if you can on a secondary market, you can’t use them without the baseboard, you can’t use the baseboard without a compatible chassis, and a compatible chassis is full of CPUs, system memory, etc. On top of that, you need a fabric. Even if you cheap out and go RoCE instead of InfiniBand, it’s still 400 Gb/s HCAs, optics, and switches.
Yea, a node in a cluster costs as much as an American house. Maybe not on its own, but to make it useful for large scale training, even under the new math of deepseek, it costs as much as a house.
They estimated $200k for a single NVIDIA GPU-based server, complete with RAM and networking. That's where my number came from. (RAM, and especially very-high-speed networking, is very expensive at these scales.)
"Add it all up, and the average selling price of an Nvidia GPU accelerated system, no matter where it came from, was just under $180,000, the average server SXM-style, NVLink-capable GPU sold for just over $19,000 (assuming the GPUs represented around 85 percent of the cost of the machine)"
That implies they assumed an 8-GPU system. (8 × $19,000 = $152,000 ≈ 85% × $180,000)
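That inference can be reproduced from the quoted averages:

```python
# Implied GPU count per system, from the quoted average prices.
avg_system_price = 180_000   # USD, average Nvidia-accelerated system
avg_gpu_price = 19_000       # USD, average SXM-style NVLink GPU
gpu_share = 0.85             # GPUs assumed to be ~85% of system cost

gpus_per_system = avg_system_price * gpu_share / avg_gpu_price
print(f"Implied GPUs per system: {gpus_per_system:.1f}")   # ~8
```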
To clarify, a legitimate benchmark for training is to calculate the running cost, not the capex cost, because obviously the latter drops dramatically with the number of models you train. But to put it into context, Meta wants to spend $50B on AI this year alone, and it already has 150x the compute of DS. The very real math going through investors' heads is: what's stopping Zuck from taking $10B of that and mailing a $100 million signing bonus to every name on the R1 paper?
The $6M that is thrown around is from the DS V3 paper and is for the cost of a single training run for DeepSeek V3 - the base model that R1 is built on.
The number does not include the cost of personnel, experiments, data preparation, or chasing dead ends, and most importantly, it does not include the reinforcement learning step that made R1 good.
Furthermore, it is not factored in that both R1 and V3 are built on top of an enormous amount of synthetic data that was generated by other LLMs.
Comparing the cost of buying with the cost of running is weird. It's not like they built a new cluster, trained just this one model, and then incinerated everything.
They bought between 10k and 50k of them before the US restrictions came into place. Sounds like DeepSeek gets to use them for training, as they were profitable (could still be, not sure).
Electricity in China, even at residential rates, is 1/10th the cost it is in CA.
I think the salient point here is that the "price to train" a model is a flashy number that's difficult to evaluate out of context. American companies list the public cloud price to make it seem expensive; Deepseek has an incentive to make it sound cheap.
The real conclusion is that world-class models can now be trained even if you're banned from buying Nvidia cards (because they've already proliferated), and that open-source has won over the big tech dream of gatekeeping the technology.
Over the last few days people have asked me whether I think NVIDIA is fkd. It still takes two H100s to run inference on DS v3 671b at <200 tokens per second.
There are different versions of the model, as well as different levels of quantization to run them at.
Some variants of DeepSeek-R1 can be run on 2x H100 GPUs, and some people managed to get quite decent results with an even more heavily distilled model running on consumer hardware.
For DeepSeek-V3 even with 4bit quantization you need more like 16x H100.
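A rough way to see why memory is the constraint: weight storage alone is parameter count times bytes per weight, and since V3 is a mixture-of-experts model, all 671B parameters have to be resident even though only a fraction is active per token. This sketch counts weights only and ignores KV cache, activations, and runtime overhead:

```python
# Rough VRAM needed just to hold DeepSeek-V3's weights at various precisions.
# Weights only -- no KV cache, activations, or framework overhead.
params = 671e9            # total parameters (MoE: all must be resident)
h100_vram_gib = 80        # usable VRAM per H100, roughly

for bits in (16, 8, 4):
    weight_gib = params * bits / 8 / 2**30
    gpus = weight_gib / h100_vram_gib
    print(f"{bits}-bit weights: {weight_gib:,.0f} GiB -> >= {gpus:.1f} H100s")
```

Weights alone at 4-bit would fit in far fewer than 16 cards; the larger counts presumably come from KV cache, batching headroom, and parallelism overhead in real serving setups.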