... Nvidia GH200 Grace Hopper Superchip-powered supercomputer ...
"What is the difference to alternative systems with the same amount of memory?
- Compared to 8x Nvidia H100, GH200 costs 5x less, consumes 10x less energy and has roughly the same performance.
- Compared to 8x Nvidia A100, GH200 costs 3x less, consumes 5x less energy and has a higher performance.
- Compared to 4x AMD Mi300X, GH200 costs 2x less, consumes 4x less energy and has probably roughly the same performance.
- Compared to 4x AMD Mi300A (which has only 512 GB memory, more is not possible because the maximum number of scale-up infinity links is 4), GH200 costs significantly less, consumes 3x less energy and has probably a higher performance.
- Compared to 8x Nvidia RTX A6000 Ada which has significantly less memory (only 384GB), GH200 costs significantly less, consumes 3x less energy and has a higher performance.
- Compared to 8x AMD Radeon PRO W7900 which has significantly less memory (only 384GB), GH200 costs the same, consumes 3x less energy and has a higher performance."
This is a weird comparison, written to make everything look good for the GH200.
There are a bunch of tradeoffs that aren't considered, and some of the comparisons don't make sense: the GH200 is a CPU+GPU, so comparing it against GPUs only is odd.
There is no such thing as a 4x MI300 chassis; they are all 8x.
"I started experimenting with Nvidia's RTX 4090s. I bought a bunch of them and put them into a mining rack and just ran some tests. I quickly figured out that is not the way to go,"
Well, I hope they were at least smart enough to use PCIe 4.0 x16... otherwise that will never, ever work well. 2x 4090s saturate my PCIe 4.0 x16 bus during training workloads. If they were on PCIe 3.0 x1 risers... that's 32x lower inter-GPU bandwidth! (Rough numbers after this comment.)
Also, for 50K per GPU, and all this hacking, buy an older 8x A100 system with water-cooling or something.
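Back-of-the-envelope check on that ratio (a sketch, assuming nominal per-lane PCIe throughput of roughly 0.985 GB/s for gen3 and 1.97 GB/s for gen4, one direction; real-world numbers are somewhat lower):

```python
# Rough PCIe bandwidth comparison using nominal per-lane figures (GB/s, one direction).
GB_PER_LANE = {"gen3": 0.985, "gen4": 1.969}

def link_bandwidth(gen: str, lanes: int) -> float:
    """Nominal one-direction bandwidth of a PCIe link in GB/s."""
    return GB_PER_LANE[gen] * lanes

full_slot = link_bandwidth("gen4", 16)   # typical GPU slot: ~31.5 GB/s
x1_riser = link_bandwidth("gen3", 1)     # mining-style x1 riser: ~1.0 GB/s

print(f"PCIe 4.0 x16: {full_slot:.1f} GB/s")
print(f"PCIe 3.0 x1 : {x1_riser:.2f} GB/s")
print(f"Ratio       : {full_slot / x1_riser:.0f}x")  # ~32x
```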
Really depends on the model and the software tricks you're using. With DDP and gradient accumulation, you can reduce the bandwidth bottleneck by quite a bit. We've trained with 4090s running at x4 lanes with very little impact. And running at x4 means you can stuff up to 26-28 GPUs on a single CPU node (say, an Epyc), keep PCIe-level latency, and get rid of the networking hassle.
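Not the poster's actual code, just a minimal sketch (PyTorch assumed) of the DDP + gradient-accumulation pattern being described: gradients are all-reduced once every ACCUM_STEPS micro-batches instead of every micro-batch, which is what takes the pressure off a narrow PCIe link. Model and batch shapes are placeholders.

```python
import os
from contextlib import nullcontext

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

ACCUM_STEPS = 8  # more accumulation -> fewer all-reduces -> less inter-GPU traffic

def main():
    dist.init_process_group("nccl")              # one process per GPU, launched via torchrun
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(4096, 4096).cuda(rank), device_ids=[rank])  # placeholder model
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):
        for micro in range(ACCUM_STEPS):
            x = torch.randn(32, 4096, device=rank)          # placeholder micro-batch
            # Skip the gradient all-reduce on all but the last micro-batch.
            sync_ctx = nullcontext() if micro == ACCUM_STEPS - 1 else model.no_sync()
            with sync_ctx:
                loss = model(x).pow(2).mean() / ACCUM_STEPS  # placeholder loss
                loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 train.py`; the key piece is `no_sync()`, which defers gradient synchronization so the link only carries one all-reduce per optimizer step.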
Interesting, I would expect the impact to be noticeable at x4! And yeah, it heavily depends on the model, sharding method, and model vs. data parallelism. I'm hitting peak bandwidth because of a very wide, shallow model that is split across GPUs model-parallel, with CPU optimizer offload - so worst-case scenario there.
But it does kind of validate Nvidia's choice to remove NVLink. How useful would it really be if x4 PCIe gets reasonably decent perf? Unless your inner dim is massive or something, you should be fine.
Never got around to writing any public docs. It's essentially a bunch of GPUs on custom aluminum extrusion frames sitting in a server rack, connected to a ROMED8-2T motherboard through PCIe splitters.
Power limited to 240 W, with negligible performance loss while halving energy usage (power-limiting sketch after this comment); uses three 20 A circuits.
Performance can range anywhere from 2x 4090 = 1x A100 to 4x 4090 = 1x A100, depending on the model, etc.
It's great value for the money, and very easy to resell as well.
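For reference, a small sketch of the power cap described above, assuming the nvidia-ml-py (pynvml) bindings are installed and the script runs with root privileges; the CLI equivalent would be something like `nvidia-smi -i <idx> -pl 240`.

```python
import pynvml

LIMIT_WATTS = 240  # the per-card cap mentioned above

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        # Clamp the target to whatever the board actually allows (values are in milliwatts).
        min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
        target_mw = max(min_mw, min(LIMIT_WATTS * 1000, max_mw))
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)
        print(f"GPU {i}: power limit set to {target_mw / 1000:.0f} W")
finally:
    pynvml.nvmlShutdown()
```

The limit does not persist across reboots, so it is typically reapplied from a startup script.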
I meant each card is limited to 240 W, instead of the usual 450 W. Also, it's more like 4 circuits after all, because the main CPU/motherboard/2 GPUs are on a 15 A circuit too.
Ah! OK, thank you, now I get it. That's a very nice rig you have there. So, at a guess, you didn't care as much about peak compute capacity as long as whatever you're doing fits in GPU memory, and this is your way of collecting that much memory in a single machine while still having reasonable interconnect speeds between GPUs?
Yeah, it's really just trying to get as much compute as possible, as cheaply as possible, interconnected in a reasonably fast way with low latency. Slow networking would be a bottleneck, and expensive high-end networking would defeat the purpose of staying cheap.
You'd be surprised at how cheap high-end networking that outperforms PCIe 4.0 x4 is - 100 Gb Omni-Path NICs are going for $20 on eBay! And those will saturate PCIe 3.0 x16 (back-of-the-envelope numbers after this comment).
Though of course, with multiple boards/RAM/CPUs it gets complicated again.
Note that I don't know those eBay sellers at all; they're just some of the cheaper results that show up when searching. There seem to be plenty of other results too. :)
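Rough numbers behind that claim (nominal link rates, ignoring protocol overhead):

```python
# Nominal bandwidth comparison: 100 Gb/s NIC vs the PCIe links mentioned above.
nic = 100 / 8           # 100 Gb/s on the wire -> 12.5 GB/s
pcie3_x16 = 0.985 * 16  # ~15.8 GB/s nominal
pcie4_x4 = 1.969 * 4    # ~7.9 GB/s nominal

print(f"100 Gb NIC  : {nic:.1f} GB/s")
print(f"PCIe 3.0 x16: {pcie3_x16:.1f} GB/s  (roughly what the NIC needs for line rate)")
print(f"PCIe 4.0 x4 : {pcie4_x4:.1f} GB/s  (the NIC outruns this)")
```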
A mining rack would mean the metal frame where the 4090s are mounted, not the motherboard. Since they pulled off this build, we will assume they were smart enough to use a server motherboard with enough PCIe slots and lanes.
Ahhh. I was thinking they might have meant one of the mining-specific motherboards, which commonly have a bunch of PCIe x16 physical slots with only PCIe x1 electrical connections.
> ... and due to his preference for keeping work out of the cloud
What a German motivation! I, too, am all for keeping things under your own control, and this is certainly a very cool exercise ... but I don't quite see why you can't just ... use the server as a server. Then you can connect to it from any portable device.
Really, what's the point of "workstations" nowadays, at least for this type of application?
Or: I could just put the OEM version into a rack in another room. Why does it need to be on my desk? It’s not like this is some kind of graphics board that needs to sit a short distance from my monitor.
If you were really serious about this, I would think throwing the server into a colocation center would be the way to go. I believe those costs can be as low as a few hundred a month. Security, power, and cooling are no longer your problem.
Definitely. The amps are the expensive part, but if you have a consistently high load, the savings over residential electricity rates might mean you're effectively only paying $100-200 for all those other advantages.
If you're dropping $50k on a workstation, you are either a business or in the upper tier of renters/homeowners. Nobody who can afford this is living in a closet out of necessity.