... Nvidia GH200 Grace Hopper Superchip-powered supercomputer ...
"What is the difference to alternative systems with the same amount of memory?
- Compared to 8x Nvidia H100, GH200 costs 5x less, consumes 10x less energy and has roughly the same performance.
- Compared to 8x Nvidia A100, GH200 costs 3x less, consumes 5x less energy and has a higher performance.
- Compared to 4x AMD Mi300X, GH200 costs 2x less, consumes 4x less energy and has probably roughly the same performance.
- Compared to 4x AMD Mi300A (which has only 512 GB memory, more is not possible because the maximum number of scale-up infinity links is 4), GH200 costs significantly less, consumes 3x less energy and has probably a higher performance.
- Compared to 8x Nvidia RTX A6000 Ada which has significantly less memory (only 384GB), GH200 costs significantly less, consumes 3x less energy and has a higher performance.
- Compared to 8x AMD Radeon PRO W7900 which has significantly less memory (only 384GB), GH200 costs the same, consumes 3x less energy and has a higher performance."
This is a weird comparison, written to make everything look good for the GH200.
There are a bunch of tradeoffs that aren't considered, and some of the comparisons don't make sense: the GH200 is a CPU+GPU, so comparing it against GPUs only is odd.
There is no such thing as a 4x MI300 chassis; they are all 8x.
"I started experimenting with Nvidia's RTX 4090s. I bought a bunch of them and put them into a mining rack and just ran some tests. I quickly figured out that is not the way to go,"
Well, I hope they were at least smart enough to use PCIe 4.0 x16... otherwise that will never, ever work well. 2x 4090s saturate my PCIe 4.0 x16 bus during training workloads. If they were on PCIe 3.0 x1 risers... that's 32x lower inter-GPU bandwidth! (Rough numbers after this comment.)
Also, for 50K per GPU, and all this hacking, buy an older 8x A100 system with water-cooling or something.
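Back-of-the-envelope check on that ratio (a sketch, assuming nominal per-lane PCIe throughput of roughly 0.985 GB/s for gen3 and 1.97 GB/s for gen4, one direction; real-world numbers are somewhat lower):

```python
# Rough PCIe bandwidth comparison using nominal per-lane figures (GB/s, one direction).
GB_PER_LANE = {"gen3": 0.985, "gen4": 1.969}

def link_bandwidth(gen: str, lanes: int) -> float:
    """Nominal one-direction bandwidth of a PCIe link in GB/s."""
    return GB_PER_LANE[gen] * lanes

full_slot = link_bandwidth("gen4", 16)   # typical GPU slot: ~31.5 GB/s
x1_riser = link_bandwidth("gen3", 1)     # mining-style x1 riser: ~1.0 GB/s

print(f"PCIe 4.0 x16: {full_slot:.1f} GB/s")
print(f"PCIe 3.0 x1 : {x1_riser:.2f} GB/s")
print(f"Ratio       : {full_slot / x1_riser:.0f}x")  # ~32x
```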
Really depends on the model and the software tricks you're using. With DDP and gradient accumulation, you can reduce the bandwidth bottleneck by quite a bit. We've trained with 4090s running at x4 lanes with very little impact. And running at x4 means you can stuff up to 26-28 GPUs on a single CPU node (say, an Epyc), keep PCIe-level latency, and get rid of the networking hassle.
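Not the poster's actual code, just a minimal sketch (PyTorch assumed) of the DDP + gradient-accumulation pattern being described: gradients are all-reduced once every ACCUM_STEPS micro-batches instead of every micro-batch, which is what takes the pressure off a narrow PCIe link. Model and batch shapes are placeholders.

```python
import os
from contextlib import nullcontext

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

ACCUM_STEPS = 8  # more accumulation -> fewer all-reduces -> less inter-GPU traffic

def main():
    dist.init_process_group("nccl")              # one process per GPU, launched via torchrun
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(4096, 4096).cuda(rank), device_ids=[rank])  # placeholder model
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):
        for micro in range(ACCUM_STEPS):
            x = torch.randn(32, 4096, device=rank)          # placeholder micro-batch
            # Skip the gradient all-reduce on all but the last micro-batch.
            sync_ctx = nullcontext() if micro == ACCUM_STEPS - 1 else model.no_sync()
            with sync_ctx:
                loss = model(x).pow(2).mean() / ACCUM_STEPS  # placeholder loss
                loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 train.py`; the key piece is `no_sync()`, which defers gradient synchronization so the link only carries one all-reduce per optimizer step.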
Interesting, I would expect the impact to be noticeable at x4! And yeah, it heavily depends on the model, sharding method, and model vs. data parallelism. I'm hitting peak bandwidth because of a very wide, shallow model that is split across GPUs model-parallel, with CPU optimizer offload - so worst-case scenario there.
But it does kind of validate Nvidia's choice to remove NVLink. How useful would it really be if x4 PCIe gets reasonably decent perf? Unless your inner dim is massive or something, you should be fine.
Never got around to writing any public docs. It's essentially a bunch of GPUs on custom aluminum extrusion frames sitting in a server rack, connected to a ROMED8-2T motherboard through PCIe splitters.
Power limited to 240 W, with negligible performance loss while halving energy usage (power-limiting sketch after this comment); uses three 20 A circuits.
Performance can range anywhere from 2x 4090 = 1x A100 to 4x 4090 = 1x A100, depending on the model, etc.
It's great value for the money, and very easy to resell as well.
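For reference, a small sketch of the power cap described above, assuming the nvidia-ml-py (pynvml) bindings are installed and the script runs with root privileges; the CLI equivalent would be something like `nvidia-smi -i <idx> -pl 240`.

```python
import pynvml

LIMIT_WATTS = 240  # the per-card cap mentioned above

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        # Clamp the target to whatever the board actually allows (values are in milliwatts).
        min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
        target_mw = max(min_mw, min(LIMIT_WATTS * 1000, max_mw))
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)
        print(f"GPU {i}: power limit set to {target_mw / 1000:.0f} W")
finally:
    pynvml.nvmlShutdown()
```

The limit does not persist across reboots, so it is typically reapplied from a startup script.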
I meant each card is limited to 240 W, instead of the usual 450 W. Also, it's more like 4 circuits after all, because the main CPU/motherboard/2 GPUs are on a 15 A circuit too.
Ah! OK, thank you, now I get it. That's a very nice rig you have there. So, at a guess, you didn't care as much about peak compute capacity as long as whatever you're doing fits in GPU memory, and this is your way of collecting that much memory in a single machine while still having reasonable interconnect speeds between GPUs?
Yeah, it's really just trying to get as much compute as possible, as cheaply as possible, interconnected in a reasonably fast way with low latency. Slow networking would be a bottleneck, and expensive high-end networking would defeat the purpose of staying cheap.
You'd be surprised at how cheap high-end networking that outperforms PCIe 4.0 x4 is - 100 Gb Omni-Path NICs are going for $20 on eBay! And those will saturate PCIe 3.0 x16 (back-of-the-envelope numbers after this comment).
Though of course, with multiple boards/RAM/CPUs it gets complicated again.
Note that I don't know those eBay sellers at all; they're just some of the cheaper results that show up when searching. There seem to be plenty of other results too. :)
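Rough numbers behind that claim (nominal link rates, ignoring protocol overhead):

```python
# Nominal bandwidth comparison: 100 Gb/s NIC vs the PCIe links mentioned above.
nic = 100 / 8           # 100 Gb/s on the wire -> 12.5 GB/s
pcie3_x16 = 0.985 * 16  # ~15.8 GB/s nominal
pcie4_x4 = 1.969 * 4    # ~7.9 GB/s nominal

print(f"100 Gb NIC  : {nic:.1f} GB/s")
print(f"PCIe 3.0 x16: {pcie3_x16:.1f} GB/s  (roughly what the NIC needs for line rate)")
print(f"PCIe 4.0 x4 : {pcie4_x4:.1f} GB/s  (the NIC outruns this)")
```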
A mining rack would mean the metal frame where the 4090s are mounted, not the motherboard. Since they pulled off this build, we will assume they were smart enough to use a server motherboard with enough PCIe slots and lanes.
Ahhh. I was thinking they might have meant one of the mining-specific motherboards, which commonly have a bunch of PCIe x16 physical slots with only PCIe x1 electrical connections.
> ... and due to his preference for keeping work out of the cloud
What a German motivation! I, too, am all for keeping things under your own control, and this is certainly a very cool exercise ... but I don't quite see why you can't just ... use the server as a server. Then you can connect to it from any portable device.
Really, what's the point of "workstations" nowadays, at least for this type of application?
Or: I could just put the OEM version into a rack in another room. Why does it need to be on my desk? It’s not like this is some kind of graphics board that needs to sit a short distance from my monitor.
If you were really serious about this, I would think throwing the server into a colocation center would be the way to go. I believe those costs can be as low as a few hundred a month. Security, power, and cooling are no longer your problem.
Definitely. The amps are the expensive part, but if you have a consistently high load, the savings over residential electricity rates might mean you're effectively only paying $100-200 for all those other advantages.
If you're dropping $50k on a workstation, you are either a business or in the upper tier of renters/homeowners. Nobody who can afford this is living in a closet out of necessity.