
It means you can send data from the memory of 1 GPU to another GPU without going via RAM. https://xilinx.github.io/XRT/master/html/p2p.html
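For a concrete (hypothetical) illustration, here is a minimal PyTorch sketch, assuming a box with at least two CUDA GPUs: it checks whether the driver can do peer-to-peer DMA between device 0 and device 1, then copies a tensor across.

    import torch

    if torch.cuda.device_count() >= 2:
        # True if the driver can route DMA directly between the two devices
        # (over PCIe or NVLink) without bouncing through system RAM.
        print("P2P 0 -> 1 possible:", torch.cuda.can_device_access_peer(0, 1))

        src = torch.randn(1024, 1024, device="cuda:0")
        # When peer access is available this is a direct GPU-to-GPU copy;
        # otherwise the driver stages it through host memory behind the scenes.
        dst = src.to("cuda:1")
        torch.cuda.synchronize()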


Is this really efficient or practical? My understanding is that the latency required to copy memory from CPU or RAM to GPU negates any performance benefits (much less running over a network!)


Yes, the point here is that you do a direct write from one card's memory to the other using PCIe.

In older NVidia cards this could be done through a faster link called NVLink but the hardware for that was ripped out of consumer grade cards and is only in data center grade cards now.

Until this post it seemed like they had ripped all such functionality out of their consumer cards, but it looks like you can still get it working at lower speeds using the PCIe bus.


> In older NVidia cards this could be done through a faster link called NVLink but the hardware for that was ripped out of consumer grade cards and is only in data center grade cards now.

NVLink is still very much available in both the RTX 3090 and A6000, both of which are still on the market. It was indeed removed from the RTX 40 series[0].

[0]: https://www.pugetsystems.com/labs/articles/nvidia-nvlink-202...


So what's stopping somebody from buying a ton of cheap GPUs and wiring them up via P2P like we saw with crypto mining?


Crypto mining only needs 1 PCIe lane per GPU, so you can fit 24+ GPUs on a standard consumer CPU motherboard (24-32 lanes depending on the CPU). Apparently ML workloads require more interconnect bandwidth when doing parallel compute, so each card in this demo system uses 16 lanes, and therefore requires 1) full-size slots, and 2) Epyc[0] or Xeon based systems with 128 lanes (or at least more than 32 lanes).

Per 1) above, crypto "boards" have lots of x1 (or x4) slots, the really short PCIe slots. You then use a riser that runs over a USB3 cable to a full-size slot on a small board with power connectors on it. If your board only has x8 or x16 slots (the full-size slots), you can buy a breakout PCIe board that splits one into four slots, again using 4 USB3 cables to boards with full-size slots and power connectors. These are different from the PCIe riser boards you can buy for cases that allow the GPUs to be placed vertically rather than horizontally, as those have a full x16 "fabric" interconnecting the riser and the board with the x16 slot on it.

[0] I didn't read the article because I'm not planning on buying a Threadripper (48-64+ lanes) or an Epyc (96-128 lanes?) just to run AI workloads when I could just rent them for the kind of usage I do.


Oooo, got a link to one of these fabric boards? I've been playing with stupid PCIe tricks but that's a new one on me.


https://www.amazon.com/gp/product/B07DMNJ6QM/

I used to use this one when I had all three of my NVMe -> 4x SATA boardlets installed and therefore could not fit a GPU in a PCIe slot due to the cabling mess.


Oh, um, just a flexible riser.

I thought we were using "fabric" to mean "switching matrix".


That's what this thread is about. Geohot is doing that.


Crypto mining could make use of lots of GPUs in a single cheap system precisely because it did not need any significant PCIe bandwidth, and would not have benefited at all from p2p DMA. Anything that does benefit from using p2p DMA is unsuitable for running with just one PCIe lane per GPU.


PCIe P2P still has to go up to a central hub thing and back because PCIe is not a bus. That central hub thing is made by very few players (most famously PLX Technologies) and it costs a lot.


PCIe p2p transactions that end up routed through the CPU's PCIe root complex still have performance advantages over split transactions using the CPU's DRAM as an intermediate buffer. Separate PCIe switches are not necessary except when the CPU doesn't support routing p2p transactions, which IIRC was not a problem on anything more mainstream than IBM POWER.


Maybe not strictly necessary, but a separate PCIe backplane just for P2P bandwidth bypasses the topology and bottleneck mess[1][2] of the PC platform altogether and might be useful. I suspect this was the original premise for NVLink too.

1: https://assets.hardwarezone.com/img/2023/09/pre-meteror-lake...

2: https://www.gigabyte.com/FileUpload/Global/MicroSite/579/inn...


I take it this is mostly useful for compute workloads, neural networks, LLM and the like -- not for actual graphics rendering?


yes


For very large models, the weights may not fit on one GPU.

Also, sometimes having more than one GPU enables larger batch sizes if each GPU can only hold the activations for perhaps one or two training examples.

There is definitely a performance hit, but a GPU<->GPU peer transfer has lower latency than GPU->CPU->software context switch->GPU.

For "normal" pytorch training, the training is generally streamed through the GPU. The model does a batch training step on one batch while the next one is being loaded, and the transfer time is usually less than than the time it takes to do the forward and backward passes through the batch.

For multi-GPU training there are various data-parallel and model-parallel topologies, and there are ways of mitigating latency by interleaving some operations so you don't take the full hit, but multi-GPU training is definitely not perfectly parallel. It is almost required for some large models, and sometimes having a mildly larger batch helps training convergence speed enough to overcome the latency hit on each batch.
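As a toy example of the model-parallel side (hypothetical layer sizes; assumes two CUDA GPUs): put half the network on each card and hop the activations across at the boundary, which is exactly the kind of transfer that benefits from P2P.

    import torch
    import torch.nn as nn

    class TwoGPUNet(nn.Module):
        def __init__(self):
            super().__init__()
            # first half of the model lives on GPU 0, second half on GPU 1
            self.stage0 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
            self.stage1 = nn.Linear(4096, 10).to("cuda:1")

        def forward(self, x):
            x = self.stage0(x.to("cuda:0"))
            # activation hop from GPU 0 to GPU 1 (a peer copy when P2P is enabled)
            return self.stage1(x.to("cuda:1"))

    model = TwoGPUNet()
    out = model(torch.randn(8, 4096))
    out.sum().backward()   # autograd routes gradients back across the same boundary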


Peer to peer as in one PCIe slot directly to another without going through the CPU/RAM, not peer to peer as in one PC to another over the network port.


Yea. It’s one less hop through slow memory.


PCIe busses are like a tree with “hubs” (really switches).

Imagine you have a PC with a PCIe x16 interface which is attached to a PCIe switch that has four x16 downstream ports, each attached to a GPU. Those GPUs are capable of moving data in and out of their PCIe interfaces at full speed.

If you wanted to transfer data from GPU0 and 1 to GPU2 and 3, you have basically 2 options:

- Have GPU0 and 1 move their data to CPU DRAM, then have GPU2 and 3 fetch it

- Have GPU0 and 1 write their data directly to GPU2 and 3 through the switch they’re connected to without ever going up to the CPU at all

In this case, option 2 is better both because it avoids the extra copy to CPU DRAM and also because it avoids the bottleneck of two GPUs trying to push x16 worth of data up through the CPU's single x16 port. This is known as peer to peer.

There are some other scenarios where the data still must go up to the CPU port and back due to ACS, and this is still technically P2P, but doesn’t avoid the bottleneck like routing through the switch would.
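A rough micro-benchmark of those two options might look like this (a sketch only; assumes PyTorch, at least two CUDA GPUs, and an arbitrary ~256 MB tensor; the actual numbers depend entirely on the topology described above).

    import time
    import torch

    def sync_all():
        torch.cuda.synchronize("cuda:0")
        torch.cuda.synchronize("cuda:1")

    n = 64 * 1024 * 1024                        # 64M float32 values, ~256 MB
    src = torch.randn(n, device="cuda:0")
    dst = torch.empty(n, device="cuda:1")
    staging = torch.empty(n, pin_memory=True)   # pinned host buffer for option 1

    def bench(fn, iters=10):
        fn(); sync_all()                        # warm-up
        t0 = time.perf_counter()
        for _ in range(iters):
            fn()
        sync_all()
        return (time.perf_counter() - t0) / iters

    # Option 1: bounce through CPU DRAM (GPU0 -> host -> GPU1)
    via_host = bench(lambda: dst.copy_(staging.copy_(src)))
    # Option 2: direct copy (a peer-to-peer transfer when the platform allows it)
    direct = bench(lambda: dst.copy_(src))

    print(f"via host RAM: {via_host * 1e3:.1f} ms   direct: {direct * 1e3:.1f} ms")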


this would be directly over the memory bus right? I think it's just always going to be faster like this if you can do it?


There aren't really any buses in modern computers. It's all point-to-point messaging. You can think of a computer as a distributed system in a way.

PCI has a shared address space which usually includes system memory (memory-mapped I/O). There's a second, smaller shared address space dedicated to I/O, mostly used to retain compatibility with PC standards developed by the ancients.

But yeah, I'd expect to typically have better throughput and latency with peer-to-peer communication than peer to system RAM to peer. Depending on details, it might not always be better though; distributed systems are complex, and sometimes adding a separate buffer between peers can help things greatly.


Yes, networking is similarly pointless.



