The throughput of the PCIe link between the CPU and GPU, is far less than the *a...

The throughput of the PCIe link between the CPU and GPU, is far less than the aggregate throughput of the internal interconnects between neighbouring tensor cores.

Matrix operations might flow a lot of data around — but that data flow is akin to a bunch of individual people travelling along the individual residential streets they live on. There's a lot of movement there, but also a lot of capacity for movement, because there's no bottleneck of everyone needing to go to the same place or come from the same place.

Persisting the data out of the GPU and then loading it back in, is more like all those people commuting to work and then going back home. Big fan-in onto the PCIe "highway" over to the CPU and into RAM; then big fan-out back. Traffic jams for miles.

In the time it takes to restore a 1GB state snapshot from RAM into VRAM, you can probably chew through the equivalent of 1TB or more of intermediate matrix states.