They simply throw away a lot of cruft from the die, such as PCIe PHYs and legacy x86 I/O with its large-area analog circuitry.
Redundant, complex DMA and memory controller IPs are thrown away as well.
Clocks and power rails on the SoC probably take up less space too, because more of the circuitry is shared.
Same with self-test, debug, fusing blocks, and other small tidbits.
The apparent power-efficiency gains as PCIe went from 1.0 to 2.0 to 3.0 ... were due to dynamic power control and link sleep.
On top of that, they no longer haul memory nonstop over PCIe, since data going to or from the GPU simply doesn't move anywhere.