Slide 36 compares the TPU with a CPU/GPU. This is an apples-to-oranges comparison: one uses an 8-bit integer multiply, while the other uses a 32-bit floating-point multiply, which inherently uses at least 4X more energy [1]. If you scale the TPU figure by 4 to account for this, it is no longer an order of magnitude better. The proper comparison would be between the TPU and an equivalent DSP doing 8-bit computations.
That would show whether eliminating the energy consumed by register file accesses is significant. I suspect most of the energy saving comes from having a huge on-chip memory.
[1] From slide 21:
Function                  Energy (pJ)
8-bit add                 0.03
32-bit add                0.1
FP multiply, 16-bit       1.1
FP multiply, 32-bit       3.7
Register file             *6
L1 cache access           10
L2 cache access           20
L3 cache access           100
Off-chip DRAM access      1,300-2,600
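
To make the ratios behind this argument easier to check, here is a rough Python sketch that only divides the slide 21 figures by each other. The names ENERGY_PJ and ratio are mine, not from the slides. I am assuming the "*" in front of the register file value is a footnote marker, and I use the lower bound of the DRAM range. The table has no 8-bit multiply entry, so this does not verify the 4X claim directly; it only shows how memory accesses compare with arithmetic.

```python
# Energy per operation in picojoules, copied from the slide 21 table above.
ENERGY_PJ = {
    "8-bit add": 0.03,
    "32-bit add": 0.1,
    "FP multiply 16-bit": 1.1,
    "FP multiply 32-bit": 3.7,
    "register file": 6.0,           # listed as "*6"; asterisk assumed to be a footnote marker
    "L1 cache access": 10.0,
    "L2 cache access": 20.0,
    "L3 cache access": 100.0,
    "off-chip DRAM access": 1300.0,  # lower bound of the 1,300-2,600 range
}


def ratio(a: str, b: str) -> float:
    """Return how many times more energy operation a uses than operation b."""
    return ENERGY_PJ[a] / ENERGY_PJ[b]


if __name__ == "__main__":
    pairs = [
        ("FP multiply 32-bit", "FP multiply 16-bit"),
        ("32-bit add", "8-bit add"),
        ("register file", "FP multiply 32-bit"),
        ("register file", "8-bit add"),
        ("off-chip DRAM access", "FP multiply 32-bit"),
    ]
    for a, b in pairs:
        print(f"{a} / {b}: {ratio(a, b):.1f}x")
```

On these numbers a single register file access costs more than a 32-bit FP multiply, and a DRAM access costs hundreds of times more, which is why I think the memory side matters more than the multiplier width.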