Slide 36 compares the TPU with a CPU/GPU. This is an apples-to-oranges comparison. One uses an 8-bit integer multiply while the other uses a 32-bit floating-point multiply, which inherently uses more than 4X the energy [1]. If you scale the TPU's figure by 4X to account for that, it is no longer an order of magnitude better. The proper comparison would be between the TPU and an equivalent DSP doing 8-bit computations. That would show whether eliminating the energy consumed by register file accesses is significant. I suspect most of the energy saving comes from having a huge on-chip memory.

[1] From slide 21

Function                  Energy (pJ)
8-bit add                 0.03
32-bit add                0.1
FP multiply, 16-bit       1.1
FP multiply, 32-bit       3.7
L1 cache access           10
L2 cache access           20
L3 cache access           100
Off-chip DRAM access      1,300-2,600
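
To make the argument concrete, here is a rough Python sketch that works through the slide 21 figures quoted above. The energy values are copied from that list; the function names and the inputs in the example (a claimed 10X advantage and a 4X precision penalty) are my own assumptions for illustration, not numbers taken from slide 36. Note that the list has no entry for an 8-bit integer multiply, so the sketch cannot compute the integer versus floating-point multiply ratio directly.

```python
# Energy per operation in picojoules, as quoted from slide 21.
ENERGY_PJ = {
    "8-bit add": 0.03,
    "32-bit add": 0.1,
    "FP multiply 16-bit": 1.1,
    "FP multiply 32-bit": 3.7,
    "L1 cache access": 10.0,
    "L2 cache access": 20.0,
    "L3 cache access": 100.0,
    "Off-chip DRAM access (low)": 1300.0,
    "Off-chip DRAM access (high)": 2600.0,
}


def ratio(numerator, denominator):
    """How many times more energy `numerator` costs than `denominator`."""
    return ENERGY_PJ[numerator] / ENERGY_PJ[denominator]


def normalized_advantage(claimed_advantage, precision_penalty):
    """Advantage that remains once cheaper low-precision arithmetic is discounted.

    Both arguments are hypothetical inputs chosen by the caller.
    """
    return claimed_advantage / precision_penalty


if __name__ == "__main__":
    # Cost of wider arithmetic, using only entries present in the list.
    print("32-bit add vs 8-bit add:",
          round(ratio("32-bit add", "8-bit add"), 2))
    print("FP32 multiply vs FP16 multiply:",
          round(ratio("FP multiply 32-bit", "FP multiply 16-bit"), 2))

    # Memory accesses compared with a single 32-bit FP multiply.
    for level in ("L1 cache access", "L3 cache access",
                  "Off-chip DRAM access (low)"):
        print(level, "vs FP32 multiply:",
              round(ratio(level, "FP multiply 32-bit"), 1))

    # Assumed inputs: a 10X claimed advantage and a 4X precision penalty.
    print("Advantage after discounting precision:",
          normalized_advantage(10.0, 4.0))
```

Two things stand out when this is run. Dividing an assumed order-of-magnitude advantage by the 4X precision penalty leaves a much smaller gap, and even an L1 cache access costs several times more than a 32-bit FP multiply, while a DRAM access costs hundreds of times more. That is why I suspect the memory hierarchy, not the arithmetic, accounts for most of the difference.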