Neural network training and inference primarily use matrix multiplication and addition, usually at lower bit depths. You can do this on GPUs to great success with massive parallelism, but a GPU is more general-purpose than that, so it takes more silicon and more power. With a TPU/neural processor you optimize the silicon for one very, very specific problem: multiplying large matrices and then adding the result to another matrix. On a GPU we decompose this into a large number of scalar calculations, and it parallelizes massively and does a good job; on a TPU we just feed it the matrices, because very large matrix operations are all it's made to do, with so much silicon dedicated to them that it often does the whole multiply-accumulate as effectively a single operation.
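A minimal sketch of that "three matrices in, one result out" operation, written in JAX (which compiles to TPUs via XLA). The shapes, dtype, and function name here are illustrative assumptions, not anything specific to the original comment:

```python
# Sketch of the fused multiply-accumulate described above: D = A @ B + C.
# The 128x128 shapes and bfloat16 dtype are illustrative choices.
import jax
import jax.numpy as jnp

@jax.jit  # XLA can fuse the matmul and the add into one compiled operation
def matmul_accumulate(a, b, c):
    return jnp.dot(a, b) + c

key = jax.random.PRNGKey(0)
ka, kb, kc = jax.random.split(key, 3)
a = jax.random.normal(ka, (128, 128), dtype=jnp.bfloat16)
b = jax.random.normal(kb, (128, 128), dtype=jnp.bfloat16)
c = jax.random.normal(kc, (128, 128), dtype=jnp.bfloat16)

d = matmul_accumulate(a, b, c)  # feed it three matrices, get the result back
```

The point isn't the code itself but that, on this kind of hardware, the whole expression maps onto the matrix unit as one operation rather than being broken into many scalar calculations.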
Another comment mentioned cores, and I don't think that's a good way of looking at it; in most ways a TPU is back to very "few" but hyper-specialized "cores". There is essentially no parallelism exposed to the programmer in a TPU or neural processor: you feed it three matrices (two to multiply, one to accumulate into) and it gives you the result. Then you move on to the next one.