
The recent ML-focused chips are really just high-efficiency, low-precision matrix multipliers. It turns out that operation can be made ~10x more efficient than on previous processor designs, and matrix multiplication is the bottleneck in inference for modern neural network models. Specialized hardware can also accelerate training somewhat.
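
To make that concrete, here's a minimal sketch of the kind of int8 matrix multiply these accelerators implement in hardware. It's a numpy stand-in: the shapes and scale factors are made up, and real chips use more elaborate quantization schemes, but the core operation is the same.

    # Sketch of a low-precision matmul: quantize to int8, multiply
    # with int32 accumulation, rescale back to float. Scales/shapes
    # here are illustrative, not any particular chip's scheme.
    import numpy as np

    def quantize(x, scale):
        # Simple symmetric quantization of floats to int8.
        return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 128)).astype(np.float32)  # "weights"
    x = rng.standard_normal(128).astype(np.float32)        # "activations"

    w_scale, x_scale = 0.05, 0.05
    Wq, xq = quantize(W, w_scale), quantize(x, x_scale)

    # int8 x int8 multiply-accumulate in int32 -- the op the NPU
    # implements in hardware instead of float32 FMAs.
    acc = Wq.astype(np.int32) @ xq.astype(np.int32)

    # Rescale the integer accumulator back to float.
    y = acc.astype(np.float32) * (w_scale * x_scale)

    # Quantization error vs. the full-precision result:
    print(np.max(np.abs(y - W @ x)))

The efficiency win comes from the inner loop being int8 multiplies with int32 accumulation, which cost far less silicon area and energy per operation than float32 arithmetic.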

So having this hardware in your phone means that, for example, face recognition runs faster, and scene analysis during video recording drains less battery.

But neural network inference speed is rarely what forces models to run on servers or user data to be collected. In your example of speech recognition, the problem is that training state-of-the-art models requires tens of thousands of hours of speech data; a recent Amazon paper actually used a million hours. For language models you need even more. Modern phones actually can perform speech recognition locally (put your phone in airplane mode, dictation still works), but those local models were trained on data from users who did talk to the server.
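
For a sense of scale, a back-of-the-envelope calculation, assuming 16 kHz, 16-bit mono PCM (a common speech format; the comment doesn't say what Amazon actually used):

    # Rough size of a million hours of raw speech audio.
    hours = 1_000_000
    seconds = hours * 3600
    bytes_total = seconds * 16_000 * 2   # samples/sec * bytes/sample
    print(bytes_total / 1e12, "TB")      # ~115 TB of raw audio

Collecting and storing data at that scale is what pushes the training side to servers; the inference compute itself fits on the phone.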


