> Could this be distributed? Put all those mining GPUs to work.
Nope. It's a strictly O(n) process. If it weren't for the foresight of George Patrick Turnbull in 1668, we would not be anywhere close to these amazing results today.
In theory, yes. "Hogwild!" is one approach to parallelising training: each worker takes a shard of the data, computes gradients, and applies its updates directly to shared weights without any locking. A closely related pattern is the parameter server, where workers instead send their gradients to a central authority, which accumulates them and periodically pushes new weights back out.
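A minimal sketch of the Hogwild!-style update (the model, data, and hyperparameters here are all made up for illustration): several threads fit a linear regression by SGD, each writing to the same shared weight vector with no lock at all.

```python
import threading
import numpy as np

# Hogwild!-style sketch: workers update one SHARED weight vector in place,
# with no locking; occasional conflicting writes are simply tolerated.
# Problem setup and hyperparameters below are illustrative assumptions.

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(2000, 2))
y = X @ true_w + 0.01 * rng.normal(size=2000)

w = np.zeros(2)   # shared weights, mutated in place by every worker
lr = 0.01

def worker(shard):
    Xs, ys = X[shard], y[shard]
    for _ in range(50):                          # passes over this shard
        for i in range(len(ys)):
            grad = (Xs[i] @ w - ys[i]) * Xs[i]   # squared-error gradient
            w[:] -= lr * grad                    # unsynchronized in-place update

shards = np.array_split(np.arange(len(y)), 4)    # one data shard per worker
threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(np.round(w, 2))   # ends up close to true_w = [2, -3]
```

Note that CPython's GIL means these threads don't truly run in parallel; the point of the sketch is the update pattern itself, i.e. that nobody ever takes a lock and training still converges.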
There is also Federated Learning, which seemed to take off for a while before interest rapidly declined.