> You could also crowd source this, some distributed application where everyone puts their home machines towards training.
That's how Leela Chess Zero (Lc0) replicated AlphaZero's performance.
In fact, this is not that difficult, assuming you have the means to orchestrate it: each worker loads the current weights and a batch, runs backprop, and submits the resulting gradients to a central system, which aggregates the gradient updates, applies them to the network, and pushes out new weights (a bit like how Bitcoin creates a new block).
This is no different from gradient accumulation, just "distributed". In fact, the system could offload a large number of batches, because the returned update is O(1) in space with respect to the batch size n; it's just that the constant hidden in the O(1) is the size of the network.
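Here's a minimal sketch of that worker/aggregator loop in PyTorch. The function names (`worker_step`, `aggregate_and_update`) and the tiny linear model are hypothetical stand-ins; a real volunteer-computing setup like Lc0's training pipeline involves much more (authentication, fault tolerance, stale-gradient handling), but the core data flow is this simple:

```python
# Hypothetical sketch of distributed gradient accumulation: workers return
# gradients (size of the network, independent of batch size), the central
# server averages them and broadcasts new weights.
import torch
import torch.nn as nn

def worker_step(state_dict, batch):
    """Run on a volunteer's machine: load current weights, compute gradients
    for one batch, and return only the gradients."""
    model = nn.Linear(10, 1)  # stand-in for the real network
    model.load_state_dict(state_dict)
    x, y = batch
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    return {name: p.grad.clone() for name, p in model.named_parameters()}

def aggregate_and_update(state_dict, grad_list, lr=0.01):
    """Run on the central server: average submitted gradients and apply one
    SGD step, producing the next set of weights to broadcast."""
    new_state = {k: v.clone() for k, v in state_dict.items()}
    for name in new_state:
        avg_grad = torch.stack([g[name] for g in grad_list]).mean(dim=0)
        new_state[name] -= lr * avg_grad
    return new_state

# Toy round: two "volunteers" each process one batch, the server merges them.
center = nn.Linear(10, 1).state_dict()
batches = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(2)]
grads = [worker_step(center, b) for b in batches]
center = aggregate_and_update(center, grads)
```

Note that what crosses the network in each direction is only the weights and the gradients, never the batches themselves, which is why the per-worker communication cost doesn't grow with how many samples each worker chews through.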