I wonder, is a community-trained model feasible? As in, get a few tens of thousands of devs to run a SETI@home-type app on their GPUs during the night, and at the end you get access to the 175B trained model. If it cost $5M to train (though IIRC that estimate was at cloud GPU prices), then if you're using spare capacity you're only paying for electricity.
Fun idea. GPT @ Home :D. Scatter of the inputs would be very cheap as they are tiny LongTensors (sequences of indices), but the Gather of the gradients seems like a bottleneck. These models can be quite large. Maybe each worker only communicates back some sparse or potentially precision-reduced gradients? In the limit, I recall papers that were able to only communicate one bit per dimension. May also be possible to further reduce the number of weights by weight sharing, or e.g. with HyperNetworks.
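(Not from the thread, just to make the 1-bit idea concrete: a minimal numpy sketch of sign-based gradient compression with a local error-feedback residual, roughly in the spirit of the 1-bit-per-dimension papers mentioned above. All names are made up for illustration, and a real system would pack the signs into actual bits before sending.)

```python
import numpy as np

def compress_1bit(grad, residual):
    """Quantize a gradient to its sign (1 bit/dim), carrying the
    quantization error forward in a local residual buffer."""
    corrected = grad + residual           # add back the error from the last step
    scale = np.abs(corrected).mean()      # a single float sets the magnitude
    signs = np.sign(corrected)            # +/-1 per dimension -> 1 bit each if packed
    residual = corrected - scale * signs  # remember what we threw away
    return signs, scale, residual

def decompress_1bit(signs, scale):
    """Server-side reconstruction of the approximate gradient."""
    return scale * signs

# toy usage: worker compresses, server reconstructs
rng = np.random.default_rng(0)
grad = rng.normal(size=1_000_000).astype(np.float32)
residual = np.zeros_like(grad)
signs, scale, residual = compress_1bit(grad, residual)
approx = decompress_1bit(signs, scale)
print("bytes if bit-packed:", signs.size // 8, "vs full fp32:", grad.nbytes)
```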
I've long wanted to see a proof-of-work cryptocurrency that does neural network training instead of burning through hashes. Imagine if 0.5% of the planet's energy consumption (9 GW) were used for training neural networks instead of mining Bitcoin! It would also solve the problem of ASICs being 1000x more efficient than GPUs, so everyone could participate. It would incentivise the development of efficient neural network training hardware. Somebody do this already!
It updates the weights using 1-bit-precision updates each iteration.
It would be fairly trivial to go to less than 1 bit of precision too: simply set some threshold (e.g. 3), and wherever the difference between the weight on the server and the client is greater than the threshold, transmit a binary "1", otherwise a binary "0". Then entropy-code the resulting bitstream.
By adjusting the threshold up and down, you trade off the size of the data to send vs. precision.
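(A rough sketch of that thresholding scheme in numpy; the threshold and noise values are arbitrary, and zlib just stands in for a proper entropy coder.)

```python
import numpy as np
import zlib

def threshold_delta_bits(server_w, client_w, threshold=1.0):
    """Flag a 1 wherever the server/client weight difference exceeds the
    threshold, a 0 elsewhere, then entropy-code the resulting bitstream."""
    mask = (np.abs(server_w - client_w) > threshold).astype(np.uint8)
    packed = np.packbits(mask)               # 1 bit per weight
    coded = zlib.compress(packed.tobytes())  # mostly zeros, so it compresses well below 1 bit/weight
    return mask, coded

# toy usage: only a few weights have drifted beyond the threshold
rng = np.random.default_rng(1)
server_w = rng.normal(size=1_000_000).astype(np.float32)
client_w = server_w + rng.normal(scale=0.5, size=server_w.shape).astype(np.float32)
mask, coded = threshold_delta_bits(server_w, client_w)
print("flagged weights:", int(mask.sum()))
print("coded bytes:", len(coded), "-> bits/weight:", 8 * len(coded) / mask.size)
```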
Efficiency and resource cost are the big question though. Spare capacity isn't free: you only avoid paying for electricity and part wear on the hardware you don't actually use, and home computers or workstations may not be as efficient at a training run as a task-specific setup. AI@home might end up costing more, and increasing the model's footprint more, than doing it all in one place.
Part of the magic really needed is finding simpler ways to achieve the same levels of model robustness.
Sure, part wear is relevant, but I feel that most parts worldwide get chucked way before they're worn out. Electrical efficiency is probably quite a bit worse though. Although you could possibly find an opportunity in maximizing the load in regions that have surplus energy and/or renewable sources.