They used research credits, and even setting that aside, with their code and training tips you could redo it for $50k on cloud instances, or less on dedicated hardware plus patience. And look at the progress on ImageNet training: after a lot of optimization work, you can train a near-SOTA ImageNet CNN in like a minute for $20-40. We've already seen a lot of improvements in LMs over the past 2 years... (For example, the main barrier to retraining GPT-2 is just the bloody memory use, with the Transformer's attention exploding at runtime, which pushes you onto high-end hardware like cloud TPUs on GCP. Do Sparse Transformers fix that?)
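To make the "memory exploding" point concrete, here's a rough back-of-envelope; the layer/head counts are roughly the 345M model's, and fp32 with no activation checkpointing is my assumption:

```python
# Back-of-envelope for why dense Transformer attention blows up memory at long
# context. Hyperparameters are roughly GPT-2 345M's (24 layers, 16 heads,
# context 1024); fp32 and no activation checkpointing are assumptions.

def dense_attention_bytes(seq_len, n_layers=24, n_heads=16, bytes_per_float=4, batch=1):
    # Memory for the [batch, heads, seq, seq] attention matrices alone,
    # summed over layers; gradients and other activations come on top.
    return batch * n_layers * n_heads * seq_len * seq_len * bytes_per_float

for seq in (1024, 2048, 4096):
    gib = dense_attention_bytes(seq) / 2**30
    print(f"seq={seq}: ~{gib:.1f} GiB of attention matrices per example")
# seq=1024: ~1.5 GiB, seq=2048: ~6.0 GiB, seq=4096: ~24.0 GiB -- quadratic in
# context length. Sparse Transformers attend to roughly O(n*sqrt(n)) positions
# instead, which is where the memory savings would come from.
```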
OK, I exaggerated a little because I was recalling from memory: the old fast.ai approach actually takes <18 minutes (https://www.fast.ai/2018/08/10/fastai-diu-imagenet/). My bad. (I'm sure it's improved since then, but I don't know by how much.) I was also thinking of https://myrtle.ai/how-to-train-your-resnet-8-bag-of-tricks/, which does CIFAR-10 in 26s, but I'm not sure offhand what CIFAR-10's SOTA looks like, so I'm not sure how far away that is.
Actually, this is still very good. Thanks for the links. I'll be timing some of these tricks tomorrow for my ImageNet experiments. By the way, I believe this is the current SOTA for ImageNet: https://arxiv.org/abs/1905.11946 (EfficientNet, ~84% top-1 / 97% top-5). CIFAR-10 appears to be essentially solved (99%).
Hm, maybe. It depends on how easy their training code is to use and how long retraining would take. It will presumably take at least a week, since the 345M model took about a week, but I'm not sure I want to spend the money on a week of a very large cloud instance (which would be what, $300?) for what is probably a substantial but not stunning improvement in generation quality.
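For what it's worth, here's how the $300 guess pencils out; the per-hour rates are my assumptions (roughly a preemptible single accelerator vs. an on-demand 8-GPU box):

```python
# Sanity check on the "$300 for a week" guess. The per-hour rates are assumed:
# ~$1.8/hr is in the ballpark of a preemptible single GPU or TPU, ~$20/hr of an
# on-demand 8-GPU instance.
hours = 7 * 24
for rate in (1.8, 2.5, 20.0):
    print(f"${rate}/hr x {hours} h = ${rate * hours:,.0f}")
# -> $302, $420, $3,360: the estimate only holds if training fits on a single
# preemptible accelerator for the whole week.
```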
I might rather wait for the next leap, from something like a Sparse Transformer approach, which could get global coherence by attending over the entire poem, or from a better poetry corpus with delimited poems (rather than entire books).
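To be concrete about "delimited poems", a minimal sketch of the corpus prep I have in mind, assuming one poem per .txt file and using GPT-2's <|endoftext|> string as the separator (the directory layout and filenames are just for illustration):

```python
# Minimal sketch of the "delimited poems" corpus: join individual poems with an
# explicit end-of-text marker so the model sees poem boundaries rather than raw
# book-length text. The <|endoftext|> string matches GPT-2's tokenizer; the
# one-poem-per-file layout and paths are assumptions.
from pathlib import Path

DELIM = "<|endoftext|>"

def build_corpus(poem_dir, out_file="poetry_corpus.txt"):
    poems = [p.read_text(encoding="utf-8").strip()
             for p in sorted(Path(poem_dir).glob("*.txt"))]
    poems = [p for p in poems if p]  # drop empty files
    # One poem per block, separated by the delimiter, so sampling can stop
    # cleanly at a poem boundary instead of drifting into the next text.
    Path(out_file).write_text(f"\n{DELIM}\n".join(poems) + f"\n{DELIM}\n",
                              encoding="utf-8")

# build_corpus("poems/")  # hypothetical directory of one-poem-per-file .txt files
```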