They used research credits, and even that aside, with their code and training tips, you can redo it for $50k on cloud instances or less on dedicated hardware + patience. And look at ImageNet training progress: you can train a near-SOTA ImageNet CNN in like a minute for $20-40 after a lot of optimization work. We've already seen a lot of improvements in LMs over the past 2 years... (For example, the main barrier to training GPT-2 is just the bloody memory use from the Transformers exploding at runtime, which pushes you into high-end hardware like cloud TPUs on GCP. Do Sparse Transformers fix that?)
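To put a rough number on that memory blowup, here's a back-of-envelope sketch; the layer/head/context counts are the published 345M-model ones, and the batch size of 8 and fp32 activations are my assumptions:

```python
# Back-of-envelope: why dense-attention GPT-2 training eats memory.
# Rough numbers for the 345M model (24 layers, 16 heads, 1024-token context);
# batch size and fp32 are assumptions, adjust for your setup.
layers, heads, ctx, batch = 24, 16, 1024, 8
bytes_per = 4  # fp32 activations, kept around for backprop

attn_matrices = batch * layers * heads * ctx * ctx * bytes_per
print(f"attention maps alone: {attn_matrices / 2**30:.1f} GiB")
# ~12 GiB just for the softmax(QK^T) activations, before weights,
# optimizer state, or any other activations -- hence TPUs / big GPUs.
```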
OK, I exaggerated a little because I was recalling from memory: the old fast.ai approach actually takes <18 minutes (https://www.fast.ai/2018/08/10/fastai-diu-imagenet/). My bad. (I'm sure it's improved since then, but I don't know by how much.) I was also thinking of https://myrtle.ai/how-to-train-your-resnet-8-bag-of-tricks/, which does CIFAR-10 in 26s, but I'm not sure offhand what CIFAR-10's SOTA looks like, so I don't know how far away that is.
Actually, this is still very good. Thanks for the links. I'll be timing some of these tricks tomorrow for my ImageNet experiments. By the way, I believe this is the current SOTA for ImageNet: https://arxiv.org/abs/1905.11946 (84% top-1 / 97% top-5). CIFAR-10 appears to be essentially solved (99%).
Hm, maybe. It depends on how easy their training code is to use and how long retraining would take. It will presumably take at least a week, since the 345M model took about a week, but I'm not sure I want to spend the money on a week of a very large cloud instance (which would be what, $300?) for what is probably a substantial but not stunning improvement in generation quality.
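For what it's worth, that $300 guess is just week-of-compute arithmetic; the hourly rate below is my assumption, not a quote for any particular instance:

```python
# Rough cost of a week of continuous training on one large cloud instance.
# The hourly rate is assumed -- plug in whatever instance/TPU you'd actually use.
hours = 7 * 24
hourly_rate = 1.80  # USD/hour, assumed
print(f"~${hours * hourly_rate:,.0f} for a week of training")  # ~$300
```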
I might rather wait for the next leap, from something like a Sparse Transformer approach, which can get global coherence by having a lookback over the entire poem, or from a better poetry corpus with delimited poems (rather than entire books).
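The appeal of the Sparse Transformer here, as I understand it, is that the factorized attention patterns cut per-layer attention cost from roughly n² to n·√n, which is what would make whole-poem lookback affordable. A purely illustrative comparison, not their actual implementation:

```python
# Dense attention scales as n^2 per layer; the strided/fixed patterns from
# Child et al. 2019 attend to roughly n*sqrt(n) positions. Illustrative only.
import math

for n in (1024, 4096, 16384):          # context lengths
    dense = n * n
    sparse = n * int(math.sqrt(n))     # ~n*sqrt(n) attended positions
    print(f"n={n:>6}: dense {dense:>12,} vs sparse ~{sparse:>11,} "
          f"({dense / sparse:.0f}x fewer entries)")
```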
You're off by an order of magnitude, and you're omitting the caveats to that cost estimate. From the article:
> The cost of training the model from scratch using our code is about $50k. It’s important to note this figure is the estimated value of the cloud compute, and does not reflect the much smaller intrinsic costs involved
They edited the article after I left a comment there. The original text stated they spent $500k to run all the hyperparameter-search experiments needed to replicate OpenAI's results; only after they did all that work can you run their code for $50k.
They removed that info for some reason: it's still $50k per training run, but they initially said they spent $500k total on the experiments it took to get there.