They used research credits, and even that aside, with their code and training tips, you can redo it for $50k on cloud instances or less on dedicated hardware + patience. And look at ImageNet training progress: you can train a near-SOTA ImageNet CNN in like a minute for $20-40 after a lot of optimization work. We've already seen a lot of improvements in LMs over the past 2 years... (For example, the main barrier to training GPT-2 is just the bloody memory use from the Transformers exploding at runtime, which pushes you into high-end hardware like cloud TPUs on GCP. Do Sparse Transformers fix that?)
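To put a rough number on that memory blowup, here's a back-of-envelope sketch; the layer/head/context counts are the published 345M-model ones, and the batch size of 8 and fp32 activations are my assumptions:

```python
# Back-of-envelope: why dense-attention GPT-2 training eats memory.
# Rough numbers for the 345M model (24 layers, 16 heads, 1024-token context);
# batch size and fp32 are assumptions, adjust for your setup.
layers, heads, ctx, batch = 24, 16, 1024, 8
bytes_per = 4  # fp32 activations, kept around for backprop

attn_matrices = batch * layers * heads * ctx * ctx * bytes_per
print(f"attention maps alone: {attn_matrices / 2**30:.1f} GiB")
# ~12 GiB just for the softmax(QK^T) activations, before weights,
# optimizer state, or any other activations -- hence TPUs / big GPUs.
```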
OK, I exaggerated a little because I was recalling from memory: the old fast.ai approach actually takes <18 minutes (https://www.fast.ai/2018/08/10/fastai-diu-imagenet/). My bad. (I'm sure it's improved since then, but I don't know by how much.) I was also thinking of https://myrtle.ai/how-to-train-your-resnet-8-bag-of-tricks/, which does CIFAR-10 in 26s, but I'm not sure offhand what CIFAR-10's SOTA looks like, so I don't know how far away that is.
Actually, this is still very good. Thanks for the links. I'll be timing some of these tricks tomorrow for my ImageNet experiments. By the way, I believe this is the current SOTA for ImageNet: https://arxiv.org/abs/1905.11946 (84% top-1 / 97% top-5). CIFAR-10 appears to be essentially solved (99%).
Hm, maybe. It depends on how easy their training code is to use and how long retraining would take. It will presumably take at least a week, since the 345M model took about a week, but I'm not sure I want to spend the money on a week of a very large cloud instance (which would be what, $300?) for what is probably a substantial but not stunning improvement in generation quality.
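For what it's worth, that $300 guess is just week-of-compute arithmetic; the hourly rate below is my assumption, not a quote for any particular instance:

```python
# Rough cost of a week of continuous training on one large cloud instance.
# The hourly rate is assumed -- plug in whatever instance/TPU you'd actually use.
hours = 7 * 24
hourly_rate = 1.80  # USD/hour, assumed
print(f"~${hours * hourly_rate:,.0f} for a week of training")  # ~$300
```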
I might rather wait for the next leap, from something like a Sparse Transformer approach, which can get global coherence by having a lookback over the entire poem, or from a better poetry corpus with delimited poems (rather than entire books).
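The appeal of the Sparse Transformer here, as I understand it, is that the factorized attention patterns cut per-layer attention cost from roughly n² to n·√n, which is what would make whole-poem lookback affordable. A purely illustrative comparison, not their actual implementation:

```python
# Dense attention scales as n^2 per layer; the strided/fixed patterns from
# Child et al. 2019 attend to roughly n*sqrt(n) positions. Illustrative only.
import math

for n in (1024, 4096, 16384):          # context lengths
    dense = n * n
    sparse = n * int(math.sqrt(n))     # ~n*sqrt(n) attended positions
    print(f"n={n:>6}: dense {dense:>12,} vs sparse ~{sparse:>11,} "
          f"({dense / sparse:.0f}x fewer entries)")
```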
You're off by an order of magnitude, and you're omitting the caveats to that cost estimate. From the article:
> The cost of training the model from scratch using our code is about $50k. It’s important to note this figure is the estimated value of the cloud compute, and does not reflect the much smaller intrinsic costs involved
They edited the article after I left a comment there. The original text stated they spent $500k to run all the hyperparameter-search experiments needed to replicate OpenAI's results; only after they did all that work can you run their code for $50k.
They removed that info for some reason: it's still $50k per training run, but they initially said they spent $500k total on the experiments it took to get there.