It really does. You've got to remember that a good SotA paper takes hundreds of training runs, at least.
I can't go into detail about budgets, but suffice to say if you think $1M is a university compute budget that lets you be a competitive research team on the cutting edge, you are __severely__ underestimating the amount of compute that leading corporate researchers are using. Orders of magnitude off.
On-prem is good for a while, until you're 18 months into your 3-year purchase cycle, still on K80s while the major research leaders are running V100s and TPUs, and you can't even fit the SotA model in your GPUs' memory any more.
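To make the memory point concrete, here's a rough sketch (my own numbers, not anything from this thread) of why a big model stops fitting on older cards: fp32 weights plus Adam-style optimizer state alone can blow past a K80's ~12 GB per GPU, before you even count activations.

```python
# Rough sketch, not exact: GPU memory needed just for weights + optimizer state.
# Assumes fp32 weights and Adam-style state (~3x extra on top of the weights);
# activations and workspace memory only make the picture worse.

def training_memory_gb(n_params, bytes_per_param=4, optimizer_multiplier=3):
    """Approximate GB of GPU memory for parameters plus optimizer state."""
    return n_params * bytes_per_param * (1 + optimizer_multiplier) / 1e9

for name, n_params in [("345M-param model", 345e6), ("1.5B-param model", 1.5e9)]:
    need = training_memory_gb(n_params)
    print(f"{name}: ~{need:.0f} GB of state "
          f"(a K80 has ~12 GB per GPU die, a V100 has 16-32 GB)")
```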
Longer training times can mean weeks or even months for one experiment - that slow iteration speed makes it very hard to stay on the cutting edge.
And this is before considering things like neural architecture search and internet-scale image/video/speech datasets, where costs skyrocket.
The boundary between corporate research and academia is incredibly porous and a big part of that is the cost of research (compute, but also things like data labelling and staffing ML talent).
Your goalposts just moved by a few figures. Furthermore, $1 million+ was not a university compute budget - that was money for a single lab on campus (at a general state school, no less) on a specific project.
You still have yet to provide any concrete sources to back up your claims. We're talking about contributing to research here. If multi-million-dollar training jobs are what it takes to be at the cutting edge, you should be able to provide ample sources for that claim.
- "Some of the models are so big that even in MILA we can’t run them because we don’t have the infrastructure for that. Only a few companies can run these very big models they’re talking about" [1]. NOTE: MILA is a very good AI research center and, while I don't know too much about him, that person being quoted has great credentials so I would generally trust them.
- "the current version of OpenAI Five has consumed 800 petaflop/s-days" [2].
- Check out the Green AI paper. They have good numbers on the amount of compute it takes to train a model, and you can translate that into dollar costs - a rough example of that conversion, applied to the figure from [2], follows this list.
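Here's a back-of-envelope sketch turning the 800 petaflop/s-days figure from [2] into GPU-hours and a rough dollar cost. The hardware, utilization and pricing assumptions are mine, not from the cited sources, so treat the output as an order-of-magnitude estimate only.

```python
# Back-of-envelope conversion of petaflop/s-days into GPU-hours and dollars.
# All constants below are my assumptions for illustration, not sourced figures.

PFS_DAYS = 800              # petaflop/s-days quoted in [2] for OpenAI Five
PEAK_TFLOPS_V100 = 125      # assumed V100 peak tensor-core throughput
UTILIZATION = 0.33          # assume roughly a third of peak is sustained
PRICE_PER_GPU_HOUR = 2.50   # assumed on-demand cloud price, USD

sustained_flops = PEAK_TFLOPS_V100 * 1e12 * UTILIZATION   # flops/s actually achieved
total_flops = PFS_DAYS * 1e15 * 86400                     # 1 pf/s-day = 1e15 flops/s for 86,400 s
gpu_hours = total_flops / sustained_flops / 3600

print(f"~{gpu_hours:,.0f} V100-hours, roughly "
      f"${gpu_hours * PRICE_PER_GPU_HOUR:,.0f} at on-demand prices")
```

With these (made-up) assumptions it comes out around half a million GPU-hours and a seven-figure dollar amount for a single headline result, which is in the same ballpark as the budgets being argued about above.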
I'm not an expert in on-prem ML costs, but I know many of the world's best on-prem ML users also use the cloud to handle the variability of their workloads, so I don't think on-prem is a magic bullet cost-wise.
$1M annually per project (vs. per lab) isn't bad at all. It's also way out of whack with what I saw when I was doing AI research in academia, but that was before the deep learning revolution, so what do I know.
Re: the moving goalposts - the distinction is between the cost of a single training run and the cost of a paper-worthy research result. Due to inherent run-to-run variability, architecture search, hyperparameter search and possibly data cleaning work, the total cost is a couple of orders of magnitude more than the cost of one training run (the multiplier will vary a lot by project and lab) - the toy calculation below shows how quickly that compounds.
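As a toy illustration (every count and price below is made up, not from any real project), here's how the per-paper multiplier builds up once you sweep architectures, hyperparameters and seeds:

```python
# Toy illustration of run-count explosion; all numbers are invented for the example.

single_run_cost = 10_000          # assumed cost of one full training run, USD

architecture_variants = 10        # architectures / ablations tried
hparam_configs_per_variant = 20   # hyperparameter settings per architecture
seeds_per_config = 3              # repeats to control run-to-run variance

total_runs = architecture_variants * hparam_configs_per_variant * seeds_per_config
paper_cost = total_runs * single_run_cost

print(f"{total_runs} runs -> ~${paper_cost:,.0f}, "
      f"i.e. {total_runs}x the cost of a single training run")
```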
I understand why you don't trust what I'm saying. I wish I could give hard numbers, but I'm limited in what I can say publicly so this is the best I can do.