All valid points, but I disagree with the conclusion, for several reasons:
* First of all, the GPT-3 authors successfully trained a model with 175 billion parameters. I mean, 175 billion. The previous largest model in the literature, Google’s T5, had "only" 11 billion. Models with trillions of weights are suddenly looking... achievable. That's a significant experimental accomplishment.
* Second, the model achieves competitive results on many NLP tasks and benchmarks without finetuning, using only a context window of text for instructions and input. There is only unsupervised (i.e., autoregressive) pretraining. AFAIK, this is the first paper that has reported a model doing this. It's a significant experimental accomplishment that points to a future in which general-purpose NLP models could be used for novel tasks without requiring additional training from the get-go.
* Finally, the model’s text generation fools human beings without having to cherry-pick examples. AFAIK, this is the first paper that has reported a model doing this. It's another significant experimental accomplishment.
More generally, I find that some AI researchers and practitioners with strong theoretical backgrounds tend to dismiss this kind of paper as "merely" engineering. I think this tendency is misguided. We must build giant machines and gather experimental evidence from them -- akin to physicists who build giant high-energy particle colliders to gather experimental evidence from them.
I'm reminded of Rich Sutton's essay, "The Bitter Lesson:"
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
> practitioners with strong theoretical backgrounds tend to dismiss this kind of paper as "merely" engineering
This paper implements an architecture that will be out of reach for me for about 5 years. So, I ask myself "why will this paper matter in 5 years?"
There are two reasons I can imagine:
1. It shows that there is no phase change in the size-performance trend already documented over many orders of magnitude.
2. It was used as input data for pruning or distillation algorithms rooted in a better understanding of why language models work.
If the NLP community remains laser focused on hill-climbing compute-agnostic benchmarks, I don't think there will be enough people working on 2.
If 1 is the only reason it matters, I struggle to see how it is worth the cost in high-end research talent. It feels like standard low-risk corporate iteration.
3. First evidence that performance continues to improve at hundreds of billions of weights -- paving the way for trillions of weights, approaching the order of magnitude of the human brain connectome.
4. First evidence (AFAIK) that larger NLP models do not need task-specific finetuning -- paving the way for general-purpose models that work well on any NLP task without additional training.
5. First evidence (AFAIK) that larger NLP models fool human beings without cherry-picking -- paving the way for models that can pass ever more challenging Turing tests.
I think the larger models get, the more incentive there is for researchers to look into pruning/distilling them for practical use.
GPT-1, 2, and 3 have all shown that larger is better. While in the short term this means people will simply throw larger and larger clusters at the problem, in the longer term there needs to be innovation in making it more efficient on the clusters we have (as even the cloud has limits).
I think sheer parameter count is an important part of the equation in general intelligence, so it's important that there are labs that work on scaling up promising leads to trillions of parameters on top of labs thinking of new promising directions.
> More generally, I find that some AI researchers and practitioners with strong theoretical backgrounds tend to dismiss this kind of paper as "merely" engineering.
It's just that this kind of work is more interesting as a general member of the public than as an AI researcher.
As a human being I find it really interesting to see where this kind of model can take us. I was amazed playing with GPT-2 online demos and seeing to what extent it could generate text that looked like what a human could produce. With its quirks and problems, but still impressive. And I can't wait to get my hands on a GPT-3 online demo.
But as an NLP academic researcher (and this is not hypothetical, I'm actually one), what do I learn from this paper? What importance does it have for my research? Actually very little. You need more than 350 GB to fit the 175B parameters in memory; currently, the largest GPU I can access has 24 GB (and I can access only one of those, which I use to (barely) run BERT-large). The cost of training the model in the cloud is estimated to be $12 million (https://twitter.com/eturner303/status/1266264358771757057). This is a single training run, not including any neural architecture search, bug fixing, etc. So even though my funding situation is not bad at all for an academic researcher, I'm a couple of orders of magnitude away from being able to do anything meaningful with models of this size, and can't expect that to change for at least 8-10 years (by which point, at the pace NLP evolves, this will be ancient history anyway).
On the other hand, you often learn useful ideas from papers that you can apply yourself even without implementing the exact models in the paper, but that's not the case here either. The lesson here is "bigger is better", and I cannot train these enormous models, so there is not really much that I can apply.
So as an academic researcher, there really isn't a lot to do with this apart from shrugging, basically dismissing it, and continuing to do our best with what we have. Which is still useful, at least if we don't want NLP applications to be in the hands of an oligopoly of megacorps and restricted only to the few most economically viable languages.
I disagree. If as a researcher you come up with a new kind of model that (1) performs better than small-scale transformers given the same computational budget, and (2) improves its performance as you increase the computation budget (model size), quite a few people will be curious to see how your model would perform at greater and greater scale.
The 24GB card mentioned is almost certainly an RTX Titan, which are $3000 each. Just the card. Second, training frameworks like Megatron can distribute to multiple GPUs in the same computer as if they were on different machines, but the naive trainer is greatly helped by NVLink in order to actually pool the memory and greatly improve throughput, which means V100s, which are $5000 each. (Also, people use Linux for ML.)
Most universities have access to supercomputers, including GPU clusters. But that's not the point, not every NLP problem requires experimenting with 175B parameter models.
Academic researchers shouldn't try to compete with Google or OpenAI in scaling up models. They should try to come up with new approaches. Our brains have been evolving under tight constraints (size, energy, noise, etc). Maybe a good academic problem to solve is "how can I do what GPT-3 does if I only have an 8 GPU workstation?" This might lead to all kinds of breakthroughs.
I don't know why this reminds you of Sutton's essay. Sutton claims that algorithmic work will be superseded by brute-force approaches powered by exponential growth in computing power. But this paper is not the result of exponential growth in computing power. It's the result of $1b worth of Azure credits provided for free by Microsoft.
Rich Sutton is a great scientist but he is fooled by randomness. He initiated his research program just as Moore's law was taking off. Thanks to Moore, his approach saw incredible success and brought him deserved acclaim. But just as Moore's law is pulling the rug from under him, he is using his stature to claim that no other approach but his can work.
You have misunderstood Sutton's argument.
Quoting Rich:
> One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great.
The point isn't that improvements in our algorithms are unnecessary or unhelpful, but rather that the algorithms we should focus on will be capable of scaling with arbitrary amounts of compute/data.
Such as, for example, neural networks, where we see an almost constant rate of improvement (for the appropriate architecture) as more resources are added.
I think this argument is neither here nor there. For computer vision problems, we use convnets, which are models inspired by a biological model of vision. By doing that we are embedding our preconceived notions of what vision is into our models instead of throwing compute and data at the problem. Earlier attempts using multi-layer perceptrons have been massive failures. Is this consistent with Sutton's analysis or contrary to it?
Rich used to be very bullish on neural nets, then somewhat dismissive of them (due to the fragility/inadequacy of FCNs), and then increasingly enthusiastic as the renewed interest demonstrated that those problems could be overcome -- e.g., through better initialization, training, and (as you note) different architecture choices.
His main concern was whether a method could keep working as more resources became available, as otherwise you would tautologically end up with something short of true artificial intelligence.
The important thing is that the technique can scale with increasing data or compute without hitting a hard or soft limit.
> More generally, I find that some AI researchers and practitioners with strong theoretical backgrounds tend to dismiss this kind of paper as "merely" engineering. I think this tendency is misguided. We must build giant machines and gather experimental evidence from them -- akin to physicists who build giant high-energy particle colliders to gather experimental evidence from them.
The analogy is a bit off to me. As far as I can tell, there was significant impetus from within particle physics to commit a huge amount of resources and political effort toward verifying theories with experiment. I don't see anything similar in deep learning, because in this case the "theory" is mostly that "bigger is probably better". I think that idea is pretty uncontroversial for stuff like this. And if the work reduces to marshaling enough resources, what exactly is it?
We should give OpenAI some credit for doing the damn thing, but as it is, the result kind of seems like an answer to a question that people weren't really asking.
> I'm reminded of Rich Sutton's essay, "The Bitter Lesson:"
Moore's law is running on fumes at this point. The complexity of further scaling has reached geopolitical proportions. We need to get back to looking at more creative models in both the software and hardware domains.
> the model achieves competitive results on many NLP tasks and benchmarks without finetuning
The article dismisses this result with the following analogy:
“No, my 10-year-old math prodigy hasn’t proven any new theorems, but she can get a perfect score on the math SAT in under 10 minutes. Isn’t that groundbreaking?”
And I tend to agree. We've had a game of benchmarking brinkmanship for a while now. At what point are we going to see some groundbreaking applications?
> We need to get back to looking at more creative models in both the software and hardware domains.
That would be pretty foolish, given the fact that every hand crafted model eventually gets surpassed with brute force. A better use of time would be tackling whatever you mean by "complexity of further scaling has reached geopolitical proportions". I'm not a fan of it, as it is terribly inelegant, but denying the years of consistent brute force wins would just be silly.
The best strategy for any nation with an interest in AI (be it economic or something much more skynety) would be securing two things very quickly: fabrication capacity and nuclear power, because this stuff is going to be measured in megawatts, not ANN layers. Improving the efficiency of that conversion would certainly be helpful, but history has shown that to be a lower priority; just take a look at how ridiculously deep software stacks are compared to 20 years ago. I really wish the linguists had been proven right in the 1970s...
> That would be pretty foolish, given the fact that every hand crafted model
Who said anything about hand crafted AI models? I’m talking about revisiting our models of computation. Moore’s law has long made it impossible to challenge the dominance of Von Neumann. Perhaps what we need to make further progress is some sort of decentralized, busless computer? Who knows?
If that was the first time you'd seen mention of that essay, then you'd be forgiven for not knowing that it always precedes discussion of hand crafting vs brute force. I've never seen it answered with a call for exotic computation, so my mistake. I've seen plenty of energy requirement estimates related to defeating various cryptographic algorithms, using spherical cows, etc. You'll need a lot more than an architectural change to make a noticeable dent; you'll need the discovery and industrialization of new physics -- not the sort of thing you want to hang your hopes on.
I never said otherwise. Like physicists, we need both new theory (for new insights, ideas, models, etc.) and new experiments (for replication, performance, scalability, etc.)
Also, Moore's Law is running on fumes, yes, but there's quite a bit of R&D focused on coming up with hardware that massively scales up (e.g., by more efficiently parallelizing) the dense and sparse multiply-sum operations common to so many AI models. I think Sutton's point about models that leverage computation is spot-on.
> It's a significant experimental accomplishment that points to a future in which general-purpose NLP models could be used for novel tasks without requiring additional training from the get-go.
This premise is still purely science fiction. This model touches on neither novel tasks nor freedom from pretraining (unless I misunderstand). But overall, I think you're right: it's significant for a number of reasons.
GPT-3 was pretrained on five datasets (Common Crawl, WebText2, Books1, Books2, and Wikipedia; see table 2.2), and then used on previously unseen tasks (Q&A, translation, cloze, etc.) without finetuning, i.e., weights were not updated after the original (autoregressive) pretraining. This promises a possible future in which general-purpose models are pretrained once, and deployed to production for multiple tasks.
This is pretty mind-boggling, to the point of suspecting an error in the methodology. If it is just a language model, then it has no baked-in notion of test-time tasks. How on earth does a language model know what is required of it at test time without fine-tuning? How does it know that the test-time prompt contains examples of the task, and not some story prompt it's supposed to riff off in random ways?
For each task, the authors feed a context window of text with zero to a few sample queries and responses, followed by a query without the response. The model generates a response for the last query. BTW, this approach is analogous to what you would do with a human being: you would provide zero to a few sample questions and answers, and then ask a question.
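For concreteness, here is a minimal sketch of what that context window looks like (the helper below and the `generate` call in the final comment are hypothetical stand-ins, not the paper's actual evaluation code): the task is conveyed entirely as text in the prompt, with no weight updates.

```python
# Minimal sketch of few-shot prompting (hypothetical helper, not the paper's
# evaluation code). The task is conveyed purely as text: an instruction,
# K worked examples, and one final unanswered query for the model to complete.

def build_few_shot_prompt(instruction, examples, query):
    """Concatenate an instruction, K solved examples, and one open query."""
    lines = [instruction, ""]
    for q, a in examples:            # K = len(examples); K = 0 is the zero-shot case
        lines += [f"Q: {q}", f"A: {a}", ""]
    lines += [f"Q: {query}", "A:"]   # the model is asked to continue from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    instruction="Translate English to French.",
    examples=[("cheese", "fromage"), ("sea otter", "loutre de mer")],
    query="peppermint",
)
print(prompt)
# completion = model.generate(prompt)   # hypothetical call; no finetuning involved
```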
Think about the waves of fake news to come. Soon every corporation (and political group) in America will have a data center dedicated to overloading social media with messaging that benefits them. At least right now it's not so hard to spot the bots.
I am having some trouble seeing the benefits in light of the mess that is coming.
I'm an engineer and this is a boring paper. They didn't conclusively beat SOTA the way BERT/T5 beat basically everything. Sure, this is "unsupervised", but from an engineering perspective that's entirely uninteresting, so we don't even get actual scaling numbers. This paper was written to pretend it is more than it is, which is basically the hallmark of everything OpenAI does.
> Transformers are extremely interesting. And this is about the least interesting transformer paper one can imagine in 2020.
Because it's not a transformer paper.
This paper's goal was to see how far an increase in compute can continue to deliver an increase in model performance.
There is no better way to study this than to take a very well known architecture and keep it as unchanged as possible; otherwise it becomes very hard to know what is due to the increased size of the model and what is due to the tweaks you make.
So yes, it's a disappointing paper if you expect it to be on a different topic than what it is.
> “GPT-3″ is just a bigger GPT-2. In other words, it’s a straightforward generalization of the “just make the transformers bigger” approach
Yes, it's true. But there is a difference between what's interesting and what works. Deep learning (RNNs, transformers, etc.) is usually old ideas applied at large scale with slight modifications. Proving a model works well at large scale (175B parameters) is a great contribution and measures our progress towards AI.
From the article:
> One of their experiments, “Learning and Using Novel Words,” strikes me as more remarkable than most of the others and the paper’s lack of focus on it confuses me.
This sort of "learning" is not necessarily real learning and it's not new for GPT-3. Even reduced GPT-2 willingly used made-up terms from the prompt in its results:
Search the article for 'Now I will feed it the same thing, but with a bunch of made-up terms.' It has some examples of how that stuff worked.
I've already posted this in the original discussion of the GPT-3 paper and I will post it again: statements about whether some system "learns new words" or "does math" require hypothesis formulation and testing. It astounds me that many people in the ML community not only don't do these sorts of things, but even actively oppose the very idea that they are necessary.
Recently there was a great live-stream from DarkHorse talking about this problem in science in general:
My biggest problem with GPT3 is that it's not going to be accessible (practically speaking) to the general public. There's been a recent push to democratize this type of work with libraries like Huggingface transformers, but models this large will force the benefits of this work back into the ivory tower.
Meanwhile, a revolutionary paper that brought a new paradigm to NLP for the first successful time (latent variational autoencoders), and that destroys GPT-3 on text perplexity on the Penn Treebank (4.6 vs. 20) with orders of magnitude fewer parameters, is talked about nowhere on the web...
The bound is completely invalid, as are the NLL/PPL numbers they report with the MELBO. Look at the equation. If they optimized it directly, it would be trivially driven to 0 by the identity function if we used a latent space equivalent to the input space. The MELBO just adds noiseless autoencoder reconstruction error to a constant offset equal to log of the test set size. This can be driven to zero by evaluating an average bound over test sets of size 1.
The mathematical/conceptual error is that they are assuming each test point is added to the "post-hoc aggregated" prior when they evaluate the bound. This is analogous to including a test point in the training set. Another version of this error would be adding a kernel centered on each test point to a kernel density estimator prior to evaluating test set NLL. In this case, obviously the best kernel has variance 0 and assigns arbitrarily high likelihood to the test data.
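To make the kernel-density analogy concrete, here is the degenerate case written out (my own illustration, not an equation from either paper): if the density includes a Gaussian kernel centered on each test point, the reported test likelihood blows up as the bandwidth shrinks.

```latex
% KDE analogy (illustrative only): a mixture over N points that places a
% Gaussian kernel of bandwidth \sigma on every point, including test point x_i,
% satisfies
\[
\hat{p}(x_i) \;\ge\; \frac{1}{N}\,\mathcal{N}\!\left(x_i \mid x_i,\, \sigma^2 I\right)
            \;=\; \frac{1}{N\,(2\pi\sigma^2)^{d/2}}
            \;\xrightarrow[\;\sigma \to 0\;]{}\; \infty,
\]
% so the test NLL can be made arbitrarily good without the model generalizing
% to held-out data at all.
```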
It’s only been a few months. If it really does robustly get 4.6 perplexity on PTB (less than I ever thought was possible) then it will receive its due recognition, at the very least via the Hutter prize.
I think using AEs for text generation is a good idea, and a pretty old one (I tried it myself back in 2015 without particularly good results), but I wouldn't call it a breakthrough. To me a breakthrough/new paradigm would be something like this: https://arxiv.org/abs/1906.05317
The breakthrough is not that they are the first to use AEs for text. It is that they are the first to eliminate the bottleneck that prevented AEs from beating transformers.
Your linked paper seems interesting on paper, but does it bring any new SOTA?
Why do I feel this paper is so important? The best summary is the benchmarks:
* First place at question answering (on the Yahoo task)
* First place on language modeling for the Penn Treebank (by FAR)
So it is a totally new model that will probably keep evolving and being applied to more and more kinds of NLP tasks. And it seems that it can take first place on most NLP tasks; it's empirically the breakthrough of the year.
It achieves this while having an extremely small number of parameters, which shows:
* The model is smarter
* The model has room for more parameters, hence even more accuracy!
Finally, theoretically it is a breakthrough as it is a port of a computer vision technology (variational autoencoders) to the NLP world.
Actually it might be the successor to the Transformer paradigm.
I wonder what pile of incremental improvements researchers will be able to bring to it, like they have for the Transformer paradigm (SpanBERT, XLNet, etc.).
Not all models scale up well with more parameters.
VAEs are not really exclusively a vision thing, they have been used in a variety of settings. Using VAEs for NLP is also nothing new, an early example is Bowman et al, 2015.
What kind of bottleneck are you talking about? VAEs as such are really in a different category from transformers. They are primarily a tool to get to more structured latent spaces, which is not something transformers are good at in the first place.
The perplexity numbers are for different tasks. MIM is encoding a sentence into a latent variable and then reconstructing it, and achieves PTB perplexity 4.6. GPT-2 is generating the sentence from scratch, which will on average have higher perplexity numbers. I agree that PTB perplexity 4.6 on autoregressive language modeling would be a huge result.
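For reference, here is the distinction written out (my own summary, not a formula from either paper): the language-model number conditions only on the preceding words, while the reconstruction number also conditions on a latent code of the whole sentence, so the two perplexities measure different tasks.

```latex
% Autoregressive (language-model) perplexity over N test tokens:
\[
\mathrm{PPL}_{\mathrm{LM}} = \exp\!\Big( -\tfrac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_{<i}) \Big)
\]
% Reconstruction-style perplexity, where the decoder also sees a latent code z
% that already encodes the full sentence:
\[
\mathrm{PPL}_{\mathrm{rec}} = \exp\!\Big( -\tfrac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid z,\, w_{<i}) \Big)
\]
% Conditioning on z makes the prediction task far easier, so PPL_rec is not
% directly comparable to PPL_LM.
```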
What does GPT mean?
I assume it is not about partition tables (GUID Partition Table); it has something to do with NLP, but beyond that it is hard to find out what this acronym means.
It would be cool if there was a platform to crowdsource compute resources to train stuff like this, so that regular people (without 7-figure budgets) can have access to these models, which are becoming increasingly out of reach for the general public.
Here is a recent paper (disclaimer: I am the first author) named "Learning@home" which proposes something along these lines. Basically, we develop a system that allows you to train a network with thousands of "experts" distributed across hundreds or more consumer-grade PCs. You don't have to fit 700GB of parameters on a single machine, and there is significantly less network delay than with synchronous model-parallel training. The only thing you sacrifice is the guarantee that all the batches will be processed by all required experts.
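As a toy illustration of that trade-off (my own sketch with made-up names, not the Learning@home API): requests are sent to remote "experts" with a timeout, and any expert that does not respond in time is simply skipped for that batch.

```python
# Toy sketch of asynchronous expert routing (hypothetical names, not the
# Learning@home API): each "expert" lives on some remote worker; we call the
# ones we need with a timeout and silently drop any that are too slow,
# trading strict synchronization for fault tolerance.
import random
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def remote_expert(expert_id, batch):
    """Stand-in for a forward pass on a remote consumer-grade PC."""
    time.sleep(random.uniform(0.01, 0.3))        # simulated network + compute delay
    return [x * (expert_id + 1) for x in batch]  # dummy computation

def route_batch(batch, expert_ids, timeout=0.1):
    """Query the selected experts; keep whichever results arrive in time."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(expert_ids)) as pool:
        futures = {eid: pool.submit(remote_expert, eid, batch) for eid in expert_ids}
        for eid, fut in futures.items():
            try:
                results[eid] = fut.result(timeout=timeout)
            except TimeoutError:
                pass                              # this expert missed the batch
    return results

outputs = route_batch([1.0, 2.0, 3.0], expert_ids=[0, 1, 2, 3])
print(f"{len(outputs)}/4 experts responded in time")
```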
Newbie question: If/when models the size of GPT-3 are released to the general public, will average people be able to run them on their PCs, as they can with GPT-2? Or will that basically be impossible now without expensive specialty hardware?
This one uses FP16, so you just need to have a server with >350GB of RAM. 512GB of DDR4 would set you back around two grand. A total cost of a server for this would probably be under $5k. Comparable to a good gaming rig.
It uses FP16 yes, but the question was about average people running them on their PCs. I don't think most PCs have fp16 support, so you'd have to do it in fp32, doubling the size. It's likely not so fast on a CPU either with that size, especially when using FP32.
The ALUs (FPUs) in most CPUs are 64 bit (even more than that internally), but this does not matter, because we don't care how many bits our floats take inside the CPU, we care about how much space they take in our server's RAM. From our point of view, we supply weights and inputs to the CPU (both in FP16), CPU multiplies them (using 64 bit multipliers), and then spits out the result, which is cast to FP16, and that's what gets stored in memory.
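A quick back-of-the-envelope check of the numbers in this sub-thread (plain arithmetic, not measured figures):

```python
# Back-of-the-envelope footprint for just storing 175B weights
# (parameters only; activations and other buffers are extra).
params = 175e9

bytes_fp16 = params * 2   # 2 bytes per FP16 weight
bytes_fp32 = params * 4   # 4 bytes per FP32 weight

print(f"FP16: {bytes_fp16 / 1e9:,.0f} GB")  # ~350 GB
print(f"FP32: {bytes_fp32 / 1e9:,.0f} GB")  # ~700 GB
```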
GPT-2 takes 500ms per word on our benchmarks on a Xeon 4114, compared to 15ms on a Titan RTX. So the answer is technically yes, practically no, but why would you?
> it would represent a kind of non-linguistic general intelligence ability which would be remarkable to find in a language model
As a relative outsider to this field, I don’t really see the stark line between natural language and general intelligence implied by this statement. Language is just abstractions encoded in symbols, and general intelligence is just the ability to construct and manipulate abstractions. Seems reasonable to think that these are two sides of the same coin.
Put another way, natural language is the product of general intelligence.
I think the line you'd see is that there exists some task where the language-based model suddenly lacks the ability to perform the task despite the fact that it "should."
I'd conjecture that this might include something like describing where places are in relation to each other, and asking it to describe a route. (Not an NLP expert, but work with AI folks; this task chosen as an example because it seems like something you'd want a planner for rather than anything MLful.)
I think the main disappointment is that we humans aren't that special when a brute-forced scalable transformer is getting into our ballpark. We have also recently seen how OpenAI + MS were able to use a GPT variation for automated text-description-to-Python-code generation, and utilizing something like GPT-3 in that task might render many swengs obsolete fairly soon.
I could not disagree more with this post. To summarize what the author is unhappy with:
1) "It’s another big jump in the number, but the underlying architecture hasn’t changed much... it’s pretty annoying and misleading to call it “GPT-3.” GPT-2 was (arguably) a fundamental advance, because it demonstrated the power of way bigger transformers when people didn’t know about that power. Now everyone knows, so it’s the furthest thing from a fundamental advance."
2) "The “zero-shot” learning they demonstrated in the paper – stuff like “adding tl;dr after a text and treating GPT-2′s continuation thereafter as a ‘summary’” – were weird and goofy and not the way anyone would want to do these things in practice... They do better with one task example than zero (the GPT-2 paper used zero), but otherwise it’s a pretty flat line; evidently there is not too much progressive “learning as you go” here."
3) "Coercing it to do well on standard benchmarks was valuable (to me) only as a flamboyant, semi-comedic way of pointing this out, kind of like showing off one’s artistic talent by painting (but not painting especially well) with just one’s non-dominant hand."
4) "On Abstract reasoning..So, if we’re mostly seeing #1 here, this is not a good demo of few-shot learning the way the authors think it is."
---------
My response:
1) The fact that we can get so much improvement out of something so "mundane" should be cause for celebration, rather than disappointment. It means that we have found general methods that scale well and a straightforward recipe for brute-forcing our way through problems we haven't solved before.
At this point it becomes not a question of possibility, but of engineering investment. Isn't that the dream of an AI researcher? To find something that works so well you can stop "innovating" on the math stuff?
2) Are we reading the same plot? I see an improvement after >16 shot.
I believe the point of that setup is to illustrate the fact that any model trained to make sequential decisions can be regarded as "learning to learn", because the arbitrary computation in between sequential decisions can incorporate "adaptive feedback". It blurs the semantics between "task learning" and "instance learning"
3) This is a fair point actually, and perhaps now that models are doing better (no thanks to people who spurn big compute), we should propose better metrics to capture general language understanding.
4) It's certainly possible, but you come off as pretty confident for someone who hasn't tried running the model and trying to test its abilities.
Who is the author, anyway? Are they capable of building systems like GPT-3?
I think we should not compare whether anybody is capable or not. Here most of the work was done by Azure engineers, and none of them are mentioned in the paper. So no, even OpenAI can't do it without Azure infrastructure.