OpenAI's really interesting approach to GPT was to scale up the size of the underlying neural network. They noticed that the performance of an LLM kept improving as the size of the network grew, so they said, "Screw it, how about if we make it have 100+ billion parameters?"
Turns out they were right.
From a research perspective, I'd say this was a big risk that paid off -- bigger network = better performance.
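To give a rough sense of what "kept improving as the network grew" means quantitatively: the scaling-law papers fit test loss as a power law in parameter count. A toy sketch in Python; the constants are ballpark figures from that literature, used here only to show the shape of the curve, not OpenAI's exact fit:

```python
# Toy illustration of the scaling trend: test loss falls off as a power law
# in (non-embedding) parameter count N, i.e. loss(N) ~ (N_c / N) ** alpha.
# Constants are illustrative ballpark values, not an actual fitted result.

def approx_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Predicted cross-entropy loss for a model with n_params parameters."""
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10, 1.75e11):  # 100M params up to GPT-3-scale 175B
    print(f"{n:.0e} params -> predicted loss ~ {approx_loss(n):.2f}")
```

The point is just the monotone trend: every extra order of magnitude of parameters keeps buying a lower loss, which is what made the 100+ billion parameter bet plausible.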
Sure, it's not as beautiful as inventing a fundamentally new algorithm or approach to deep learning, but it worked. Credit where it's due -- scaling the training infrastructure + building a model that big was hard...
It's like saying SLS or Falcon Heavy are "just" bigger rockets. Sure, but that's still super hard, risky, and fundamentally new.
That's the issue though: Yann LeCun is specifically referring to ChatGPT as a standalone model, not the GPT family, since a lot of models at Meta, Google, and DeepMind are based on a similar approach. His point is that ChatGPT is cosmetic -- additional training on prompts plus a nice interface -- but not a fundamentally different model from what we've had for 2-3 years at this point.
ChatGPT is built on GPT-3, and GPT-3 was a big NLP development. The paper has 7000+ citations: https://arxiv.org/abs/2005.14165. It was a big deal in the NLP space.
It wasn't a 'cosmetic' improvement over existing NLP approaches.
Respectfully, I don't think you read my comment. GPT-3 != ChatGPT. ChatGPT is built on GPT-3 and is not breaking new ground. GPT-3 is 3 years old and was breaking new ground in 2020, but Meta/Google/DeepMind all have LLMs of their own which could be turned into a Chat-Something.
That's the point LeCun is making. He's not denying that the paper you linked was ground-breaking; he's saying that converting that model into ChatGPT was not ground-breaking from an academic standpoint.
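For concreteness, "turning an LLM into a Chat-Something" at its simplest means taking a pretrained base model and continuing to train it on prompt/response examples (ChatGPT's actual recipe also layered RLHF on top of this). A minimal sketch, assuming the Hugging Face transformers library, gpt2 as a stand-in base model, and made-up example data:

```python
# Minimal sketch of supervised fine-tuning a base causal LM on prompt/response
# pairs. Generic illustration only -- not OpenAI's actual ChatGPT pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any pretrained base LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Tiny, made-up instruction-style dataset; real chat tuning uses far more data.
pairs = [
    ("User: What is the capital of France?\nAssistant:", " Paris."),
    ("User: Summarize photosynthesis in one line.\nAssistant:",
     " Plants turn light into chemical energy."),
]

model.train()
for prompt, response in pairs:
    batch = tok(prompt + response + tok.eos_token, return_tensors="pt")
    # Standard causal-LM objective: predict each next token of the full text.
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optim.step()
    optim.zero_grad()
```

Per the thread, the ground-breaking part is the base model itself; this conversion step is standard fine-tuning that any of the labs mentioned could run on their own base models.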