This is super cool, but it's worth pointing out that it still relies on supervised learning from a large dataset of human players, and that it's not a general learning algorithm (there are a whole bunch of problem-specific aspects to the model). The lack of learning via RL (trial and error, as humans do) or self-play is kind of disappointing.
Don't get me wrong, it's an impressive advance, but just as with AlphaGo it's important not to overgeneralize what this means. I would not be surprised if a lot of people jump to talking about what this means for AGI, but with this learning paradigm it's still pretty limited in applicability.
Yes. I was disappointed to find that they needed a huge labeled dataset of Diplomacy games to train the language model, and even then it still generated a lot of nonsense (as usual for language models), which they had to invent 16 other ad-hoc models to filter out. It's super cool that they got it to work, but it's nothing like a general method for communicating and collaborating with humans on any task.
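To make the "generate candidates, then filter the nonsense" point concrete, here's a toy sketch of what that kind of pipeline looks like. None of these names or filters come from the paper; they're made-up stand-ins just to show why a stack of ad-hoc checks ends up wrapped around the language model:

```python
# Hypothetical generate-then-filter pipeline (illustrative only, not the paper's API).
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Candidate:
    text: str
    score: float  # e.g. LM log-probability

# Each filter is just a predicate over a candidate message plus some game context.
Filter = Callable[[Candidate, dict], bool]

def nonsense_filter(c: Candidate, ctx: dict) -> bool:
    # Reject messages that mention places not actually on the board.
    return all(tok not in ctx["invalid_provinces"] for tok in c.text.split())

def intent_filter(c: Candidate, ctx: dict) -> bool:
    # Reject messages that don't mention the move the agent actually plans to play.
    return ctx["planned_move"] in c.text

def pick_message(candidates: List[Candidate],
                 filters: List[Filter],
                 ctx: dict) -> Optional[Candidate]:
    """Keep the highest-scoring candidate that survives every filter."""
    survivors = [c for c in candidates if all(f(c, ctx) for f in filters)]
    return max(survivors, key=lambda c: c.score) if survivors else None

# Toy usage with made-up data:
ctx = {"invalid_provinces": {"Atlantis"}, "planned_move": "A PAR - BUR"}
cands = [Candidate("I will move A PAR - BUR, will you support?", -1.2),
         Candidate("Let's meet in Atlantis", -0.7)]
print(pick_message(cands, [nonsense_filter, intent_filter], ctx))
```

The point is that every failure mode of the generator tends to grow its own filter, which is exactly the kind of task-specific engineering the parent comment is complaining about.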
Hopefully there will be follow-up work to increase the generality by reducing the amount of labeled data and task-specific tweaking required, similar to the progression of AlphaGo->AlphaGo Zero->AlphaZero->MuZero.
Eh, it does learn from self-play via RL. One section of the paper is literally titled "Self-play reinforcement learning for improved value estimation". Granted, that's only a small part of the entire system.
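For anyone unsure what "self-play for value estimation" means in practice, here's a tiny self-contained toy showing the general idea. It has nothing to do with Diplomacy or the paper's actual method (which is more sophisticated); it's just a coin-flip race where Monte Carlo returns from self-play games are averaged into a tabular value estimate:

```python
# Toy Monte Carlo value estimation from self-play rollouts (not the paper's algorithm).
import random
from collections import defaultdict

def play_game(policy):
    """Self-play: both sides use the same policy; return visited states and the outcome."""
    state, visited = 0, []
    while abs(state) < 3:              # terminal at +3 (win) or -3 (loss)
        visited.append(state)
        state += policy(state)
    return visited, 1.0 if state >= 3 else 0.0

def random_policy(state):
    return random.choice([-1, 1])

# Tabular value estimate, improved by averaging returns across many self-play games.
value = defaultdict(float)
counts = defaultdict(int)
for _ in range(10_000):
    visited, outcome = play_game(random_policy)
    for s in visited:
        counts[s] += 1
        value[s] += (outcome - value[s]) / counts[s]   # running mean of returns

print({s: round(v, 2) for s, v in sorted(value.items())})
```

In the real system the value estimate feeds back into planning, but the basic loop of "play against yourself, then fit the value function to what happened" is the same shape.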