This is super cool, but it's worth pointing out that it still relies on supervised learning from a large dataset of human players, and that it's not a general learning algorithm (there are a whole bunch of problem-specific aspects to the model). The lack of learning via RL (trial and error, as humans do) or self-play is kind of disappointing.
Don't get me wrong, it's an impressive advance, but just as with AlphaGo it's important not to overgeneralize what this means. I would not be surprised if a lot of people jump to talking about what this means for AGI, but with this learning paradigm it's still pretty limited in applicability.
Yes. I was disappointed to find that they needed a huge labeled dataset of Diplomacy games to train the language model, and even then it still generated a lot of nonsense (as usual for language models), which they had to invent 16 other ad-hoc models to filter out. It's super cool that they got it to work, but it's nothing like a general method for communicating and collaborating with humans on any task.
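To make the "generate candidates, then filter the nonsense" point concrete, here's a toy sketch of what that kind of pipeline looks like. None of these names or filters come from the paper; they're made-up stand-ins just to show why a stack of ad-hoc checks ends up wrapped around the language model:

```python
# Hypothetical generate-then-filter pipeline (illustrative only, not the paper's API).
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Candidate:
    text: str
    score: float  # e.g. LM log-probability

# Each filter is just a predicate over a candidate message plus some game context.
Filter = Callable[[Candidate, dict], bool]

def nonsense_filter(c: Candidate, ctx: dict) -> bool:
    # Reject messages that mention places not actually on the board.
    return all(tok not in ctx["invalid_provinces"] for tok in c.text.split())

def intent_filter(c: Candidate, ctx: dict) -> bool:
    # Reject messages that don't mention the move the agent actually plans to play.
    return ctx["planned_move"] in c.text

def pick_message(candidates: List[Candidate],
                 filters: List[Filter],
                 ctx: dict) -> Optional[Candidate]:
    """Keep the highest-scoring candidate that survives every filter."""
    survivors = [c for c in candidates if all(f(c, ctx) for f in filters)]
    return max(survivors, key=lambda c: c.score) if survivors else None

# Toy usage with made-up data:
ctx = {"invalid_provinces": {"Atlantis"}, "planned_move": "A PAR - BUR"}
cands = [Candidate("I will move A PAR - BUR, will you support?", -1.2),
         Candidate("Let's meet in Atlantis", -0.7)]
print(pick_message(cands, [nonsense_filter, intent_filter], ctx))
```

The point is that every failure mode of the generator tends to grow its own filter, which is exactly the kind of task-specific engineering the parent comment is complaining about.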
Hopefully there will be follow-up work to increase the generality by reducing the amount of labeled data and task-specific tweaking required, similar to the progression of AlphaGo->AlphaGo Zero->AlphaZero->MuZero.
Eh, it does learn from self-play via RL. One section of the paper is literally titled "Self-play reinforcement learning for improved value estimation". Granted, that's only a small part of the entire system.
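For anyone unsure what "self-play for value estimation" means in practice, here's a tiny self-contained toy showing the general idea. It has nothing to do with Diplomacy or the paper's actual method (which is more sophisticated); it's just a coin-flip race where Monte Carlo returns from self-play games are averaged into a tabular value estimate:

```python
# Toy Monte Carlo value estimation from self-play rollouts (not the paper's algorithm).
import random
from collections import defaultdict

def play_game(policy):
    """Self-play: both sides use the same policy; return visited states and the outcome."""
    state, visited = 0, []
    while abs(state) < 3:              # terminal at +3 (win) or -3 (loss)
        visited.append(state)
        state += policy(state)
    return visited, 1.0 if state >= 3 else 0.0

def random_policy(state):
    return random.choice([-1, 1])

# Tabular value estimate, improved by averaging returns across many self-play games.
value = defaultdict(float)
counts = defaultdict(int)
for _ in range(10_000):
    visited, outcome = play_game(random_policy)
    for s in visited:
        counts[s] += 1
        value[s] += (outcome - value[s]) / counts[s]   # running mean of returns

print({s: round(v, 2) for s, v in sorted(value.items())})
```

In the real system the value estimate feeds back into planning, but the basic loop of "play against yourself, then fit the value function to what happened" is the same shape.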