Because they're running out of training data. OpenAI doesn't want to scrape new text off the Internet because they can't tell what is and isn't AI-generated. Training on AI-generated data tends to homogenize the resulting output: patterns overexpressed by one model get picked up and overexpressed by the next.
If they start scraping, training, and generating images and video, then they have lots more data to work with.
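To make the homogenization claim concrete, here's a toy simulation (my own sketch, nothing to do with OpenAI's actual pipeline): fit a trivial "model" to a corpus, regenerate the corpus from that model while overexpressing its most typical outputs, and repeat. Diversity collapses within a few generations:

    import numpy as np

    rng = np.random.default_rng(0)

    # Generation 0: "human" data, with real diversity (std = 1.0).
    data = rng.normal(loc=0.0, scale=1.0, size=1000)

    for gen in range(1, 6):
        # "Train": fit a model to the current corpus (here, just mean/std).
        mu, sigma = data.mean(), data.std()
        # "Generate": sample a new corpus, but keep only the most typical
        # half, standing in for a model overexpressing its common patterns.
        samples = rng.normal(loc=mu, scale=sigma, size=2000)
        data = samples[np.argsort(np.abs(samples - mu))[:1000]]
        print(f"generation {gen}: std = {data.std():.3f}")

Real model collapse is subtler than this caricature, but the mechanism is the same: each generation retrains on a narrower slice of the previous generation's output.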
For multimodal input, okay, I can see the argument. But since, as you said, training on generated data can be almost worse than useless, what's the point of generating it?
If I had these concerns as OpenAI, I'd be pushing hard to regulate and restrict generative image/video models, to push the end of the "low-background data" era as far into the future as possible. I feel like the last thing I'd be doing is productizing those models myself!
> If I had these concerns as OpenAI, I'd be pushing hard to regulate and restrict generative image/video models
They are! And I'm guessing their perspective is that if they can identify their own generated content, they can choose to ignore it and avoid cannibalizing their own training data.
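Something like the sketch below, presumably. The detector function here is entirely hypothetical; there's no public API for this, it's just the shape of the idea:

    def looks_like_our_output(text: str) -> bool:
        # Hypothetical detector: stands in for whatever watermark or
        # statistical-fingerprint check a lab might run on its own outputs.
        # This placeholder just looks for an embedded marker; a real scheme
        # would test token-level statistics, not a literal string.
        return "[llm-watermark]" in text

    def filter_scraped_corpus(docs: list[str]) -> list[str]:
        # Keep only documents the detector doesn't flag, so the next
        # training run (mostly) avoids eating the model's own output.
        return [doc for doc in docs if not looks_like_our_output(doc)]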
That's not a bad guess, but I don't think OpenAI thought that far ahead. They are pushing for regulation, but mainly to screw over competing models, not to buy themselves more data runway. Every capitalist is a temporarily embarrassed feudal aristocrat, after all.
Furthermore, even if OpenAI had a perfect AI/human distinguisher oracle and could train solely on human output, that wouldn't get us superhuman reasoning or generalization performance. The training process they use is to have the machine mimic the textual output of humans. How exactly do you get a superhuman AGI[0] without having text generated by a superhuman AGI to train on?
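To spell that out: the pretraining objective is next-token prediction against human text, and that loss bottoms out at exact mimicry of the data. Nothing in it rewards doing better than the text it's fit to. A schematic of the objective (my sketch, not anyone's actual training code):

    import numpy as np

    def next_token_loss(model_probs: np.ndarray, human_token: int) -> float:
        # Standard cross-entropy: -log p_model(the token the human wrote).
        return float(-np.log(model_probs[human_token]))

    vocab = 10
    human_token = 7

    hedged = np.full(vocab, 1.0 / vocab)  # model unsure what humans write
    mimic = np.zeros(vocab)
    mimic[human_token] = 1.0              # model exactly reproduces the data

    print(next_token_loss(hedged, human_token))  # ~2.30
    print(next_token_loss(mimic, human_token))   # ~0.0: the floor is perfect mimicry

RLHF and similar tune behavior on top of that, but the base signal is still "predict what a human wrote next."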
[0] Note: I'm discounting "can write text faster than a human" as AGI here. printf in a tight loop already does that better than GPT-4o.