Because they're running out of training data. OpenAI doesn't want to scrape new text off the Internet because they can't tell what is and isn't AI-generated. Training on AI-generated data tends to homogenize the resulting output: patterns overexpressed by one model get picked up and overexpressed by the next.
If they start scraping, training, and generating images and video, then they have lots more data to work with.
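To make the homogenization claim concrete, here's a toy simulation (my own sketch, nothing to do with OpenAI's actual pipeline): fit a trivial "model" to a corpus, regenerate the corpus from that model while overexpressing its most typical outputs, and repeat. Diversity collapses within a few generations:

    import numpy as np

    rng = np.random.default_rng(0)

    # Generation 0: "human" data, with real diversity (std = 1.0).
    data = rng.normal(loc=0.0, scale=1.0, size=1000)

    for gen in range(1, 6):
        # "Train": fit a model to the current corpus (here, just mean/std).
        mu, sigma = data.mean(), data.std()
        # "Generate": sample a new corpus, but keep only the most typical
        # half, standing in for a model overexpressing its common patterns.
        samples = rng.normal(loc=mu, scale=sigma, size=2000)
        data = samples[np.argsort(np.abs(samples - mu))[:1000]]
        print(f"generation {gen}: std = {data.std():.3f}")

Real model collapse is subtler than this caricature, but the mechanism is the same: each generation retrains on a narrower slice of the previous generation's output.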
For multimodal input, okay, I can see the argument. But since, as you said, training on generated data can be almost worse than useless, what's the point of generating it?
If I had these concerns as OpenAI, I'd be pushing hard to regulate and restrict generative image/video models, to push the end of the "low-background data" era as far into the future as possible. I feel like the last thing I'd be doing is productizing those models myself!
> If I had these concerns as OpenAI, I'd be pushing hard to regulate and restrict generative image/video models
They are! And I'm guessing their perspective is that if they can identify their own generated content, they can choose to ignore it and avoid cannibalizing their own training data.
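Something like the sketch below, presumably. The detector function here is entirely hypothetical; there's no public API for this, it's just the shape of the idea:

    def looks_like_our_output(text: str) -> bool:
        # Hypothetical detector: stands in for whatever watermark or
        # statistical-fingerprint check a lab might run on its own outputs.
        # This placeholder just looks for an embedded marker; a real scheme
        # would test token-level statistics, not a literal string.
        return "[llm-watermark]" in text

    def filter_scraped_corpus(docs: list[str]) -> list[str]:
        # Keep only documents the detector doesn't flag, so the next
        # training run (mostly) avoids eating the model's own output.
        return [doc for doc in docs if not looks_like_our_output(doc)]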
That's not a bad guess, but I don't think OpenAI thought that far ahead. They are pushing for regulation, but mainly to screw over competing models, not to buy themselves more data runway. Every capitalist is a temporarily embarrassed feudal aristocrat, after all.
Furthermore, even if OpenAI had a perfect AI/human distinguisher oracle and could train solely on human output, that wouldn't get us superhuman reasoning or generalization performance. The training process they use is to have the machine mimic the textual output of humans. How exactly do you get a superhuman AGI[0] without having text generated by a superhuman AGI to train on?
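To spell that out: the pretraining objective is next-token prediction against human text, and that loss bottoms out at exact mimicry of the data. Nothing in it rewards doing better than the text it's fit to. A schematic of the objective (my sketch, not anyone's actual training code):

    import numpy as np

    def next_token_loss(model_probs: np.ndarray, human_token: int) -> float:
        # Standard cross-entropy: -log p_model(the token the human wrote).
        return float(-np.log(model_probs[human_token]))

    vocab = 10
    human_token = 7

    hedged = np.full(vocab, 1.0 / vocab)  # model unsure what humans write
    mimic = np.zeros(vocab)
    mimic[human_token] = 1.0              # model exactly reproduces the data

    print(next_token_loss(hedged, human_token))  # ~2.30
    print(next_token_loss(mimic, human_token))   # ~0.0: the floor is perfect mimicry

RLHF and similar tune behavior on top of that, but the base signal is still "predict what a human wrote next."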
[0] Note: I'm discounting "can write text faster than a human" as AGI here. printf in a tight loop already does that better than GPT-4o.