> presumably LLM output is going into the training data of later LLMs
The LLM vendors go to great lengths to assure their paying customers that this will not be the case. Yes, LLMs will ingest more LLM-generated slop from the public Internet. But as businesses integrate LLMs, a rising percentage of their outputs will not be included in training sets.
The LLM vendors aren't exactly the most trustworthy on this, but regardless, there are still lots of free-tier users who are definitely feeding their output back into the next generation of models.
Please describe these "great lengths". Are they allowing customer audits now?
The first law of Silicon Valley is "Fake it till you make it", with the vast majority never making it past the "Fake it" stage. Whatever the truth may be, it's a safe bet that whatever they've promised verbally is a lie, and one that will carry little consequence even if exposed.
I don't know where they land, but they are definitely telling people they are not using their outputs to train. If they are, it's not clear how big a scandal would result. I personally think it would be bad, but I clearly overindex on privacy; I thought the news of ChatGPT chats being indexed by Google would be a bigger scandal than it turned out to be.
ChatGPT training is (advertised as) off by default for plans above the prosumer level (Team & Enterprise). API results are similarly advertised as not being used for training by default.
Anthropic's policies are more restrictive, stating that customer data is not used for training.