
The core problem that would allow laundering (there isn't a good way to draw a straight, attributive line between generated code and the training examples behind it) also looks to me like a potential long-term threat to the viability of Copilot/Codex. The same gap would keep the system from knowing which published code was written by humans and which was, at least in part, one of its own outputs. Training on an undifferentiated mix of the model's outputs and human-authored code could eventually lead it into self-reinforcing over-confidence.
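A toy sketch of that feedback loop (entirely my own construction in Python; nothing here reflects how Copilot is actually trained): treat a "model" as a probability table over coding idioms, let it decode with a mild preference for its most likely idiom (temperature < 1), and retrain the next generation on that output alone.

    import random
    from collections import Counter

    # Hypothetical idiom names; the "model" is just a probability table.
    idioms = ["for-loop", "comprehension", "map-call", "while-loop"]
    probs = {i: 0.25 for i in idioms}
    TEMPERATURE = 0.7  # < 1 sharpens the distribution at decode time

    for gen in range(15):
        # Decode: sharpen the distribution, then sample a synthetic corpus.
        sharpened = {i: p ** (1 / TEMPERATURE) for i, p in probs.items()}
        total = sum(sharpened.values())
        corpus = random.choices(idioms, k=5000,
                                weights=[sharpened[i] / total for i in idioms])
        # "Retrain": the next model's probabilities are the observed frequencies.
        counts = Counter(corpus)
        probs = {i: counts[i] / len(corpus) for i in idioms}
        print(f"gen {gen:2d}: " +
              "  ".join(f"{i}={probs[i]:.3f}" for i in idioms))

Whichever idiom sampling noise happens to favor in the first few rounds gets amplified in every later one, and the distribution collapses toward it: the model ends up maximally confident in whatever it already said.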

"But snippet proposals call out to GH, so they can know which bits of code they generated!". Sometimes; but after Bob does a co-pilot assisted session, and Alice refactors to change a snippet's location and rename some variables and some other minor changes and then commits, can you still tell if it's 95% codex-generated?


