
I’d be kind of surprised if they don’t watermark the content they generate, if only so they don’t train on their own slop.


Maybe some of them already embed some simple, secret marker to identify their own generated content. But people outside the organization wouldn’t know. And even that wouldn’t prevent other companies from training models on the synthetic data.
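For the curious: one published approach to this kind of marker is a statistical "green-list" watermark (Kirchenbauer et al. style), where the generator biases each token toward a pseudorandom subset of the vocabulary keyed on the previous token, and a detector counts how often that bias shows up. Here is a minimal toy sketch, not any company's actual scheme — `VOCAB`, the function names, and the toy "model" are all made up for illustration:

```python
import hashlib
import math
import random

VOCAB = [f"w{i}" for i in range(64)]  # toy vocabulary of fake tokens

def green_list(prev_token, fraction=0.5):
    """Deterministically pick a 'green' half of VOCAB, keyed on the previous token."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    shuffled = list(VOCAB)
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(shuffled) * fraction)])

def generate_watermarked(length, seed=0):
    """Toy 'model': always sample the next token from the predecessor's green list."""
    rng = random.Random(seed)
    tokens = [rng.choice(VOCAB)]
    for _ in range(length - 1):
        tokens.append(rng.choice(sorted(green_list(tokens[-1]))))
    return tokens

def z_score(tokens, fraction=0.5):
    """Detector: how far the green-token count deviates from chance, in std devs."""
    n = len(tokens) - 1
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return (hits - fraction * n) / math.sqrt(n * fraction * (1 - fraction))
```

A watermarked sequence scores a large positive z (every transition lands in a green list), while ordinary text hovers near zero, so the detector only needs the hash key, not the model. The catch the thread points out still applies: a detector like this only helps whoever holds the key, and paraphrasing or retokenizing by a third party degrades the signal.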

Once synthetic data becomes pervasive, it’s inevitable that some of it will end up in the training process. Then it’ll be interesting to see how the information world evolves: AI-generated content built on synthetic data produced by other AIs. Over time, people may trust AI-generated content less and less.





