
0. There are already similar lawsuits against OpenAI: https://www.documentcloud.org/documents/23963237-authors-v-o...

1. Because they literally said they used Books3 in the original Llama paper. The provenance of datasets used by other models is not as well documented. Books3 is known to be pirated.

2. Being free to use doesn't mitigate the authors' complaint in any way. (Compare: "I stole your bike, but then I gave it away.") The authors and artists (in the case of image models) want either a) to not have their work included in training sets or b) to be paid for that use via licensing. In either case they must enjoin the trainer of the model.



Your points make sense. Thanks.

For point 1, there are employees at OpenAI who do know the provenance of the datasets used, and I am sure (based on my experience) that they include copyrighted works that were knowingly downloaded and added. Is not a single employee of OpenAI willing to blow the whistle?


Honestly, I doubt anybody cares that much. It's pretty much an open secret predicated on the untested idea that training is fair use, and the stakes aren't really that high in the long run even if it's not.

Losses for the tech companies will just mean training data gets more expensive. They're all spending tens of billions on new data centers, so there's not even a question of whether they can afford it.


Yes, that's all true. Plus, these days the datasets are sanitized of copyrighted works.



