> Always impressive how the AI community, both inside large companies and not, seems to assume data just exists for them to use, copyright or personal rights be damned.
Are you objecting to the use of data for training, or to the compilation of training datasets?
> Are you objecting to the use of data for training, or to the compilation of training datasets?
The GP is objecting to the casualness with which privacy rights and intellectual property rights are ignored by so many in the AI community, not to a choice of whether they object to one or another specific manifestation of how the community is doing so.
[edit added]: Those who would argue "no, there is no casual ignoring of rights", consider for example the official DMCA takedown instructions for the collection [0].
IMHO the research community in general is definitely not ignoring intellectual property rights - for natural language processing, there are papers and publication tracks in major conferences specifically about the legal issues of corpora; many researchers are quite painfully aware of all the various legal restrictions that exist.
However, the legal environment is very different than in the commercial world or consumer piracy, as it generally involves various legal exemptions (differing between locales) that do allow such usage. For example, I work in NLP research on aspects that involve handling large corpora of copyrighted text. It's easier to do it with cooperation of the publishers for various practical reasons, however, we still can and do use also the works of the publishers who would refuse to grant any permission, because local copyright law has specific exceptions that allows the usage these works for noncommercial research purposes. Doing so is not ignoring their rights, their rights are not violated but rather they are limited; their exclusivity right (privilege would be a more appropriate word) to make copies is not absolute. There are even some countries with explicit legal duty for the publishers to provide digital versions of their works to national corpora where they will be used for (among other things) machine learning models.
The specific consequence, however, is that we can't legally share the full datasets which we are using with the public, like it was done in the original post with this particular dataset, as that would be a violation of the publishers' commercial rights; we can provide them to specific researchers for limited noncommercial purposes only. But I can download this dataset or one like this and use it my research legally; just as I can rip up a physical book, scan it, make a digital copy and OCR it, and use it in a research corpus (with copies distributed to other researchers) even if the publisher disapproves.
I can imagine building a model that is a derived work, but I don't see why all models trained on a corpus that contains some text can automatically be thought of as derived from that text in a creative sense. Models like GPT are using the text as a specific instance of the latent relationships that make up our language. The text is an example of english to learn from, it's not a creative work being extended.
Training a language model, I would argue, is equivalent (in terms of being derivative) to generating a list of the top ten words in a corpus of text. Not really a derived work.
One possible counterpoint- sample based explanation techniques can tell what training data was most influential in a ML model's prediction, and this has been considered in language models[0]. So you could argue that if there are training examples within a corpus that are having an outsized influence on the model output, then maybe it is derivative. This would be pretty cool to look at - are some GPT or other language model outputs relying strongly on a few sources?
It is probably not a derived work. To be a derived work it is not enough to simply use the original work in its construction; it must contain major copyrightable elements of the original work. The weights and parameters present in the model do not seem to fit that description; generally you can’t copyright a bunch of numbers. That said, I don’t think this has been confirmed by a major court case yet, and I wouldn’t be surprised if that happens at some point.
"Always impressive how the AI community, both inside large companies and not, seems to assume data just exists for them to use, copyright or personal rights be damned."
You did notice those files are both hosted on the same site, I trust? I suspect the mp4 speaks much more directly to the actual state of mind and beliefs of the creator(s) of the collection. In the context of this conversation (discussing the attitudes of many in the AI community towards privacy and intellectual property rights in training data), that would make the mp4 the more official statement of policy.
Are you objecting to the use of data for training, or to the compilation of training datasets?