Always impressive how the AI community, both inside large companies and outside them, seems to assume data just exists for them to use, copyright or personal rights be damned.
Aside: it's interesting to think about the legal ramifications of copyright in the face of AI. If my AI model "reads" a bunch of text, is the model now a "derived work" of the copyrighted work? Perhaps. But of course humans don't infringe when they watch/listen to/read a copyrighted work and it changes their state of mind. Perhaps the model can be said not to infringe if it sufficiently abstracts concepts of the art in question. Typically when tuning a model we include lots of counter-cases to avoid overfitting - to make sure that the model is indeed abstracting concepts rather than memorizing them.
It is derivative. There’s then a question of whether the derived work is sufficiently transformative to be fair use, which depends on what the model is outputting.
But yeah that’s separate from the question of whether you properly licensed the data to train on in the first place.
A big chunk of the computing community seems to approach licensing as “I can see it, so I can use it.” (See GPL code used where it shouldn’t be.)
> But of course humans don't infringe when they watch/listen/read a copyrighted work and it changes their state of mind.
You're not generally distributing, performing, &c. your state of mind. On the other hand, if you then go and generate work based on what you read, then distributing that work certainly can be infringing. Thus "clean room" techniques, where one team reads copyrighted text, writes up a spec (which may then be checked off by lawyers), and then another team, without reading the copyrighted text, implements based off the spec, are sometimes used to attempt to launder copyright taint.
Taking a step back, the fact that well-known authors have infamously declared that they don't read fanfiction as a CYA move against accusations of plagiarism suggests that, yes, what a brain produces after reading a text is widely considered potentially derivative.
> You're not generally distributing, performing, &c. your state of mind.
I don't think I agree. My employer values my contribution in some part as an oracle. People at work ask me questions and I answer them. Those answers come from the sum of my experiences (a biochemical 'model'). Other people more directly conduct public performances of their talent.
If nothing else this will likely blur the lines on what's considered original work.
Distributing an exact copy or paraphrased content of an original work would be subject to copyright, but if a model just generates its own inferences, I don't see how it's different from, say, a human-generated review.
As technical civilization gains power, it becomes able to make things in between social categories. Is a simulation of a human brain a person? And one compressed to half-size? And compressed again? And again?
It's a grains-to-heap problem; eventually, you have to make an ugly choice.
Seriously. At one point I curated a dataset similar to OP's, but because it was largely work to which I didn't own the copyright, there's no way in hell I was going to distribute it to others (because that would be doing exactly what that internet library got sued for). I can't freely distribute the last 20 years of TV and movies just because it's useful for academic research.
Hell, unless you paid for it, you shouldn’t even have the collection to train on, whether you distribute it or not.
Effectively, everyone's focusing on whether your story(/model) is sufficiently transformative from Die Hard to be fair use for you to distribute it, without addressing the complaint that you snuck into the theater too, and held the door for others to follow you in.
> Always impressive how the AI community, both inside large companies and not, seems to assume data just exists for them to use, copyright or personal rights be damned.
Are you objecting to the use of data for training, or to the compilation of training datasets?
> Are you objecting to the use of data for training, or to the compilation of training datasets?
The GP is objecting to the casualness with which privacy rights and intellectual property rights are ignored by so many in the AI community, not to any one specific manifestation of how the community is doing so.
[edit added]: To those who would argue "no, there is no casual ignoring of rights": consider, for example, the official DMCA takedown instructions for the collection [0].
IMHO the research community in general is definitely not ignoring intellectual property rights - for natural language processing, there are papers and publication tracks in major conferences specifically about the legal issues of corpora; many researchers are quite painfully aware of all the various legal restrictions that exist.
However, the legal environment is very different from the commercial world or consumer piracy, as it generally involves various legal exemptions (differing between locales) that do allow such usage. For example, I work in NLP research on aspects that involve handling large corpora of copyrighted text. It's easier to do with the cooperation of the publishers for various practical reasons; however, we still can and do use the works of publishers who would refuse to grant any permission, because local copyright law has specific exceptions that allow the use of these works for noncommercial research purposes. Doing so is not ignoring their rights: their rights are not violated but rather limited; their exclusive right (privilege would be a more appropriate word) to make copies is not absolute. There are even some countries where publishers have an explicit legal duty to provide digital versions of their works to national corpora, where they will be used for (among other things) machine learning models.
The specific consequence, however, is that we can't legally share the full datasets we are using with the public, as was done in the original post with this particular dataset, since that would be a violation of the publishers' commercial rights; we can provide them to specific researchers for limited noncommercial purposes only. But I can download this dataset, or one like it, and use it in my research legally; just as I can rip up a physical book, scan it, make a digital copy, OCR it, and use it in a research corpus (with copies distributed to other researchers) even if the publisher disapproves.
I can imagine building a model that is a derived work, but I don't see why all models trained on a corpus that contains some text can automatically be thought of as derived from that text in a creative sense. Models like GPT are using the text as a specific instance of the latent relationships that make up our language. The text is an example of English to learn from; it's not a creative work being extended.
Training a language model, I would argue, is equivalent (in terms of being derivative) to generating a list of the top ten words in a corpus of text. Not really a derived work.
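To make that analogy concrete, here's a minimal Python sketch of that kind of corpus-level summary (corpus.txt is a placeholder path, not anything from the original post):

```python
from collections import Counter
import re

def top_words(text, n=10):
    # Tokenize crudely and count. The result is an aggregate
    # statistic of the corpus, not a reproduction of any passage.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

with open("corpus.txt") as f:  # placeholder corpus file
    print(top_words(f.read()))
```

The argument, then, is that model weights are the same kind of aggregate statistic, just vastly more of them.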
One possible counterpoint: sample-based explanation techniques can tell which training data was most influential in an ML model's prediction, and this has been considered for language models [0]. So you could argue that if there are training examples within a corpus that have an outsized influence on the model's output, then maybe it is derivative. This would be pretty cool to look at: do some GPT or other language model outputs rely strongly on a few sources?
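For intuition on how such techniques work, here is a hedged sketch of one simple variant: scoring a training example's influence on a test prediction by the dot product of their loss gradients, in the spirit of TracIn-style methods (an illustration of the general idea, not necessarily the method from [0]; `model` and `loss_fn` are assumed to be an ordinary PyTorch model and loss):

```python
import torch

def influence_score(model, loss_fn, train_example, test_example):
    """Gradient-dot-product influence at a single checkpoint: how much
    the training example's loss gradient points in the same direction
    as the test example's. Large positive values suggest the training
    example pushed the model toward this test prediction."""
    params = [p for p in model.parameters() if p.requires_grad]

    def loss_grads(example):
        x, y = example
        return torch.autograd.grad(loss_fn(model(x), y), params)

    g_train = loss_grads(train_example)
    g_test = loss_grads(test_example)
    return sum((gt * ge).sum() for gt, ge in zip(g_train, g_test)).item()
```

Ranking a corpus's documents by this score for a given generation would be one crude way to check whether the output leans heavily on a few sources.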
It is probably not a derived work. To be a derived work it is not enough to simply use the original work in its construction; it must contain major copyrightable elements of the original work. The weights and parameters present in the model do not seem to fit that description; generally you can’t copyright a bunch of numbers. That said, I don’t think this has been confirmed by a major court case yet, and I wouldn’t be surprised if that happens at some point.
"Always impressive how the AI community, both inside large companies and not, seems to assume data just exists for them to use, copyright or personal rights be damned."
You did notice those files are both hosted on the same site, I trust? I suspect the mp4 speaks much more directly to the actual state of mind and beliefs of the creator(s) of the collection. In the context of this conversation (discussing the attitudes of many in the AI community towards privacy and intellectual property rights in training data), that would make the mp4 the more official statement of policy.
Another related aspect of this is the hiQ vs. LinkedIn decision, which many interpreted as now setting precedent for allowing scraping of public information.
It's not that simple, unfortunately: companies can still send a C&D or threaten to invoke the CFAA, and it would still be a mess. Although in practice it's not worth the effort for companies to sue as long as the scraper isn't monetarily benefiting from the data (which is what led to the hiQ case in the first place).