*> Always impressive how the AI community, both inside large companies and not, ...

yodon · on Oct 25, 2020

> Are you objecting to the use of data for training, or to the compilation of training datasets?

The GP is objecting to the casualness with which privacy rights and intellectual property rights are ignored by so many in the AI community, not to a choice of whether they object to one or another specific manifestation of how the community is doing so.

[edit added]: Those who would argue "no, there is no casual ignoring of rights", consider for example the official DMCA takedown instructions for the collection [0].

[0] http://the-eye.eu/dmca.mp4

PeterisP · on Oct 26, 2020

IMHO the research community in general is definitely not ignoring intellectual property rights - for natural language processing, there are papers and publication tracks in major conferences specifically about the legal issues of corpora; many researchers are quite painfully aware of all the various legal restrictions that exist.

However, the legal environment is very different from that of the commercial world or consumer piracy, as it generally involves various legal exemptions (differing between locales) that do allow such usage. For example, I work in NLP research on aspects that involve handling large corpora of copyrighted text. It's easier to do it with the cooperation of the publishers for various practical reasons; however, we still can and do also use the works of publishers who would refuse to grant any permission, because local copyright law has specific exceptions that allow the use of these works for noncommercial research purposes. Doing so is not ignoring their rights; their rights are not violated but rather limited: their exclusivity right (privilege would be a more appropriate word) to make copies is not absolute. There are even some countries with an explicit legal duty for publishers to provide digital versions of their works to national corpora, where they will be used for (among other things) machine learning models.

The specific consequence, however, is that we can't legally share the full datasets we are using with the public, as was done in the original post with this particular dataset, because that would be a violation of the publishers' commercial rights; we can provide them to specific researchers for limited noncommercial purposes only. But I can download this dataset or one like it and use it in my research legally; just as I can rip up a physical book, scan it, make a digital copy and OCR it, and use it in a research corpus (with copies distributed to other researchers) even if the publisher disapproves.

nix23 · on Oct 25, 2020

The question is: if an AI reads a book, is it against copyright?

Or is the trained model a derived work of those books?

mdifrgechd · on Oct 25, 2020

I can imagine building a model that is a derived work, but I don't see why all models trained on a corpus that contains some text can automatically be thought of as derived from that text in a creative sense. Models like GPT are using the text as a specific instance of the latent relationships that make up our language. The text is an example of English to learn from; it's not a creative work being extended.

Training a language model, I would argue, is equivalent (in terms of being derivative) to generating a list of the top ten words in a corpus of text. Not really a derived work.
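To make the comparison concrete, here is a rough sketch of what "generating a list of the top ten words in a corpus" amounts to. The function name `top_words`, the tokenizing regex, and the sample string are all made up for illustration; nothing here comes from any real training pipeline.

```python
import re
from collections import Counter


def top_words(corpus, n=10):
    """Return the n most frequent words in corpus as (word, count) pairs."""
    # Lowercase the text and keep runs of letters/apostrophes as "words".
    # This is a deliberately crude tokenizer, chosen only for illustration.
    words = re.findall(r"[a-z']+", corpus.lower())
    return Counter(words).most_common(n)


# Hypothetical stand-in for a corpus of copyrighted text.
sample = "the cat sat on the mat and the dog sat on the log"
print(top_words(sample, 3))
```

The output is an aggregate statistic over the text and carries none of its expression, which is the sense in which the comparison is being made; whether a trained model is really analogous is exactly what the thread is arguing about.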

mdifrgechd · on Oct 25, 2020

One possible counterpoint: sample-based explanation techniques can tell what training data was most influential in an ML model's prediction, and this has been considered in language models[0]. So you could argue that if there are training examples within a corpus that are having an outsized influence on the model output, then maybe it is derivative. This would be pretty cool to look at: are some GPT or other language model outputs relying strongly on a few sources?

[0] https://arxiv.org/abs/1810.03611

lacker · on Oct 25, 2020

It is probably not a derived work. To be a derived work it is not enough to simply use the original work in its construction; it must contain major copyrightable elements of the original work. The weights and parameters present in the model do not seem to fit that description; generally you can’t copyright a bunch of numbers. That said, I don’t think this has been confirmed by a major court case yet, and I wouldn’t be surprised if that happens at some point.

fakedang · on Oct 26, 2020

But for a dynamically learning machine model, can you even copyright the numbers?

mcguire · on Oct 25, 2020

The question is, "Do the creators of bibliotik have proper licenses to aggregate those books into a dataset?" The dataset is certainly a derived work.

Edit: https://the-eye.eu/public/Books/humble_books_20180509/

Cue the "they're depriving me of my income" complaints from authors.

nix23 · on Oct 25, 2020

>The dataset is certainly a derived work

That was not the question.

mcguire · on Oct 25, 2020

"Always impressive how the AI community, both inside large companies and not, seems to assume data just exists for them to use, copyright or personal rights be damned."

nix23 · on Oct 25, 2020

And you were thinking, let's rephrase that question because it's not already boring and obvious enough?

Spooky23 · on Oct 25, 2020

If AI has human like rights to read and learn and create derived works, what other rights does it have?

Perhaps we should figure that out before we start a dystopian nightmare of written material.

ahoka · on Oct 25, 2020

Of course it is a derived work.

kerng · on Oct 25, 2020

Maybe not so obvious. If you read a book, are you a derivative work?

nix23 · on Oct 25, 2020

Soo reading the Kama Sutra leads to derivative "work" too :)

blackoil · on Oct 25, 2020

I read a Java book and created programs using knowledge given in the book. Is my program a derived work?

ahoka · on Oct 25, 2020

A book on Java was written to teach you Java, so it would be foolish to claim any copyright on your program.

nix23 · on Oct 25, 2020

That opens some interesting questions...

jhauris · on Oct 25, 2020

IANAL: I think it's pretty clear it's a derived work; the open question afaik is whether it's fair use or not.

isoprophlex · on Oct 25, 2020

Sweet jesus those instructions go on forever, imagine performing that (without breaking into laughter)

webmaven · on Oct 28, 2020

I'm pretty sure that it's the same few seconds looped for ten minutes.

yorwba · on Oct 25, 2020

The official DMCA takedown instructions are actually at http://the-eye.eu/dmca/ and look much less unusual.

yodon · on Oct 25, 2020

You did notice those files are both hosted on the same site, I trust? I suspect the mp4 speaks much more directly to the actual state of mind and beliefs of the creator(s) of the collection. In the context of this conversation (discussing the attitudes of many in the AI community towards privacy and intellectual property rights in training data), that would make the mp4 the more official statement of policy.