Always impressive how the AI community, both inside large companies and outside them, seems to assume data just exists for them to use, copyright or personal rights be damned.
Aside: it's interesting to think about the legal ramifications of copyright in the face of AI. If my AI model "reads" a bunch of text, is the model now a "derived work" of the copyrighted work? Perhaps. But of course humans don't infringe when they watch/listen to/read a copyrighted work and it changes their state of mind. Perhaps the model can be said not to infringe if it sufficiently abstracts concepts of the art in question. Typically when tuning a model we include lots of counter-cases to avoid overfitting - to make sure that the model is indeed abstracting concepts rather than memorizing them.
It is derivative. There’s then a question of whether the derived work is sufficiently transformative to be fair use, which depends on what the model is outputting.
But yeah that’s separate from the question of whether you properly licensed the data to train on in the first place.
A big chunk of the computing community seems to approach licensing as “I can see it, so I can use it.” (See GPL code used where it shouldn’t be.)
> But of course humans don't infringe when they watch/listen/read a copyrighted work and it changes their state of mind.
You're not generally distributing, performing, &c. your state of mind. On the other hand, if you then go and generate work based on what you read, then distributing that work certainly can be infringing. Thus "clean room" techniques, where one team reads copyrighted text, writes up a spec (which may then be checked off by lawyers), and then another team, without reading the copyrighted text, implements based off the spec, are sometimes used to attempt to launder copyright taint.
Taking a step back, the fact that well-known authors have infamously declared that they don't read fanfiction as a CYA move against accusations of plagiarism suggests that, yes, what a brain produces after reading a text is widely considered potentially derivative.
> You're not generally distributing, performing, &c. your state of mind.
I don't think I agree. My employer values my contribution in some part as an oracle. People at work ask me questions and I answer them. Those answers come from the sum of my experiences (a biochemical 'model'). Other people more directly conduct public performances of their talent.
If nothing else this will likely blur the lines on what's considered original work.
Distributing an exact copy or paraphrased content of an original work would be subject to copyright, but if a model just generates its own inferences, I don't see how it's different from, say, a human-generated review.
As technical civilization gains power, it becomes able to make things in between social categories. Is a simulation of a human brain a person? And one compressed to half-size? And compressed again? And again?
It's a grains-to-heap problem; eventually, you have to make an ugly choice.
Seriously. At one point I curated a dataset similar to OP's, but because it was largely work to which I didn't own the copyright, there's no way in hell I was going to distribute it to others (because that would be doing exactly what that internet library got sued for). I can't freely distribute the last 20 years of TV and movies just because it's useful for academic research.
Hell, unless you paid for it, you shouldn’t even have the collection to train on, whether you distribute it or not.
Effectively, everyone's focusing on whether your story(/model) is sufficiently transformative from Die Hard to be fair use for you to distribute it, without addressing the complaint that you snuck into the theater too, and held the door for others to follow you in.
> Always impressive how the AI community, both inside large companies and not, seems to assume data just exists for them to use, copyright or personal rights be damned.
Are you objecting to the use of data for training, or to the compilation of training datasets?
> Are you objecting to the use of data for training, or to the compilation of training datasets?
The GP is objecting to the casualness with which privacy rights and intellectual property rights are ignored by so many in the AI community, not to any one specific manifestation of how the community is doing so.
[edit added]: To those who would argue "no, there is no casual ignoring of rights": consider, for example, the official DMCA takedown instructions for the collection [0].
IMHO the research community in general is definitely not ignoring intellectual property rights - for natural language processing, there are papers and publication tracks in major conferences specifically about the legal issues of corpora; many researchers are quite painfully aware of all the various legal restrictions that exist.
However, the legal environment is very different from the commercial world or consumer piracy, as it generally involves various legal exemptions (differing between locales) that do allow such usage. For example, I work in NLP research on aspects that involve handling large corpora of copyrighted text. It's easier to do with the cooperation of the publishers for various practical reasons; however, we still can and do use the works of publishers who would refuse to grant any permission, because local copyright law has specific exceptions that allow the use of these works for noncommercial research purposes. Doing so is not ignoring their rights: their rights are not violated but rather limited; their exclusive right (privilege would be a more appropriate word) to make copies is not absolute. There are even some countries where publishers have an explicit legal duty to provide digital versions of their works to national corpora, where they will be used for (among other things) machine learning models.
The specific consequence, however, is that we can't legally share the full datasets we are using with the public, as was done in the original post with this particular dataset, since that would be a violation of the publishers' commercial rights; we can provide them to specific researchers for limited noncommercial purposes only. But I can download this dataset, or one like it, and use it in my research legally; just as I can rip up a physical book, scan it, make a digital copy, OCR it, and use it in a research corpus (with copies distributed to other researchers) even if the publisher disapproves.
I can imagine building a model that is a derived work, but I don't see why all models trained on a corpus that contains some text can automatically be thought of as derived from that text in a creative sense. Models like GPT are using the text as a specific instance of the latent relationships that make up our language. The text is an example of English to learn from; it's not a creative work being extended.
Training a language model, I would argue, is equivalent (in terms of being derivative) to generating a list of the top ten words in a corpus of text. Not really a derived work.
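To make that analogy concrete, here's a minimal Python sketch of that kind of corpus-level summary (corpus.txt is a placeholder path, not anything from the original post):

```python
from collections import Counter
import re

def top_words(text, n=10):
    # Tokenize crudely and count. The result is an aggregate
    # statistic of the corpus, not a reproduction of any passage.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

with open("corpus.txt") as f:  # placeholder corpus file
    print(top_words(f.read()))
```

The argument, then, is that model weights are the same kind of aggregate statistic, just vastly more of them.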
One possible counterpoint: sample-based explanation techniques can tell which training data was most influential in an ML model's prediction, and this has been considered for language models [0]. So you could argue that if there are training examples within a corpus that have an outsized influence on the model's output, then maybe it is derivative. This would be pretty cool to look at: do some GPT or other language model outputs rely strongly on a few sources?
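For intuition on how such techniques work, here is a hedged sketch of one simple variant: scoring a training example's influence on a test prediction by the dot product of their loss gradients, in the spirit of TracIn-style methods (an illustration of the general idea, not necessarily the method from [0]; `model` and `loss_fn` are assumed to be an ordinary PyTorch model and loss):

```python
import torch

def influence_score(model, loss_fn, train_example, test_example):
    """Gradient-dot-product influence at a single checkpoint: how much
    the training example's loss gradient points in the same direction
    as the test example's. Large positive values suggest the training
    example pushed the model toward this test prediction."""
    params = [p for p in model.parameters() if p.requires_grad]

    def loss_grads(example):
        x, y = example
        return torch.autograd.grad(loss_fn(model(x), y), params)

    g_train = loss_grads(train_example)
    g_test = loss_grads(test_example)
    return sum((gt * ge).sum() for gt, ge in zip(g_train, g_test)).item()
```

Ranking a corpus's documents by this score for a given generation would be one crude way to check whether the output leans heavily on a few sources.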
It is probably not a derived work. To be a derived work it is not enough to simply use the original work in its construction; it must contain major copyrightable elements of the original work. The weights and parameters present in the model do not seem to fit that description; generally you can’t copyright a bunch of numbers. That said, I don’t think this has been confirmed by a major court case yet, and I wouldn’t be surprised if that happens at some point.
"Always impressive how the AI community, both inside large companies and not, seems to assume data just exists for them to use, copyright or personal rights be damned."
You did notice those files are both hosted on the same site, I trust? I suspect the mp4 speaks much more directly to the actual state of mind and beliefs of the creator(s) of the collection. In the context of this conversation (discussing the attitudes of many in the AI community towards privacy and intellectual property rights in training data), that would make the mp4 the more official statement of policy.
Another related aspect of this is the hiQ vs. LinkedIn decision, which many interpreted as now setting precedent for allowing scraping of public information.
It's not that simple, unfortunately: companies can still send a C&D or threaten to invoke the CFAA, and it would still be a mess. Although in practice it's not worth the effort for companies to sue as long as the scraper isn't monetarily benefiting from the data (which is what led to the hiQ case in the first place).