OpenAI's justification for why training data is fair use, not infringement [pdf] (uspto.gov)
197 points by sillysaurusx on Oct 5, 2023 | 425 comments


Skimming, this quote by itself seems enough to solidly make their case:

>>> "The “object of the original creation,” in other words, is direct human consumption of the author’s expression. 18 Intermediate copying of works in training AI systems is, by contrast, “non-expressive”:19 the copying helps computer programs learn the patterns inherent in human-generated media. The aim of this process—creation of a useful generative AI system—is quite different than the original object of human consumption.20 The output is different too: nobody looking to read a specific webpage contained in the corpus used to train an AI system can do so by studying the AI system or its outputs. The new purpose and expression are thus both highly transformative.21"

Like, you can make a transformer output some part of its training data - but not an entire work, and more to the point, copying is not the purpose of the model or its training data.


I think we should not stop innovation, but let's not minimize the impact on the producers of the content (the whole of humanity).

First of all, even if the models never reproduce any of the copyrighted content verbatim, they will soon be good enough that they will be able to replace the work of the people [1] that produced the content the AI was trained on, like writers and programmers.

Second of all, the nature of the use of the copyrighted content is purely for-profit, and the ability to squeeze profit out of the best models will stay with the largest corporations -- effectively transferring wealth from the people who produced the content to the corporations.

We need to find a way to make sure that access to the data the models were trained on, and to the resulting models, is open and fair. Who wrote which paper [2], or who supplied the GPUs for the initial training, should not really matter that much in the grand scheme of things.

[1] or at least cause downward pressure on their earning potential

[2] you could apply similar arguments here to the ones they use for the copyrighted content -- each paper (except maybe the Attention Is All You Need paper) contributed only marginally


> We need to find a way to make sure that access to the data the models were trained on, and to the resulting models, is open and fair.

Hear, hear. Unfortunately this is more or less impossible given current copyright law.

Suppose you scrape libgen and turn it into training data, then you release the training data publicly. Since the vast majority of every book appears verbatim in the training data, is this sufficiently transformative?

I think yes, it is, because nobody is going to read those books from the training data. When I made books3, I felt it was important to render each book into high quality text. But it turns out that when you convert Jurassic Park into a text file, there's no good way to read it anymore. Good luck trying to bookmark wherever you left off -- it's all one gigantic file.

But nobody seems to agree. The Danish Rights Alliance (https://rettighedsalliancen.com/) aggressively DMCA'ed anyone that hosted books3, even going so far as to DMCA The Pile from academictorrents: https://academictorrents.com/details/0d366035664fdf51cfbe9f7... with the justification that ~100 copyrighted books appear in the training data, so therefore they have the right to DMCA. Right now most of the world seems to agree with them, but I'm hoping that opinion will shift as the years tick by. Surely no one can believe that a plain text document poses a serious threat of economic harm to the original author. So the question is whether the original author should be allowed to deny everyone else the right to transform their work into a form that machines can read.

For my part, I've been planning a books4 dataset, but this time similar to LAION: it's a script that spiders libgen torrents (https://libgen.gs/torrents/libgen/) and converts all the epubs into text files. That way, if LAION isn't infringing, then books4 can't be infringing either. (Of course, hosting the actual training data anywhere is pretty hard nowadays, but it should only take a few days to convert 38TB of libgen into ~2TB of plain text.)
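(A minimal sketch of that epub-to-text conversion step, assuming the ebooklib and beautifulsoup4 packages; the epub_to_text helper here is hypothetical, not the actual books3/books4 code:)

    import ebooklib
    from ebooklib import epub
    from bs4 import BeautifulSoup

    def epub_to_text(path: str) -> str:
        # Parse the epub container and walk its HTML chapter documents.
        book = epub.read_epub(path)
        chapters = []
        for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
            # Strip the markup, keeping only the visible text.
            soup = BeautifulSoup(item.get_content(), "html.parser")
            chapters.append(soup.get_text(separator="\n"))
        return "\n\n".join(chapters)

Run over a directory of epubs, that produces exactly the kind of one-gigantic-text-file-per-book output described above.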

This is the only way to create an open source competitor to ChatGPT.


I'm actually on your side because I think copyright laws should be radically nerfed and things like books3 are greatly beneficial to society, but I wouldn't buy your argument about "no good way to read plain text files".

The text file contains all the text in the book in an (at the very least) machine-readable way; it is perfectly feasible to write a program to put it back in ebook form or to play it as an audiobook, and then the text file becomes desirable for laypeople to read.


Unfortunately this is true of any encoding scheme, short of scrambling the order of the paragraphs. And although scrambling the order might seem tempting, it destroys the ability to train large context windows — 32k context tokens is enough to fit most of a book into a prompt, and this window will only grow bigger.

If your (quite reasonable) argument holds water, then it sinks our ability to ever share copyrighted training data with large context windows. That’s why it seemed important to justify a plain text training format, since anything on top of it would be equivalent.

Indeed, one alternate training format would be to ship the raw html from every epub file, then process it into text at runtime. But this makes it trivial to reconstruct the original epub file and use it in an actual book reader.

It’s frustrating that we can’t share the epub files, because there are so many advantages: you can scrape the metadata, you can tweak the rendering to plain text, you can get semantic info from the images (and even OCR them — it turns out that lots of coding epubs show code examples as screenshots, because epub html rendering is so primitive, so this would be the only way to let your model learn from those).

All of that is why I’m leaning towards "make a script to spider all of libgen and cache the epub files locally". But I haven’t finished calculating how much disk space this would require.

I’m sad that researchers will have to wait days for their training data instead of downloading it in a few hours from a high speed cache, but it seems like any such cache would be swiftly DMCA’ed, so there’s no alternative.


I don't know if this is tangential or not, but thank you. You've helped me progress my armchair understanding of how we might give the concept of copyright more finesse in the digital age.

The insight is that encoding content is not functionally the same as copying a work verbatim, which is what the concept was originally meant to prevent.

For example: if I have legally obtained a copy of e.g. https://archive.org/details/free_culture, then I am at liberty to encode it in whatever format I need to be able to feed it to a machine/tool for whatever purpose I want. I am not infringing on copyright because I legally obtained the work, and the machine is not infringing because machines can't.

I think at least this much OpenAI has in their favor. If they can prove they legally obtained all the training material (I do think it's fair for them to be required to pay once for it) then I don't think there's any world in which it makes sense to try and allow content creators to further extract royalties from that process alone.

If a user asks an AI for a copyrighted poem, for instance, and the user goes and republishes that poem as their own, I do think it may make sense to grant the original author royalties under current law.

I really hope we can legally pick these two concepts apart and focus on each scenario independently. I see a lot of people here arguing that giving a work that you have legally obtained the rights to view to a machine model is inherently copyright infringement because you had to copy the work and yada yada is it fair use? I really think this is wrong, both in the interpretation of copyright and practically, as ideas can't be owned and it makes no sense to limit which ideas were used to make which commercial product, etc. (we don't do that today and AI doesn't change that).


>If your (quite reasonable) argument holds water, then it sinks our ability to ever share copyrighted training data with large context windows

Only if the training data is under copyright! How about data that is in the public domain, or granted a license specifically for training - Mozilla's Common Voice is an example.

Maybe someday there will be rulings that do for ML training datasets what earlier ones did to legalize cleanroom design.


Sadly, there just isn't enough public domain data. Relying on it alone would make it impossible to catch up to ChatGPT.

One way to see this is to imagine a Midjourney competitor trained solely on public domain images. The visual quality of the model will always be worse.

As for licensing, I agree for commercial entities, but there should be exceptions. If a model is open source, it benefits everyone, and so it shouldn’t need to have been licensed. There are a few reasons why this is pretty important, but the main one is that without it, the open source community has no chance whatsoever of creating cutting edge models.


I think you can reasonably imagine that if content is available in a library then everyone could organize an effort to check out the content and add it to the model. That effort sounds like useless theatre at that point so just let the commons keep an `all-booksN.zip` corpus around for the purpose.

You know, you could probably even argue traditionally that taking all the books in the world and adding them to a corpus for the purpose of creating an LLM would be a transformative work since it doesn't compete with or detract from any of the originals…


Why can't I read or bookmark a huge text file? I can and have. Books3 (though this is the first time I'm hearing of it) shouldn't be fair use because it's inconvenient to read, it should be fair use because reading is not the intent. The intent is to train computers with it.


It does contain copies of copyrighted works.


Well, yes, otherwise you wouldn't need fair use, you'd just use it.


I would like to get in touch with you about books4. Do you happen to have Discord? Or would Twitter be OK?

There are currently multiple attempts at creating what you describe as books4.


Please do! Twitter DM is the most reliable. Or if you put some contact info in your profile I can reach out.


Added, and reached out on Twitter.


I'm okay with the idea of no copyrights at all on works that are fully or partially AI derived. Of course, that introduces even more thorny arguments (what legally constitutes "AI"), but the courts can work that out.

It's clear already that companies are going to want to have their cake and eat it too. When they leverage AI to produce content, they don't want to be forced to pay licensing fees on the inputs to their models. But on the flip side, they absolutely don't want their content being used in models leveraged by other people.

Not allowing copyrights on wholly or partially AI-generated work is probably the lesser evil path to take.


I mean we never seem to care about the downward pressures we put on others, so this seems a contrived place to do it now.

My take on the matter is that facts aren't copyrightable. An LLM is basically recording facts.

I have a background in stylometry, where you basically determine authorship by recording statistics on known pieces of writing to try to find the author of another piece of writing. So, just extracting facts. You could easily now generate random sentences and filter the ones that pass a certain threshold. LLMs are basically this on steroids. So it's just facts and data, not copyright infringement.

Otherwise, you'd have to argue that software reading the text contents of a book is copyright infringement.


> I have a background in stylometry, where you basically determine authorship by recording statistics on known pieces of writing to try to find the author of another piece of writing. So, just extracting facts

The logical jump that happens around this period symbol is spectacular.


I was assuming people were familiar with the process and handwaving the rest.

You measure things like:

* What is the average sentence length?

* What is the ratio of adjectives to nouns?

* What percentage of sentences are in the passive voice?

* What is the distribution of words used?

* Is "the" used more often by the other than in general usage?

* "whence" vs "when".

* "tyre" vs "tire". Etc.

This could all be printed out and sold as a book and each measure would be a very boring fact that is not copyrightable and contains no copyrightable content.

I could give you a word frequency list from the A Song of Ice and Fire series and George RR Martin could do nothing about it. (Technically, he might have the rights to "Arya" and "Lannister", but to say "the word 'Lannister' appears 1,337 times in the series" would clearly fall under fair use).
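To make the "just facts" point concrete, here is a minimal sketch of that kind of measurement (the style_profile helper is hypothetical, purely to illustrate the measures listed above):

    import re
    from collections import Counter

    def style_profile(text: str) -> dict:
        # Rough sentence and word tokenization.
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[a-z']+", text.lower())
        freq = Counter(words)
        # Every entry is a plain statistic; none reproduces the text itself.
        return {
            "avg_sentence_length": len(words) / max(len(sentences), 1),
            "rate_of_the": freq["the"] / max(len(words), 1),
            "whence_vs_when": (freq["whence"], freq["when"]),
            "tyre_vs_tire": (freq["tyre"], freq["tire"]),
            "top_words": freq.most_common(10),
        }

Feed it the text of a novel and the output is a handful of counts and ratios - the sort of thing that could be "printed out and sold as a book" without reproducing any expression.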


Sorry, it's nonsense. You're basically saying:

1. Anything existing is considered a fact.

2. Stating a fact doesn't infringe copyright.

The only conclusion your logic can lead to is that copyright doesn't exist.

For example, I can say "the first sentence of the A Song of Ice and Fire series is <insert the first sentence>, the second sentence is <insert the second sentence>, the third sentence is...". It's still just a list of facts, so no copyright infringement, right?


I think in that case, if you went through the whole book like that, a judge would argue that you are infringing, because you can recover the whole book that way.

But if you went through and grabbed just the very first sentence from each chapter, that is sufficiently in the clear. Especially when you are doing something transformative with it, like analysing what makes for an effective opening sentence.

With copyright you always get into ugly slippery-slope arguments; Tom Scott has a great video on it.

But if you cannot extract something that replaces the original work, you should already be in the clear. I'm pretty sure Coles Notes don't need the original author's permission to print, and they are arguably a substitute for reading the original work. I've skimmed through entire series by reading the individual episode summaries on Wikipedia. These are in the clear.

The fact that a machine speeds up a human process does not change legality.


How would reproduction of fictional works or the creation of derivative works based on fictional training inputs constitute “a recording of facts”?


I think you're conflating two issues. As I understand it, this justification is about whether the training itself is fair-use. The discussion on whether the outputs are derivative or transformative is separate. Analogously, if I publish a copy of copyrighted poems from memory that would be infringement, but there's nothing infringing about me just reading and memorizing them.


LLMs - as in the models, the weights - do not contain "reproduction of fictional works or the creation of derivative works based on fictional training inputs" in any meaningful sense. "Reproduction of fictional works" is what is present in the training data. "Creation of derivative works based on fictional training inputs" is what you might be doing as a user of a LLM. The models themselves are, like GP said, recordings of statistical properties of text, just taken up to 11.


> do not contain "reproduction of fictional works or the creation of derivative works based on fictional training inputs" in any meaningful sense.

Except it does, and it's in a quite literal sense.

You painted a painting. I took a picture of it and compressed it as a .jpeg. Byte-wise, the .jpeg file has no similarity to your original painting. For someone who doesn't know .jpeg, it's just garbage bytes.

So I didn't infringe your copyright by selling this .jpeg file. The user who decodes the .jpeg file and displays it on a monitor does.

Does that sound right? This is how the "weights do not contain training data" argument works.

And before "how about artists who store this information in their brains?" Well human being is a special case for every law in every country. Just like selling a cow's liver is never the same as selling a man's, even they're both organic tissues. A human's brain is always going to be treated differently than a hard drive.


My argument is that the models are somewhere between your JPEG example and the "artists' brains" example. Now legally, it's usually not the bytes themselves that matter, but their colour (provenance). But this doesn't make the case of AI models any more clear - the training process is also somewhere between zipping up a folder full of JPEGs, and a human practicing their art through a fuck ton of repetitive reproductions of existing works until they grok the style.


I agree.

The thing I hate about this whole situation is that it's going to be decided solely on which side has more lobbying power[1], since just as you said, it's something in between.

[1]: One could say all the policies are decided this way...


> [1]: One could say all the policies are decided this way...

The silver lining here is that lobbyists can only afford to care about a finite number of things at a time - so even if they get some laws their way, other factions can push laws mitigating the damage somewhat.

I currently believe (weakly) that the outcome of more and better lobbying isn't laws getting directly worse for society, but rather the regulatory system grinding to a halt under an increasing number of laws that exist to cancel out parts of other laws...


> LLMs - as in the models, the weights - do not contain "reproduction of fictional works or the creation of derivative works based on fictional training inputs" in any meaningful sense.

Since portions of such works can be recovered in inference, they contain at least a lossily-compressed copy of the collection of works used in training. Reproduction isn’t pure coincidence.


Indeed, but learning is effectively[0] a form of lossy compression too. DNN weights are somewhere between zipping up a truckload of JPEGs and deriving facts from first principles. Where exactly they are on this spectrum, and how that affects copyright, is not obvious to me - at least not when trying to argue from fundamental principles. I feel the DNN copyright issues will be ruled on from a purely pragmatic position: what's the legal status that upends existing markets the least, and/or is most favored by lobbyists.

--

[0] - I believe it's actually fundamentally the same thing.


> Indeed, but learning is effectively a form of lossy compression

Yeah, that copies in the brain of a human who has experienced the work, and not reduced to any other media, are neither “copies” nor “phonorecords” as covered by US copyright law is pretty clear, and that this does not apply to data stored by clever algorithms in computer storage is also pretty clear, so I’m not sure what your point is.

If you want to argue LLMs are people to escape this, then, sure, copyright stops being a problem for their training data, but it still is for their output, plus you end up with a whole set of new legal problems with using them as people do now, starting with 13th Amendment problems.


> that this does not apply to data stored by clever algorithms in computer storage is also pretty clear,

I'm sorry, this is far from "clear", otherwise we wouldn't be having this discussion. Fair use is a thing. Does fair use apply in this situation? No one knows at this point.


There's a pretty big jump from "machine learning is not inherently different from human learning" to "machine learning models are people".


There's not a big jump from “human learning isn't treated as a copyright violation because it happens in people and not in external media” plus “machine learning should be legally treated like human learning” to “your argument depends on treating ML models as people”.

The fact that people making the second part of the first argument often either are ignorant of or deliberately ignore why human learning is treated the way it is doesn't change the essence of the situation.


That applies to both sides though: people arguing that machine learning should not be treated like human learning often justify it with "ML models aren't people, they're software on a computer". But that argument is also missing the point, and invites this kind of philosophical discussion. The essence of the matter is that there is no high-level principle involved - humans get special treatment because humans write the law. So we can legally round all machine learning down to "zipping a folder full of JPEGs" for pragmatic reasons, and get away with it until we actually create sentient AIs and recognize them as people.

(Which, knowing history and observing how we treat animals, will happen only when the sentient AIs coerce us to recognize them, through violence or the threat of it...)


> The essence of the matter is that there is no high-level principle involved - humans get special treatment because humans write the law.

“The purpose of law is to advance the collective interests of humans in a society” is, I would argue, a high-level principle, and (with variations as to whether it applies to all humans or some subset, and which interests are considered privileged relative to others) nearly universal. Yes, the fact that humans are writing the laws is a reason that principle is chosen, but the idea that there are no high-level principles is just false.


>“your argument depends on treating ML models as people”

They would be "legal people" in the sense that corporations are "people". Some jurisdictions have even granted personhood to non-sentient objects, such as rivers. There's no reason to get hung up on the exact word "person" here.


>Otherwise, you'd have to argue that software reading the text contents of a book is copyright infringement.

How so?


Because that's all the AI model is doing. It's reading the books.

How an entity (human or not) changes after reading those books is not in the purview of copyright.

It has not shared them (copied). It has read them, learned from them, and changed itself in response.


In order to present a book to an ML algorithm, you need to copy it, either using a camera or using other means (e.g. as a file). In the USA that copying might be fair use; it's almost certainly not allowed under UK law, AIUI.

Copying a webpage into a cache that allows presentation of the page to a user is only allowed because it is part of rendering the page to a user. Even if a computer only copies two words at a time from a source text, if it copies a substantial part overall then it still copied.

Honestly, I don't think slurping data to train ML models is allowed by copyright (though I probably think it should be, as long as any significant reproductions are then prosecuted as infringements [also, we should reduce copyright terms to ~7 years!]).

This is all my own opinion, unrelated to my employment.


> In order to present a book to an ML algorithm, you need to copy it, either using a camera or using other means

Almost all modern books are available as ebooks for the Kindle or otherwise. You aren't doing anything to the content itself that could be a violation.


So, you copy the Kindle book from a server to your local host to feed into your training algo (ie other means). Or you stream it piecewise into a buffer, still copying.


As long as you only train the network and never use it for inference, I suppose that's a reasonable argument. But a person is restricted in action by copyright regardless of the writing implement they choose to use. Whether they use a pencil or an LLM, they cannot freely reproduce copyrighted works in whole or in part, excepting some narrow conditions.


IMHO, an LLM is not "reading" a book in any way, certainly there is no parallel with how people read. It is encoding the book. That's how it can regurgitate chunks of that book later, including, in some contexts, by providing verbatim spans of text from the book.


Why is an AI model allowed to read a book without paying for access to it, but I as a human have to pay for it to read it?


The AI should, no doubt, pay to read the book!

In technical terms, these companies are trying to avoid the need to consider content licensing, which is a major violation of content creators' rights.


You aren't licensing it. You are reading it. A machine reading it is no different than a human. I mean it really shouldn't be. Should a visually impaired person using a screen reader have to "license" the book?


People come up with the most bizarre corner cases to justify freeloading on others' intellectual property.

Content is obviously being relicensed, as the models are not fully open.

On top of that, it is also being exposed as a pay-for-use subscription model without paying anything to the original authors. How is that fair use?


That particular transformation of the book does well on a fair use analysis. If the blind person were to use the screen reader to put the book on Spotify, it would almost certainly fail a fair use analysis.


But not all the works that have been consumed by LLMs are available to read without a license to do so.


You can check the book out at the library and read it. Someone has to buy it, but not necessarily you personally.


You can check out some books at libraries for free, sure.

LLMs are trained on tons of books and papers that are not available to humans for free anywhere.


Can you give an example of something that they are trained on that a human can't somehow read without buying?


Textbooks, for one example.

You are telling me that all the books in books2 and books3 can be acquired and read by myself completely free of charge legally without piracy?

If humans can get these books for free legally, why are so many people paying so much to buy them for their college courses?

I thought it's well known that LLMs are being trained on pirated content.


Textbooks aren't somehow banned from libraries. You'll find lots of textbooks in public libraries, school libraries, and other libraries where they can be read without paying for the book.

Now if you are saying they obtained the book from illegal sources, that is a valid argument. But the issue there is orthogonal to feeding the contents of the books into an LLM. It is legal for me to view and photograph the Mona Lisa. If I break into the Louvre to view the Mona Lisa, the issue isn't that I viewed the artwork. The issue is that I broke into the museum.


Yes but then the issue isn't whether copyrighted works can be used as training data, but how you can obtain those works to use them as training data. I agree that you can't justify copyright infringement based on the fact you are using it as training data. But I do not agree with those that say using a legally borrowed book as training data violates copyright.

But if you have an example of some types of books that it is impossible to borrow from anywhere on the planet please share. Textbooks are readily available from libraries.


No, they aren't banned, but not being able to find one at a library is not a legal justification for me to go and torrent it. It shouldn't be for a corporation training an LLM either.

It's widely known that some of the big LLMs were trained on book datasets that were acquired via torrents. And that those datasets do contain at least some books that are not available anywhere freely via legal channels.


There is nothing illegal about reading a book that you haven't paid for.

What's illegal is making and distributing copies of a book.

But the reader isn't in trouble for reading a book they don't own.


Really?

So I am free to torrent all the books that I want, and there is no legal action that can ever be taken against me?

If so that would be news to me.


If you start distributing torrents and copies to other people?

No of course not.

Creating copies and distributing them to other people is against the law.

That has nothing to do with what I said though.

What I said is that it is not illegal for a reader to read something that they haven't paid for.

Do you see how that is different from distributing copies of something to other people?


That's not what I am asking though...

I am asking about reading content that is not available through any free means, and pirating it in order to read it instead of paying for it from the author.


> I am asking about reading content that is not available through any free means, and having to pirate it in order to read it.

Reading it is legal.

Distributing torrents is not, though. That is the part of piracy that is illegal. The illegal part is when you create and distribute copies.

Copyright law has nothing to do with reading stuff that you have not paid for.

Instead, it is about the illegality of creating and distributing copies.

That's why it's called "copyright" law. Because it is about copies.

It is not "readright law".

No, there is nothing illegal about reading something that you haven't paid for.


I didn't know it was legal to download books from torrents that are normally only available from the author for a cost.

What about movies and tv shows? I can download them via torrents to view without paying as well, completely legally?

Nobody can ever take any legal action against me ever for doing this?

How though would the creator of the content ever get paid for it if everyone can read and view it completely free? Why would they go through all the work of creating it with no gain to themselves?


Torrents are a somewhat special case since, by protocol, downloading and sharing happen at the same time. So by running a torrent of pirated media, you are necessarily infringing copyright (there are hacked clients that report bogus information that allow you to download only without sharing, but they are easily detectable and ~nobody uses them).

As far as I know, nobody in this country has ever been successfully prosecuted for the mere act of downloading, or even having obviously-illicitly-acquired pirated media in their possession.

As to your final question, the actual answer is because people are lazy. It requires more effort and technical sophistication to maintain a movie file collection than it does to run Netflix. Nearly any piece of remotely popular content is available for free with a bit of knowledge, but it's less hassle to just subscribe to services. There is no world in which piracy actually endangers creators.


I should have used an example other than torrenting, one that doesn't automatically distribute back.

Also, I'd say there is a difference between being illegal and whether someone has been successfully prosecuted or not.

I don't understand, though, how piracy doesn't endanger creators. If a small-time creator creates something, say a video game, and everyone pirates it instead of buying it, how has that not endangered him? Simply because you don't think it's plausible that enough people would refrain from paying?


Pretty much. For the simple reason that the "legitimate" way is always easier and the people that pirate are always a minority as a result.


> Nobody can ever take any legal action against me ever for doing this?

They can if you create or distribute copies.

That part is illegal and will get you in trouble, and is why people get in trouble for torrenting.

Just reading or watching other people's content isn't the illegal part.


I think it's safe to assume that AI has to pay for the book as well, sooner or later, as shown in the deals OpenAI has made with content publishers such as AP [0]. The question is how much they should pay.

[0] - https://www.pymnts.com/digital-payments/2023/58percent-of-co...


That's not the argument being made, is it? If the AI is trained on pirated data, it's a completely different case than the underlying principle of AI meaning all generated content is copyright infringement.


But I thought they were being trained on pirated data...

Seems like a problem to me.


They're being trained on copyrighted data that was publicly accessible. The lawsuits focus on whether or not this is copyright violation, not the legality of accessing the material (AFAICT).


So the world is just ignoring the legality of accessing the material in the way that they did?

Man, I wish as a human I could use the defense of "but it was publicly accessible in a torrent" as a valid reason that I acquired and consumed some content.


Sorry, do you have some proof that they pirated content that the scrupulous copyright holders themselves don't? That's not the issue at hand and, AFAICT, nobody is accusing them of doing it.

Nobody is ignoring it, that would be a crime if they did, but since there's no evidence of it... you're being upset at your own hypothetical.


I'm really confused.

I see articles all over that LLMs for example used books3, which the creator himself has admitted came from torrenting from bibliotik, and which contains at least some books that are not otherwise freely available.

The issue that content creators have with LLMs is not getting proper attribution for their creations and source material that went into training these LLMs, and it seems pretty clear to me that some of the content that was used to train LLMs was not legally obtained and licensed to consume.


'we' who?


> We need to find a way to make sure that access to the data the models were trained on, and to the resulting models, is open and fair

Agreed. The irony is that all the current anti-machine-learning lawsuits will make it so only big companies with deep pockets will be able to continue to make & use them.


> they will soon be good enough that they will be able to replace the work of the people that produced the content the AI was trained on

Let's assume this is true. It then follows that if we train an LLM on all books that are out of copyright, the LLM would be able to replace the "work of the people that produced the content" that is just about to go out of copyright. Feed this body of work back into the LLM and you should end up with something just as advanced as what you get by feeding in works that are under copyright.

I'm not sure the assumption is true. But if it is true, using copyrighted material or only non-copyrighted material shouldn't result in a different situation.


So we need new legislation.

Edit: To be more explicit, in my view both the parent and the OP are right, and they don't contradict each other: We don't need new copyright legislation. We need new legal concepts.

I think both people are right: There is an overwhelming need to address the situation of artists and writers ("content creators"). Copyright is not being used because it's the right tool to do so, but because it's the only even vaguely plausible one. It's also clear that it's not sufficient.


We have legal concepts that fit. Ideas cannot be copyrighted. The particular expression of an idea can be copyrighted, but an idea itself is in the public domain. The ideas you or an LLM get from reading a book are in the public domain while the book (the expression of those ideas) is not.

In what way is that not sufficient?


I don't understand your argument. I concur with the GP (https://news.ycombinator.com/item?id=37780772) that there are legitimate issues around LLMs that need addressing. I say that current legal principles seem inadequate to address these issues. Your "counterpoint" seems to merely reiterate the latter. You seem to agree that these concepts cannot be used effectively to address the issues around LLMs pointed out, as what the LLM extracts is not copyrightable.

Do you simply disagree that there are issues worth addressing?

Also, typically the entity that carries ideas cannot be owned by a corporation. So there is a completely new relationship between property and ideas at play.


There is no legal concept for a machine to have ideas. I agree with you that the current legal framework is clear, but I disagree as to the outcome: without new legislation, the creation of an LLM on unlicensed content seems to be the creation of a derived work without permission of the owners whose work you are deriving from, so the creation of the LLM itself is copyright infringement.


But we already have a way to deal with derived works. Reading a bunch of books and then writing with the ideas you learned is perfectly fine. Copying a paragraph without citing it is not. Mimicking the style of authors you want to emulate is fine. Passing off their book as your own is not.

If creating an LLM is copyright infringement, then we could say that a human reading a book is copyright infringement.


But a human reading !== A computer "reading".

Why do people so easily and readily conflate the two? Is it just because it makes the AI hype sound more real, if you compare it 1:1 with a human brain?

Machines and humans are not equal, or even in the ballpark of being similar, in any way other than superficial ones, and should be (and already are) held to different standards, both legal and otherwise.


And thus copyright is not a good framework for LLMs...

It's obvious that this technology was not considered when developing current legal concepts, so if the concepts produce unsatisfactory outcomes, they need to be reevaluated.


Technically, the money comes from people who pay to use AI (customers), or alternatively to buy art from artists.

To the extent they are out-competing artists, it's diverting a money flow from artists to somewhere else. But a lot of that money might be from people who would never have bought that art to begin with?

In any case, the money never belonged to the artists. It never got to them. The people it actually belongs to (the customers) get to make the decision about what to buy.


I think I agree but that never stopped the trade associations from coming after pirates for lost revenue. Then again, that wasn't really a winning strategy since they spent more on legal fees than they ever recovered from broke college kids and single moms.


Just tax the use of the technology and give back to humanity in terms of less work/ubi/etc.


I'm really curious about the political expediency of this - the right wing tends to hate UBI, but they also resent Big Tech and are probably not fans of algorithms taking their jobs, so maybe we can come to a compromise.


"Fair use" doesn't generally include taking the author's entire work (or for that matter, his entire lifetime's corpus).

According to WP, there are four factors to consider:

1. Purpose, including whether it's commercial.

2. Nature of the work (e.g. fiction/non-fiction, published/unpublished)

3. How much was used

4. Whether the use impairs the author's ability to exploit his work

====

1. The use is definitely commercial.

2. This use consumes anything and everything.

3. This use consumes entire libraries.

4. It can diminish the market for the author's own work (observe authors complaining that GPT simulacra of their work are competing with their own publications on Amazon).

Each of these factors is to be weighed by the court; no one factor is decisive. But, for example, if you use the entire work, factor #3 weighs against you. It looks to me that the four factors all weigh against OpenAI.


The models that are trained on these inputs absolutely do not contain all of the text that they're trained on. Even with non-quantized models, the final model is somewhere in the realm of 3% of the size of the training data.

Further, these models are not used to reproduce copyrighted works... they would be one of the least efficient ways to do so.

I also don't think there's a particularly strong argument that training on a book or image decreases the market for the original book or image. To take an example, zero folks that want to read Game of Thrones are asking ChatGPT to generate Game of Thrones for them instead of buying the book.

I think what's really happened here is that artists' work has utility outside of the originally intended audience, that utility largely comes from statistically analyzing the writing, and the debate is about whether or not the utility of those statistics requires additional compensation for the artists. That question seems to hinge on the intricacies of copyright law, but as I've stated before, I think the fair use argument is quite strong.


>I also don't think ... that training on a book or image decreases the market for the original book or image. ... zero folks that want to read Game of Thrones are asking ChatGPT to generate Game of Thrones ...

For now. But when you can generate work in the style of an artist without paying that artist, I would. More importantly, media production companies will, and so will use access to people's past work to put them out of work now.

We need to distribute the benefits; we're drifting towards copyright infringement being allowed for rich corporations (with the resources to generate "AI") as long as they abstract it through a transformer, or whatever.


Crypto was a decentralized ponzi scheme. AI will be decentralized mass copyright infringement.


Copyright laws are fascist and should be abolished


> But when you can generate work in the style of an artist without paying that artist, I would.

You can't copyright a "style".


> To take an example, zero folks that want to read Game of Thrones are asking ChatGPT to generate Game of Thrones for them instead of buying the book.

Why jump straight to one of the longest popular books in English? I'm sure plenty of people who would otherwise buy poetry anthologies are asking ChatGPT to write them a haiku. It's also entirely conceivable that ChatGPT can perfectly reproduce a poem in its training set - it can produce paragraph-sized quotes.


It was a reference to the copyright suit brought by the author.

https://apnews.com/article/openai-lawsuit-authors-grisham-ge...


Where does this 3% figure come from?


These models are generally trained on terabytes of data, but are usually tens of gigabytes large (or much less if quantized). The latest true open source model, Mistral 7B, is only 3GB (0.3% of a TB) when quantized.


I did a very similar analysis with Llama 65B, trained on 1.4T tokens (≈5.6T characters, assuming a token length of 4 characters), comparing with a quantized model size of ~38GB.

The 3% number was a conservative rounding of the same calculation, but retaining fp16 rather than quantizing to 4 bits.

Here's my original back of the napkin analysis:

https://news.ycombinator.com/item?id=36681440
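For reference, a worked version of that estimate, under the assumptions stated above (1.4T training tokens at ~4 characters, roughly 4 bytes, per token; fp16 weights at 2 bytes per parameter):

    # ~5.6 TB of raw training text
    train_bytes = 1.4e12 * 4
    # ~130 GB of fp16 weights for a 65B-parameter model
    fp16_bytes = 65e9 * 2
    # Ratio of model size to training text: ~0.023, i.e. the ~3% figure
    # quoted above; quantizing to 4 bits drops it below 1%.
    print(fp16_bytes / train_bytes)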


Copyright doesn’t apply to consumption of a work, only redistribution. For example, it is not a violation of copyright for you to make many copies of a movie for personal use. Similarly, it is not a violation of copyright for an artist to learn by looking at other works, even if they then compete against the original work. OpenAI’s system is much more similar to calculating statistics from public artwork than copying and redistributing artwork.


> For example, it is not a violation of copyright for you to make many copies of a movie for personal use.

This likely varies by jurisdiction, but in the United States, copyright law covers four exclusive rights:

1. The right to make copies

2. The right to distribute copies

3. The right to create derivative works

4. The right to publicly perform the work

Making copies for personal use absolutely violates the first of these exclusive rights.


Sure if you utterly ignore all of the fair use parts of US copyright law.


The law is not clear, otherwise it wouldn't have had to go to the Supreme Court to contrive the notion of "time-shifting" to allow for making personal copies.

https://en.m.wikipedia.org/wiki/Sony_Corp._of_America_v._Uni....


Sure, which is why saying it's absolutely a violation isn't a correct interpretation. So is saying it's absolutely not a violation. Even if the Supreme Court rules on the issue, it's not absolutely anything, because that ruling only applies until a future Supreme Court potentially changes its mind.


I understand how my use of "absolutely" could be misleading. I'll attempt to clarify.

My understanding of the reality:

* Violations of those four rights give copyright holders standing to sue over duplication.

* Fair Use is a defense against that violation.

What I was trying to get at in my first post above is that it is a violation of copyright law as the law is written, and because of this, you expose yourself to lawsuits when you make copies for personal use (assuming anyone could ever find out...I'm not arguing, and never have argued, that copyright law is correct or good). The fact that you can defend yourself using a Fair Use defense is also important and a good point, but it still requires hiring a lawyer, going to court, and winning. In my book, this is quite a bit different than saying it's not a violation.

IANAL, though I follow this fairly closely, and I'm not aware of any cases that have established that copying for personal use is Fair Use; I'd be very happy to read through a decision that establishes this!


> What I was trying to get at in my first post above is that it is a violation of copyright law as the law is written

Copyright law “as the law is written” says fair use is not a violation. 17 USC § 107: “Notwithstanding the provisions of sections 106 and 106A, the fair use of a copyrighted work [...] is not an infringement of copyright.”


Right, but courts are the only ones that can make a fair use determination.


That's not true other than in the sense that “courts are the only ones that can make a determination that an act would violate copyright law outside of fair use considerations” is also true.

It's true that fair use will often be somewhat harder to predict than other aspects of copyright law, because the statute and precedent in combination are less clear, but that's not a categorical distinction.


In the US you can be sued for anything by anyone so it's a question of degrees of risk rather than absolutes.


Yes, I think we agree.


> "Fair use" doesn't generally include taking the author's entire work (or for that matter, his entire lifetime's corpus).

If someone studies every one of Stephen King's books, studies King's style, studies his plots, etc. and uses the knowledge to learn how to be a writer, there is no need to invoke "fair use" unless they are actually copying King's books. Studying an author's corpus in order to be able to produce works like the author is not copyright infringement. It is just the way we learn...and the way LLMs learn as well.

If we want to say it isn't allowed because LLMs read too quickly, then we could do that. Maybe make them not read any faster than humans. But LLMs aren't doing anything different than humans, other than the speed at which they do it.


If you read the PDF, you'd know that OpenAI admits they ARE actually copying his books, in order to train on them, and they are arguing that this is okay because they don't want to have to obtain a legal copy of the work, because that would be too expensive.

They are literally arguing they should be allowed to illegally download stuff so long as no human reads the downloaded work directly.

To further your analogy, it would be like your prospective writer objecting to being arrested for stealing the books from the bookstore because he just wanted to become a better writer by studying them, not read them for entertainment purposes.


If you are saying that they illegally obtained King's book, then the issue is the fact that they stole the book from somewhere. I don't disagree with your assessment there. But if they checked the book out from the library and let the LLM read it, the fact that it has read it shouldn't be any different than if you had read it as a human.

If I break into the court house to let my child see the artwork in the public areas, the issue isn't whether my child can see, learn from, and copy the style of the artwork. The issue is that I broke into the court house instead of going in the front door when it was open to the public.


> if they checked the book out from the library and let the LLM read it

Did you know it's actually illegal to photocopy books you check out from the library? It's an illegal copy.

That aside, the analogies between humans and ML models fall apart when you actually get into the details, which is why AI companies don't want that.

Human brains don't have RAM and disk storage that keep direct copies of works. Computers do. Human brains are not entirely deterministic calculators. Computers are.

If you have all the state data of an AI model, you can know exactly what it will produce, and the input data is a necessary component of that state data, so it's not transformative in the same way as a human brain in the first place.

2 humans will not create the exact same output given a shared starting state in the way that 2 computers (AI models included) will.

The only reason people falsely believe AI models are different is because no one actually traces through that state data, and there are enough pseudo-random variables involved and state-changes that they create the illusion of non-determinism, but that is, in point of fact, an illusion.


You are assuming that an LLM is saving a copy of the book it reads.


The LLM is the output of the training system; the LLM doesn't do the reading, the trainer does. The LLM doesn't "contain" a copy of the book.

But some LLMs seem to be able to quote on demand from a wide range of literature; these models evidently do have some kind of memory, even if it's been distilled and blended and distilled again.


No, I am actually reading OpenAI's PDF linked here, where they say they make a copy of the online corpora to train on, right at the top of page 2:

> By analyzing large corpora (which necessarily involves first making copies of the data to be analyzed

Whether they retain that later is immaterial to whether copyright infringement occurred. Deleting a movie you illegally downloaded after you watch (train on) it doesn't make it legal.


I agree with you. What I'm trying to differentiate is between the way you obtain a book and whether training an LLM on a book that you own the rights to read should be considered copyright infringement.


Unless the license you have for the book includes rights for commercial usage, it should be illegal to train a model that is being used for business, imo. In terms of individuals training models for their own personal use, I think that becomes much more tough to answer, but that's not really cropping up in the legal fights.


LLMs don't read, they are trained. Yes, training involves saving a copy of the training corpus.


This reminds me of the Google Books lawsuit back in the 2010s. Google didn't ask permission from every copyright holder but was still allowed to scan books due to fair use.


The big difference is that Google legally obtained the copies of the books they scanned. OpenAI did not. Publicly accessible != legal to download/copy.

The copyright infringement in Google's case was not about illegally obtaining the original work, it was about the snippets they provided.

In OpenAI's case, it is also about the illegal obtainment (downloading) of the original work, which they admit to in their brief, but argue shouldn't be illegal if they intend only to transform it. That element was not present in Authors Guild v. Google.


This very lawsuit is referenced in OpenAI's memo as precedent

> Perhaps the most compelling case on point is Authors Guild v. Google. There, search engine Google digitally scanned “tens of millions of books” without their authors’ permission for inclusion in a searchable digital database.


Yeah, I said that.

You're not the only commenter that has pointed out to me that the model doesn't contain the training corpus; that the model is analogous to a human writer who learns to write by becoming a voracious reader.

I acknowledge that, I think the analogy is fair. The problem is that the memory capacity of the model appears to be humungous; I've never played with ChatGPT, but I understand that it can quote accurately and at some length from a wide range of literature.

By the model, I mean the model behind ChatGPT. I suspect this "memory" amounts to having unimaginable numbers of parameters, i.e. something like overfitting. I realize that's just one LLM, and there are other models that are neither generative nor language models, and that have to be much more restrained in what they consider a "parameter".


> Studying an author's corpus in order to be able to produce works like the author is not copyright infringement.

You didn't finish the analogy.

Not just produce "works like the author" but being able to regenerate those original works word-for-word.

Is that copyright infringement, if I have the entire Dark Tower series memorized word-for-word, and reproduce passages or entire chapters on demand for others? It's just the way some people learn.


Well yes. If you reproduce passages from a book (from memory or otherwise) and present it as your own, that is copyright infringement. We have lots of well established law about how much things can be similar without infringing.

LLMs that are just spitting out full copies of what they were trained on aren't functioning correctly or haven't been trained on enough content.


> LLMs that are just spitting out full copies of what they were trained on aren't functioning correctly or haven't been trained on enough content.

So all of them? None of the LLM's are "functioning correctly"? Because I can get every LLM to reproduce passages from a book.


The copyright violation, if there is one, is committed by the person who extracts data from the model and uses or distributes it in ways not permitted by law.

Not by the person who feeds data into the model... and not by the model itself, for merely retaining an unrecognizable, unusable, highly-compressed and -transformed representation internally.

Seems simple enough. Any other judicial outcome will either halt the process of human intellectual evolution, or transfer a staggering amount of power to people who DGAF about US copyright law. So be careful what you wish for.


>Like, you can make a transformer output some part of its training data - but not an entire work

There have been many examples of getting large verbatim passages of text, or large image regions, etc., out of AI tools.

I don't think you need to use the entire work to infringe. Certainly lawsuits claiming use of part of the work have had the rights holder prevail.


You can also get “the whole work” by asking another human to recite the lyrics to a song or draw the Finder logo from memory.

What’s really happening is that AI models have much better memory than humans and are more precise in their output. It would be stupid to try and “dumb down” AI models because they’re better at remembering some licensed content.

In spirit, it’s still fair use.


> What’s really happening is that AI models have much better memory than humans and are more precise in their output.

And yet, presumably we agree that a simple file server that serves up exact copies of copyrighted work does constitute copyright infringement. What's the difference? You could also say "what's really happening is that the file server has much better memory than humans." Duh!

It sounds like you're saying that, because an AI model is a very convoluted and sometimes inaccurate way to implement a computer system that sometimes serves up exact copies of copyrighted works, it's not copyright infringement when that computer system does serve up an exact copy of a copyrighted work. I'm not quite understanding the argument.


Two things:

1) I’m generally copyleft and would argue that copyright as we know it is nonsensical in the digital era and needs to be entirely rebuilt to make any sense whatsoever. And that as is it harms the commons more than it protects creators. So yeah a fileserver has always been a game changer just like the printing press was and we’re far behind as a society, legally.

2) When a computer system does reproduce copyrighted content verbatim, it’s infringement the same as if a human did it from memory. That wasn’t my point. My point was that use of copyrighted content to train the model is fair use because it’s no different from a human consuming the content and committing it to memory.


Copyleft licenses are a form of copyright. Unless it is some other meaning I am not aware of with left-right word play.


On one hand, copyleft licenses are a creative form of copyright to enforce the wishes of the author to allow derivative works under the condition they are also distributed freely. Let's call it weak copyleft, the pragmatic variety.

OTOH there are the copyright abolitionists, who are offended by the notion that their freedom to copy and modify code on their own harddisk could be restricted by a mere "license"; to them the notion of intellectual property is poppycock. The Strong Copylefties consider the GPL a necessary evil, a way to use their enemies' tools against them, to spread their ideals of free culture amidst a corporatist hellscape.

Or it could merely be someone taking the naïve reading of "anti-copyright"


But the human that consumed the content paid for it, didn't they?


What if they heard it on the radio? Or read it in a library?


The model didn't hear it on the radio or have to access a library; it was directly fed raw copyrighted media. There are no such nuances here.

The act of getting heaps of copyrighted media for commercial use without paying clearly breaks copyright.


So if I hear a song on the radio and it inspires my commercial purposes, then what?

Point being: whether a work is used commercially is not relevant. It's common that we think it is, but it's not. I first read about an LRU cache in my operating systems textbook and later used the concept in a commercial work. I have not committed copyright infringement.


Then nothing - you didn't gather massive amounts of raw copyrighted goods and consume them for a commercial goal without paying.

These LLMs are not trained on the radio, especially not exclusively.

Edit0: Did you get access to this book that inspired you legally, as in at a library, during a class, or having bought it yourself? Was it fair use?

Because none of those look like getting heaps of copyrighted stuff for free and claiming you didn't.

If instead you stole the book - then yes, this is similar. I don't care what you do with it - you can't steal it and claim you are respecting copyright.


I am a product of the impressions left by massive heaps of copyrighted content. One song on the radio is just a rhetorical device.

If OpenAI rented all humanity’s media from a library and used them to train an AI model then that seems 100% ethical to me.

Now if you ask the model to recite the script to Breaking Bad and it does so perfectly and I think that grants me copyright authority over it then we’re going to have problems. It’s just not the model or tool’s problem.


Did OpenAi rent it? Or did they get it without respect for copyright?

How about you?

This is the point


You’re lost in the weeds. I know that’s the point; it’s why the whole song-on-the-radio thought experiment got brought up. The question was: if an AI model trains on public radio waves and hears a copyrighted song, is that infringement? My position is no, it’s not, because the radio station had a license to broadcast that song on the radio.

Similarly, if all the books used to train a model are available in the library, so long as someone rents the books, they can be used to train a model.

The question was directed at you. I don’t know why you’re repeating it back to me like I didn't know what I was asking…


Feel free to chat with a lawyer on copyright


The file server is only infringing when it serves those files. Photoshop itself isn't infringing just because someone recreates a famous art piece in it; it's the end user that is infringing. The difference between server-driven vs user-driven creation shifts responsibility.


Photoshop doesn’t provide you with copyrighted materials to work with when prompted or require ingesting other copyrighted works to work optimally. Language models do both.

The courts are going to rule in favor of these authors if they have a basic understanding of what’s happening.


> Photoshop doesn’t provide you with copyrighted materials to work with when prompted or require ingesting other copyrighted works to work optimally. Language models do both.

They actually are not that different from Photoshop. As for providing you copyrighted materials: if you instruct it to generate something someone else has already copyrighted, perhaps by using a feature meant to reproduce an existing art style, you will generate an infringing work.

As for "ingesting of other copyrighted works to work optimally", you don't know what goes into designing and building Photoshop - how many third-party datasets or copyrighted assets, which get embedded deep into the application in a form that the end-user cannot discover or consume. You don't know, and it doesn't matter, because Adobe using copyrighted materials in building Photoshop does not propagate copyright claims to you the user/renter of Photoshop. Same can be argued about LLMs - copyrighted inputs from training set get sufficiently blended when turned into weights that you, the end user, should be shielded from any IP claims related to the training data.

(Yes, the last point enables "copyright laundering", but I'm not convinced this is a problem - not compared to regulatory environment trying to prevent it.)


My point is simply that these models should not ingest copyrighted materials without paying the authors or publishers. Litigating end-users who are using LLMs that have ingested pirated copyrighted content would be so complex as to not be worth it except for large businesses. If I want to ingest/read a book I need to pay money, but if an LLM does it they’re free to pirate the book? Why?

And why is it that I pay OpenAI to generate data based off books it stole for free, when I had to pay money for the same book?

I know what I am actually paying for is the model obviously but it just feels extra wrong to be paying a company for a service it’s built using pirated content.


>If I want to ingest/read a book I need to pay money

Or you just go to the local library. Or you borrow it from a friend. Or you stand in B&N and read it in the store.

Your arguments are not great.


> Photoshop doesn’t provide you with copyrighted materials to work with when prompted or require ingesting other copyrighted works to work optimally. Language models do both.

Photoshop now has generative AI features that leverage language models as well as training on imagery, so this is literally false when discussing Photoshop as it currently exists.


What about a file server which hosts only encrypted files, which are unusable garbage on their own, that happen to turn into exact copies of copyrighted work when supplied with the right decryption key? That's user-driven creation, right?
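
For concreteness, a minimal sketch of that setup (a one-time-pad XOR; the names and key handling here are purely illustrative assumptions, not anyone's actual system):

    import os

    plaintext = b"Call me Ishmael."   # stand-in for a copyrighted work
    key = os.urandom(len(plaintext))  # held by the user, not the server

    # What the server hosts: unusable garbage on its own.
    stored = bytes(p ^ k for p, k in zip(plaintext, key))

    # "User-driven creation": supply the right key, recover an exact copy.
    recovered = bytes(s ^ k for s, k in zip(stored, key))
    assert recovered == plaintext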


I would say it’s reproduction of an exact copy that represents copyright infringement, not dissemination of the digital brain that remembered it.

If and when someone tries to profit off an AI work that would be copyright infringement if a human had made it, it should be copyright infringement if an AI does it also.


I'm not disagreeing with AI model training fair use, but this isn't the argument for it.

"(new tech) does the same thing as humans, just better" has never been a valid defense. It's like saying a human could explain the plot of a movie and draw the scenes, therefore it's okay to bring a camera into a theatre and record a movie and distribute it. Or that a human can hear a conversation and remember what was said, so there's no distinction between that and recording the conversation using a phone.


But an AI model doesn't record the original work verbatim with the goal of directly reproducing the original.

Aside: you can bring a camera to a theater and record a movie and use it in a transformative work. And a human could still be liable for damages if their hand-drawn performance of Star Wars detracted from Disney’s revenue. I’m not saying I agree, just stating the status quo.

Training a model uses the work only to calibrate weights that govern entirely independent output. The fact that it can recall exactly in some cases is a secondary effect of the technology.

Anyway my argument is that “ability to reproduce verbatim a copyrighted work is not a valid characteristic when determining whether something consumed the work fairly”.


> Anyway my argument is that “ability to reproduce verbatim a copyrighted work is not a valid characteristic when determining whether something consumed the work fairly”.

I agree with this. I only disagree with the assertion that AI or $newtech "doing something humans already do but better" has any legal importance. There are many existing laws which apply only when using technology. It's legal to drink and run, but not drink and drive, even though they both get you from point A to point B and cars just do it faster.


> You can also get “the whole work” by asking another human to recite the lyrics

Is that human earning money by reciting the lyrics? If so, that's already not allowed. It's called a performance.


Being paid is not what makes it into a performance. Having an audience, and the purpose of the recitation, are what make it performance.

If I pay a babysitter to look after my kid and they sing the child a song to get them to sleep, it’s not an infringing performance.

Even if you are paying ChatGPT to answer your questions, if you ask it to tell you the lyrics of a song and it does so, that is not necessarily infringing.

If I am preparing a legal brief for a copyright case, and I pay a paralegal to transcribe the lyrics of a song, and they do so and send them to me in an email… is that copyright infringement? It seems very unlikely.

I just can’t come to any position on LLMs other than that the users of the LLM have to be held responsible for how they choose to use the output, not the LLM provider.

LLMs need to be aware of the content of copyrighted works in order for them to be able to fully and comprehensively communicate with humans who are immersed in and aware of the content of copyrighted works.


If the LLM is recognized as infringing, could any book publisher sue HP when someone prints a copy of their book through HP Smart?


That's only half of it - the half that's been litigated via Xerox and Betamax: no, the manufacturer is not liable for what end users do with their product.

But what Xerox and Sony didn't do to build their machines is pirate everything they could get their hands on as a part of the manufacturing process.


Who says OpenAI pirated it? Unless the content was pirated in the first place simply showing it to an LLM is just like letting your friend borrow your book.

When Google crawls websites to build a search index, we don’t expect Google to pay royalties… all these analogies at least demonstrate that copyright is impossible to apply consistently and our notions about what’s fair are wholly subjective.


> Is that human earning money by reciting the lyrics? If so, that's already not allowed.

This is a common copyright misconception. Whether or not you earn money from it doesn't determine whether or not it's a performance of the work.


True, my point was that the LLM is performing for the public. Thanks for the clarification.


Covers are generally transformative, right?

Edit: I guess even if they are, SCOTUS recently decided that even transformative works can infringe if they compete commercially. So the question is not “did the reciter make money performing”, it’s “did the reciter’s performance detract financially from the original artist”.


Song covers are also a special case with something called a "compulsory license", where the copyright owner is required to license it to you - it can't be denied. You just do it and pay them preset royalty rates.


That is only if it is recorded. You can't get a "compulsory license" if you are going to perform that cover in front of an audience. That you have to secure from the rights holder.


No, covers aren’t considered transformative. I think a great deal of this discussion comes from people not understanding copyright.

https://attorneyatlawmagazine.com/public-articles/intellectu...


I personally think our law regarding covers has been heavily influenced by record labels and is wrong about them, in practice. I’ve already heard the original enough times that if I want to hear a string quartet perform Viva La Vida it’s because it’s a new refreshing piece of art. The whole compulsory license thing indicates the law got it wrong. Anyway…


The song is distinct from an artists performance of that song, but let’s move outside of music to books etc.

Suppose A creates an epic poem, B does a poetry reading aka a performance of A’s poem. C records it, then plays that recording back in public. D reinterprets the poem making a new public performance. B can successfully sue C but not necessarily D if it’s sufficiently distinct. A however can potentially sue D, C, and B if none of them got the rights. [Substitute A making a painting and B making some needlepoint copy or whatever and the same principle applies.]

This is why J.K. Rowling got paid by the people making Harry Potter movies; she could sue if they didn’t pay. Trademarks may also be involved, but even without that, if you want to make an MMORPG based on The Dresden Files or whatever, be prepared to fork over cash. Unless you follow the Disney approach and use public domain works.


'Covers' require the payment of ASCAP fees -- usually done by the facility, not the performer, but if the facility does not pay, the performer can be liable. You even need a license to put a jukebox in your bar, so that argument doesn't hold up.


The AI tools are often representing the output as something their customers can use without restriction. I'm pretty sure that wouldn't work in your analogy. If I'm an agency and a customer asks for jingles, can I recite large parts of lyrics of copyrighted songs for them to use...as if I made them up?


This is not a new phenomenon.

Google image search produces copyrighted and restricted-use images. On clicking an image it includes a little caveat warning “Images may be subject to copyright. Learn More” - but no specific attribution or copyright claim. It’s possible that if you go to the source where Google found it you’ll find the attribution there, but it’s also very likely you won’t.

If an AI tool just says ‘this might be subject to copyright’, is it all good?

Words to that effect appear, for example, in the GitHub copilot terms and conditions.


There are people claiming everything coming out of these tools is transformed and wholly unencumbered.


Yeah I’m understanding the nuance much more now. There is a difference between: is it okay to use copyrighted content to produce OpenAI’s product, and is a verbatim reproduction of a poem fair use.


That’s a naive ‘what color are my bits’[1] mistake - classic software developer mindset.

Level 1 programmer naïveté is just ‘bits are bits, it doesn’t matter where they come from. Bitwise identical things are indistinguishable’.

Level 2 naïveté is when you accept that bits have color depending on how they came to be arranged thus, and that there are processes that get rid of the old color on some bits, and replace it with a new one. But then you figure - like a programmer - that if you compose that process with some other process you can get rid of the colors you don’t like.

Enlightenment is realizing that the law cares not one jot for the specific processes you apply to bits or their colors but criminalizes (or at least proscribes) particular actions and cares about things like intent.

[1] https://ansuz.sooke.bc.ca/entry/23


How is this any different than something like Photoshop? You can recreate (and therefore copy) a piece of art and it's infringement, but not on the part of Photoshop. Yet, Adobe is still well within the right to say you can use what you create with Photoshop. Why can't AI tool makers have the same claim?


AI tools themselves may be the infringement. Photoshop is not.


"You can also get “the whole work” by asking another human to recite the lyrics to a song or draw the Finder logo from memory."

You seem to be speaking as if this somehow would cleanse the copyright status of the work in question, but it wouldn't. If you memorize a book, or a friend does, and you or your friend recite it to someone who transcribes it, the result is still copyrighted by the original entity, and if you try to sell the result, you'll be on the hook for copyright violation. This would do nothing to the copyright status whatsoever, so whatever argument you're trying to imply doesn't hold.


The only difference between the two cases is which human violated copyright. If you ask a musician to play a cover of a famous song without the requisite royalties they violated copyright. If you instruct a machine to do it you violated copyright. Machine has no thoughts, head empty, does not know what copyright is, does not know abc's.

Which is also why machines can't create copyrighted works either. The standard example is that making a machine to generate random images doesn't copyright them, but if an artist chooses some of them because they look good then he may copyright them.


> In spirit, it’s still fair use.

Spirit crafted by people who couldn't even begin to imagine LLMs. The proper answer here is new laws clarifying the stance of copyright and LLMs/generative models in general, not trying to reason like a 19th century person about 21st century tech.

The EU's approach is much more sensible - there is this new thing with vast ramifications, let's sit down and see what legal framework is needed for it.


People know when they are reciting something they have memorized. Do language models not remember having memorized what they're writing?


How large? Copying some verbatim is still fine under fair use.


Copying entire works is fine under fair use in certain circumstances, too. (See, e.g., time-shifting.)


For written text? Not aware of any examples, can you share any?

(To be clear by copying I mean repeating the text in your derived work, perhaps I should have been more explicit. The crux of this case is whether “copying into training set” counts as copying the whole work. )


The first example that comes to mind is reproducing an entire copyright-protected article, either as a single copy for personal use by a researcher or as multiple copies for classroom use.


That's an interesting example because it illustrates how fair use is in part defined by its effect on the market.

A teacher copying a New York Times article verbatim and distributing it to a classroom for the purpose of analysis or criticism is likely to be held by the courts as fair use. On the other hand a teacher photocopying an entire textbook (or distributing a PDF) would probably not be considered fair use.

Both are copying educational material in full, but they have a different impact on the market. Distributing copies of an old news article to 30 students realistically isn't a direct substitute for 30 subscriptions to The New York Times. But a textbook publisher's business model specifically revolves around selling its textbooks to students - the teacher is distributing their copyrighted material to their target market and removing any incentive for the students to pay the publisher.


Yes, the damage to the market of the original is a key factor in a fair use analysis.

It makes you wonder why "Open"AI started getting into legal trouble only when they closed everything down and became a commercial project. When they were a research project, stepping on copyrights was a lot more acceptable (both socially and legally).


Right - fair use was found in Williams & Wilkins Co. v. United States (decided by the Court of Claims and affirmed by an equally divided Supreme Court). You can walk into a library and make a wholesale copy for research purposes. The case involved a medical journal publisher suing medical doctors, who made a strong argument that their ability to make copies was a matter of life and death. Winner: Xerox.

https://en.m.wikipedia.org/wiki/Williams_%26_Wilkins_Co._v._...


Have you heard of 'quotations'?

Plenty of works quote and cite other copyrighted works, and it's considered transformative because usually the work is building upon or commenting on the quote.


The context was copying entire works in this particular leaf of the main thread.


Yes, and plenty of people write criticisms of entire song lyrics or poems, effectively republishing the entire thing.

This is actually a huge area of debate on YouTube, as Google often errs on your side of the argument (they can choose to demonetize whatever they want), even if the work is technically not infringing.


Copying verbatim passages can be fair use if it is transformative.


Outputting unattributed copies of large passages of text to end customers, with the implied okay to use it any way they want, though... that's what these tools do.

The end user often has no idea they just received something with potential IP issues.


Does the model even have enough information to be able to _know_ that though? If it's simply using numerical probability to tie a series of tokens together into text, that doesn't store enough information to understand that _this specific sequence of tokens_ represents some specific copyrighted work. Storing that information would _actually_ seem to fall afoul of copyright.

The fact that it can spit out chunks of copyrighted works is driven by the input token sequence and the model weights pointing to a specific path that has an ever so slightly higher probability of being the expected output, right? It's not like the model stores the copyrighted work directly.

(Yes, I know the algorithms are more complex than what I expressed, but the general idea holds in my understanding)
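
As a toy sketch of that idea (a bigram counter, nothing like a real transformer, and purely illustrative): a model that stores only next-token statistics can still emit a training sentence verbatim when decoding follows the highest-probability path.

    from collections import Counter, defaultdict

    corpus = "call me ishmael some years ago never mind how long".split()

    # "Training": count how often each token follows each other token.
    transitions = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        transitions[prev][nxt] += 1

    # Greedy decoding: always pick the most likely next token.
    token, output = corpus[0], [corpus[0]]
    while token in transitions:
        token = transitions[token].most_common(1)[0][0]
        output.append(token)

    print(" ".join(output))  # reproduces the training text verbatim

No string of tokens representing the text is stored as such; the verbatim output falls out of the per-transition statistics.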


The model doesn't know anything - people personify LLMs too much. It is a mathematical text predictor that has almost certainly ingested the text it is copying verbatim to string together the words it is reproducing. The fact that it is a highly compressed representation of its training corpus (and thus doesn't "know" that it is copying something) is not an excuse.

I think you could make a good argument about this if you could prove that the text being spit out verbatim is _not_ contained in the training corpus, but that is not the situation we have today.


> The model doesn't know anything - people personify LLMs too much.

Perhaps reading people's posts a little less literally would help the conversation. I obviously know that the model weights don't 'know' something in the sense that a human knows something, but the model does store information. That information happens to (primarily?) be the statistical likelihood of one token following another token. What it doesn't store is a string of tokens that represents Sarah Silverman's latest book.

From what I can tell, all this angst comes from 3 or 4 related, but different, issues.

1. Did companies break copyright laws when assembling and using training for these models?

2. Does the model represent some form of copyright infringement in and of itself?

3. Does a model's ability to output chunks of copyrighted work have some implication of the legality of the model itself? (using said copyrighted chunks is already a solved issue)

4. Do we as a society owe it to humans benefitting from copyright the continued ability to create copyrightable content without competition from ML models?

I think comingling all of those points is doing everyone a disservice.

My assertion was only about #2 and none of the others. I feel like it's a clearly demonstrable situation that these models _don't_ infringe copyright directly. That being said, I am obviously not a lawyer, and my opinion is just that, an opinion.

FWIW, my general feeling on all of the points is: 1. Quite Likely (but fair-use is a fickle thing), 2. No, 3. It shouldn't, and 4. No, but we need to think through the long-term societal implications of ML decreasing the amount of human labor needed across all markets and come up with a plan that doesn't involve our fingers in our ears.


I would assume that #2 is actually "yes" given the way derived works work and the fair use tests, and that #1 and #2 are actually very much linked. Fair use is a defense to copyright infringement, and it's a relatively complex balance of factors. It's relatively inarguable that the model is a derivative work of its training set (even OpenAI isn't arguing otherwise): they are just arguing that they have fair use rights to the contents of the training set. Only one of the factors in a fair use analysis is how transformative the use is, and I think it's hard to argue that training an LLM isn't a huge transformation. However, the other factors weigh pretty heavily against LLMs here, and the Authors Guild lawsuit is a pretty good set of arguments as to why. It's up to a court to decide whether the transformative nature outweighs the other factors.

If you're lumping the fair use question into #1 and the "is it a derived work" question into #2, I'm pretty sure that nobody on any side of this is arguing that it isn't a derived work. Once you have a derived work, you can either ask whether they were licensed to produce that work (no) or whether it was fair use (possibly).


On a purely technical level, I feel very strongly that the model is not a derivative work any more than a dictionary is a derivative work. The model is a recording of token stream probabilities. The training data is, in my mind, where the conversation about derivative work and fair use comes into play. Again, I'm not a lawyer and have little exposure to the legal profession, so my opinions are my own.


There's a big difference between a model and almost every other form of compilation of human knowledge, which is when the human intervention occurs. A dictionary is a collection of definitions from humans that is derived from their experience reading texts. Human knowledge and experience moderates the flow from the text the human reads to the text in the dictionary. LLM training has no such moderation, and that is a significant factor in whether something is a derived work or not.

This is really all up to the courts, and I don't think anyone is confident about how it will shake out. However, the fact that OpenAI isn't even trying to make the argument that an LLM isn't a derived work of the training set (they are going straight to fair use, which is an acknowledgement of infringement) suggests that this is not actually contentious.


We might also consider that the LLMs will take traffic away from original sources, much like Google Search showing the relevant passages that answer your question. It's not always about outputting the exact passage, though this point may be tangential to the post & case at hand.

My primary usage of LLMs fits this pattern. I use Bard for simple programming questions, typically something I would have previously searched for via docs/reference/SO. Now I don't have to filter the results or skim the pages for the right section; it's just given to me, and I can ask/make follow-up questions or corrections. This is what makes them much better than vanilla search: the ability to iterate and refine.


I think in some ways LLMs have shown people want simple answers to their questions when they pose it to a search engine.

Stack Overflow was good because it had answers (one chosen as right) and a little bit of discussion about why and how and other considerations.

Much of the web is verbose to an extreme to get the search engines to rank it highly. People don't want that, they just want answers.


This is exactly right, and hopefully the LLMs can get better at adding in the

> discussion about why and how and other considerations

If (when) they know when/how to ask better follow-up questions, that will take it to the next level.


> We might also consider that the LLMs will take traffic away from original sources, much like Google Search showing the relevant passages that answer your question.

That said, Google has motivation not to overdo it, as millions of websites are part of their ad network. If people stop visiting the sites, they lose money too.


Paradigms have changed; chat will become a primary modality for search in time. This is why Bing and Goog are both finding ways to get ads into our chat experiences. We might be forced away from ad dollars being the primary way to pay for the internet. With data privacy, ideally content-based ad placement over individual targeting (think magazine-style ads), and the move to paid access to sites or content... there are trends that may combine to break the current foundations.


I'm all about getting rid of copyright.

This is...bullshit.

> patterns inherent in human-generated media

That's what you hope your model is, not what your model literally is!

> The aim of this process

That's a lovely thought. Now let's get back to talking about the process itself.

I'm not interested in treating LLM tech like a black box. That's intentional stupidity.

OpenAI's entire argument hangs on not understanding how its tech works! I can't see any difference between that and fraud.


>OpenAI's entire argument hangs on not understanding how its tech works! I can't see any difference between that and fraud.

This is how I see it as well.

If you can't explain, in detail, how your mechanism is not violating copyright then you have no defense.


That's not how law works in the US. You are presumed innocent and there must be a preponderance of evidence of guilt.

If they can't prove the AI model contains or reproduces exact copies of their copyrighted work, there's no case to be had.


No one said otherwise. There is already a case against them, their defense is that they don't know how it works, so it can't be true. This was clearly laid out in the comment I replied to, and the comment that reply was made to in turn.


But we can. There are countless documented instances.

OpenAI describes these as "limitations". A more accurate word would be "features".


Is this true? As far as I am aware there are zero pixel-for-pixel recreations by an image-generating AI of any pre-existing images, and no written work has ever been word-for-word reproduced in full. The closest I've seen are images highly similar to the Mona Lisa, which is already copied virtually everywhere without AI, and brief passages from commonly copied written works that veer into hallucinations after a paragraph or two.


Why does it need to be exactly the same? You have shifted the goalposts to absurdity.


Because the parent comment says "But we can" in response to the claim that AI can't generate "exact copies." Generating exact copies would be a lot more legally notable than outputs that are only similar or partially copied. As far as I know exact copies haven't happened, and I was legitimately asking for cases where it has.


> That's not how law works in the US. You are presumed innocent and there must be a preponderance of evidence of guilt.

True for criminal cases, but this is a civil case, no?

So it's balance of probabilities, not beyond reasonable doubt.


Civil cases use a preponderance of the evidence standard.

Criminal cases require proof beyond a reasonable doubt.

In any light, yes, civil cases are about whether you’re more likely than not to be liable.


Isn't the point of transformer training for it to learn to imitate the distribution of the training data? While concepts of "imitating the distribution" and "copying verbatim" are different, they are not too far off each other either.
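
For reference, the textbook next-token objective in miniature (toy numbers only, nothing from any specific model):

    import math

    # Toy "model" output: a predicted distribution over a 4-token vocabulary.
    predicted = [0.1, 0.6, 0.2, 0.1]
    target = 1  # the token that actually came next in the training text

    # Per-position cross-entropy loss: -log p(target). Training adjusts the
    # weights to shrink this, i.e. to raise p(target), which pulls the
    # model's predictive distribution toward the training distribution.
    loss = -math.log(predicted[target])
    print(loss)  # ~0.51

Minimizing this over the whole corpus is exactly "imitating the distribution of the training data".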


I expect it is very much the entire point: to copy the human intellect producing individual arts (a.k.a. the patterns inherent in human-generated media), so they may be replicated by other individuals lacking the specific intellect or training to produce their own.


You are reproducing the abstract content, not the work itself. If your argument was valid, summaries of books and offerings like Blinkist should be illegal as well.


When "abstract content" has sufficient cover to generate indistinguishable replicates of the original is it still abstract?

In contrast I could read Blinkist for years and it would never make me write like Hemingway.

Note: I am arguing for sport here and not from some deep resentment of reality.


See, for me this section also jumped out, but for the opposite reason.

They are intentionally being deceptive about the aim of AI training. The aim is not to create a model, for the sake of having a model, the aim is to create a model that produces content for "direct human consumption".

I could use their argument to say it's fine for me to illegally download movies because I intend to simply create a corpus of works, and whether that corpus will be consumed by humans afterwards is (by their logic) irrelevant. Or that I should be able to download them because I only intend to transform them into AMVs.

You still have to legally obtain the original work for Fair Use to apply, and in many cases they are not legally obtaining them. "Publicly-accessible" has been ruled repeatedly in copyright law not to inherently imply "legal to copy", but that is exactly what they're claiming should be applied here.


But the creation of a useful generative AI system is just an intermediate step toward the ultimate goal of producing output for direct human consumption.


With Chain-of-Thought and Mixture of Experts, it may be that, in the long-term, most LLM output is primarily consumed by other LLMs [1]

[1] (meme, the future of communication) https://www.reddit.com/r/ChatGPT/comments/123tyge/the_future...


Yes but there is transformative work involved in the process, which is what the model does. It's not a non-transformative simple display of the work (say when I show a movie), but deeply transformative.


But it's still illegal for me to download a movie without a license, even if I just intend to make AMVs with/ "transform" it.

OpenAI's entire argument is that they think this shouldn't apply to them, because they don't want to have to make deals with the original content authors in order to get legal copies to use in their training, because that would be expensive.


It wouldn't just be expensive, it would basically be impossible to figure out the content owners of random works on the internet at the scale needed for the LLMs to output the high quality content that they do output. Even if they strike deals with the large content owners, they need way way more content than that.


Too bad. No one has a right to mass data collection.


Musical genres are a weird parallel to me: humans "training" on a giant dataset and creating an output that is distinct from any of the source material. We've decided that using that "training data" is fair use.


Humans are not AI models


All art is a variation of other art and the first art was a variation of nature.


Makes sense that a transformer is transformative lol


I think that even summaries are not fair use when they get to a certain length. The CliffsNotes company has to license the material that it summarizes; why not OpenAI?


Do they have to, or do they choose to, to avoid a costly court case?


They have to, it's a "derivative work".


The point isn't the point.

The reason a work was created is immaterial to copyright law. It would be like saying, I'm not consuming this media for enjoyment, but to learn how to create media, therefore it's ok that I'm not compensating the creators for that media.

That would not fly.

They want to be able to flagrantly ignore copyright because the whole enterprise becomes prohibitively expensive otherwise. You'd have to go and track down the source of every item fed into the model and request permission. Even without actually paying anyone, that's a time investment no one has.


> The reason a work was created is immaterial to copyright law.

Quite to the contrary, it is central and not at all immaterial to fair use, which is both a part of statutory copyright law now and, more importantly and even before it was incorporated as such, a constitutional limit grounded in freedom of expression on the power of the Federal government to have and enforce copyright law.


Give me a case where the fact an original "was created for human consumption" was used as a successful fair use defense.


The linked paper describes the conditions for fair use as given in 17 U.S.C. § 107, of which the first seven words are "the purpose and character of the use".

Law is a field where context is essential: to analyze a particular set of facts you need not solely the text of the law, but the full historical use and interpretation of that text in regards to other sets of facts. Speculation based on what one imagines the law to be is rarely useful in understanding law or legal arguments.


The purpose and character of their use, the infringing use, not the purpose and character of the original.

The rest of that segment is "including whether such use is of a commercial nature or is for nonprofit educational purposes;"

And then you go on to say "context is essential" while using the text of the law to justify your opinion. The "full historical use and interpretation of that text in regards to other sets of facts" has shown that original purpose has little to no bearing on trying to claim fair use.

Also, the list is not exhaustive. Those factors are considered, but they're not the entire test. Even the purposes themselves are examples, not an exhaustive list.

So why the original was created has no bearing on whether or not you've infringed copyright.

Their main thrust, however, seems to be that they're metaphorically making papier-mâché whales from shredded copies of whale books. Yes, the entire text of Moby Dick is on that whale, but it is not Moby Dick; you can't extract Moby Dick from it, and you couldn't even tell Moby Dick was used in its entirety.

Which is a much better argument than "bruh, we gave it to a computer".


Neither of us is purporting to discuss the original intended use. Your quote:

> It would be like saying, I'm not consuming this media for enjoyment, but to learn how to create media, therefore it's ok that I'm not compensating the creators for that media.

> That would not fly.

The copyright statute explicitly contradicts you, by requiring courts to consider the purpose and character of the infringing use. Perhaps you meant to make some other argument?


The implication is that the creator made the media for enjoyment and not for study, therefore if you're not enjoying it you can't be infringing on copyright. I can see how that was not clear in my statement.

Because part of OpenAI's argument seems to be based on the intent of the creator on how the media is to be consumed. Which doesn't matter.


If you liked reading this, you might also like reading through the Authors Guild class action lawsuit against OpenAI, filed a couple weeks ago: https://authorsguild.org/app/uploads/2023/09/Authors-Guild-O...

(Notably, John Grisham and George R.R. Martin are plaintiffs.)

It cites OpenAI's justification (which is how I found it) and tries to refute it. Whether the court will agree remains to be seen.

Other lawsuits: Silverman vs OpenAI https://www.courtlistener.com/docket/67569254/silverman-v-op... (complaint PDF: https://storage.courtlistener.com/recap/gov.uscourts.cand.41...)


> It cites OpenAI's justification (which is how I found it) and tries to refute it.

I actually didn't see any refutation in either complaint. I'm not sure that I would have expected to, given that fair use is an affirmative defense, but I would certainly be interested in an attempt to address it.


Partly. One of the refutations is that they say there will be substantial economic impact on authors from generative AI, which is one of the four criteria in OpenAI’s fair use claims (point #4).

You’re right that the complaint seems to go “all in” on this one particular point, rather than listing reasons to refute points 1 through 3. That seems surprising, but maybe it’s a good legal strategy since there’s less attack surface to raise objections.


The economic impact discussion to my reading is actually intended to establish standing, by showing a concrete and particularized harm to the plaintiffs. Generally, a well-formed complaint is trying to establish that there exists a 'case or controversy' which the court is able to redress, not necessarily to answer all questions of law, and generally not to make prophylactic arguments about your opponents' hypothetical defenses. The USPTO document is being used as an example of OpenAI admitting copyright infringement, which they seem to have been happy to do -- one hopes that they fully anticipated this lawsuit.


Just wanted to say, thanks for co-reading it with me. This was fun, and your commentary is insightful. I hope you have a nice week(end).


One thing to note from the USPTO doc is that it describes the reproduction of substantial parts of the source corpus as an error condition. This bears on part of the Authors Guild complaint, that ChatGPT was at one point able to reproduce chunks of copyrighted text. OpenAI would like to persuade the USPTO that this would be an erroneous output of ChatGPT. This seems like a persuasive position, but not necessarily dispositive.


Summary of their arguments for why training on copyrighted datasets is OK:

(1) Training AI systems on large copyrighted datasets constitutes fair use under current copyright law. They believe it is highly transformative, does not harm the market for the original works, and serves a different purpose than the original works.

(2) If it were an infringement, it would severely hinder AI progress, which has many societal benefits. Copyright law should promote creativity and innovation.

(3) Other legal tools like infringement suits on AI outputs, anti-scraping laws, etc can address valid concerns instead of rigid copyright laws.

Expanding on (1), they say:

(1.a) The corpora used for training are not made publicly accessible - they are only consumed by machines during training. So authors do not lose any potential audience.

(1.b) The outputs of AI systems do not regenerate significant portions of the original works verbatim. The systems learn general patterns, not the specific expression of any particular work.

(1.c) Well-constructed AI systems do not overfit on any particular work. Each work contributes very minimally to the overall model.

---

I think we should not be stopping progress, but this document 100% minimizes the impact/harm.

Soon AI writers and programmers will be so good that, even if they do not reproduce copyrighted content verbatim, they will be able to replace a large % of the future work of the people that produced the copyrighted content the AI was trained on. This seems like a clear harm.

Given that the datasets are the product of the whole population, the whole population should be able to reap the benefits. As things are heading, only the largest corporations will be able to squeeze profit from the most powerful models.

This goes against (1.a), if one company can use the data for training, it should be made easily accessible to everyone.

Side note for (1.b): many companies implement this as a dumb filter on top of the LLM output, making sure that no N-character substring of the output appears in some copyrighted text.
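
A minimal sketch of what such a filter might look like (the window size and the exact-match index are assumptions on my part; real implementations presumably hash the windows to work at scale):

    N = 50  # assumed window size, in characters

    def build_index(protected_texts, n=N):
        # Collect every n-character window from the protected corpus.
        return {text[i:i + n]
                for text in protected_texts
                for i in range(len(text) - n + 1)}

    def blocked(output, index, n=N):
        # True if any n-character window of the output appears verbatim.
        return any(output[i:i + n] in index
                   for i in range(len(output) - n + 1))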


> Soon AI writers and programmers will be so good, that even if they do not reproduce copyright content verbatim, they will be able to replace large % of the future work of the people that produced the copyrighted content the AI was trained on. This seems like a clear harm.

Note that "harm the market for the original works" is only a part of Fair Use and harming the original work does not necessarily make your use of it 'not fair use'. Movie critics using movie clips can absolutely destroy the market for the original work while still being fair use criticism.

Now whatever the USPTO says/does is important for now, but it's a backdrop to the legal challenges that will arise. Because the USPTO has to adhere to new laws and must follow whatever the courts rule regarding how copyright is defined, if the USPTO tries to say "ai is not fair use" then the courts can disagree with them and issue a Court Order to recognize certain copyrights, whenever someone challenges the USPTO's actions.


Movie critics harm the original work / producers of the work in different way: they dissuade the consumption of the work, rather than providing a 1:1 replacement of it.


I'd argue that a lot of people love to watch a movie only to find its flaws, and CinemaSins might remove their need to go buy/watch the film legally. Although that's probably 1/100th of the market for a movie, if that.


I cannot abide the argument that obviating work is a harm. That's luddism, which has proven every single time to be more harmful in the long term than embracing new technology, which may have harms but only in the short term.

I also think you're overestimating the short term harms. I believe human writing will stay valuable, especially the premium creative variety. It's the un-creative, mechanical writing that's becoming somewhat obsolete.


> I cannot abide the argument that obviating work is a harm.

That's a misrepresentation of the argument. The actual argument is that taking someone's work sans-permission and building on it in order to obviate the original work is a harm, because you would not have been able to obviate their work without their work as a starting point.

In other words, you didn't do most of the work, you just altered it, and then want to claim credit (and legal ownership) for it wholesale.


I'm not sure whether the courts will determine if this is fair use or not. It's a difficult concept to define rigorously. If it's found to be fair, I'm arguing that its effect on work cannot be considered harmful to society in the long term.

"you didn't do most of the work, you just altered it, and then want to claim credit (and legal ownership) for it wholesale" is a good description of all scientific research that has ever occurred.


> "you didn't do most of the work, you just altered it, and then want to claim credit (and legal ownership) for it wholesale" is a good description of all scientific research that has ever occurred.

And this is why we shouldn't allow copyrights or patents for broad concepts, only for specific methodologies/ implementations/ designs (patent) and exact works (copyright). But in this case they are not starting with a broad scientific principle like e=mc2, they are starting with an exact work/ book.

Learning the concepts taught in a textbook and then making your own textbook from the generally-derived knowledge is OK.

Taking the textbook and creating a list of notes, and then making a textbook from those notes, is copyright infringement and plagiarism.

ML models are not learning anything about the concepts behind the works they analyze. They are creating patterns of metadata about the content of the work itself, not the concepts therein, and creating an approximation of the content from those patterns in turn. This is directly akin to the metadata notes approach.

A more direct analogy would be reverse-engineering of software.

Legal RE is entirely possible, but requires a strict non-analysis of the software content itself, only the inputs and outputs can be analyzed. Never the software itself.

If you examine the content of the software, even if you do not actually copy any code, you are (potentially) committing copyright infringement or even IP theft.

All ML models do the latter; they look at the book/ song/ art content itself, and then attempt to produce something with similar effects.

Is the case a sure win against OpenAI? Not at all; courts are notoriously non-technical, and prosecutors may also not have the technical knowledge to properly contextualize the actions of OpenAI in a way the judge(s) can understand.

But do I think that based on all evidence it should be a sure loss for OpenAI? Absolutely.


That's a fine ethical argument for progress, and copyright law is indeed about encouraging the arts and sciences alike, but OpenAI is trying to specifically make an argument about "Fair Use", which does consider whether the new work obviates the old one and prevents those original authors from being compensated since people will just buy the new one instead.

What OpenAI is offering is a commercial product that obsoletes an entire industry of creative output - why pay any author or songwriter a dime from here on out? Just pay $19.99/month to OpenAI.

If that's progress in technology, great, but they're arguing that the training data and subsequent transformation is "fair use" - that the original authors get no say and no compensation.

I'd be all for commercializing the collective works of humanity if we got to redistribute the profits to those producers who actually made the superintelligence possible (ie, everyone) but whats not cool is a single commercial entity taking the worlds' copyrighted work and turning a profit from it without a single dime in due compensation.


This notion that a "single commercial entity" stands to benefit is very wrong [1]. There will be fierce commercial competition, and many open-source non-commercial options, for everyone to benefit from this technology.

I'm all for redistributing the gains from automation and economic concentration generally. I see the ideal future as one where so much of the economy is automated that everyone is free to spend their time as they choose, like doing creative or social work. But I don't mean for this to become an argument about UBI. I just don't see any difference between your arguments and the classic Luddite arguments, and nobody seriously thinks that was a good idea anymore. People can still make money weaving and knitting things by hand; it's non-commodified and therefore considered premium. I'll still write software even in a future where 90% of today's widget-churning jobs have been automated; it will just be more interesting.

[1]: https://news.ycombinator.com/item?id=35813322


I think we should not try to forcefully preserve jobs that are no longer needed, AND I think it could cause harm in the short term, and lead to further transfers of wealth.


I'm curious as to your solution to providing comparable (better?) options for those people who have been displaced? Agreed this will cause further transfers of wealth, but with the caveat that this will probably be upwards (it always seems to be) and in denser concentrations leading to long term harm.


(not the parent poster):

Treat data like oil, a public resource, and redistribute the profits to everyone equally.

I don't love the idea of taxing profits and letting the government send checks out, so being creative, how about we force dilution of the stockholders of whoever wins the AI wars and make every citizen a beneficiary of dividends.


I see no problem with this as long as the dividends are equal to the potential of anyone displaced or otherwise having their marketshare impacted.

EDIT: otherwise, if I lose a 120k+ (or whatever) job a year programming because one engineer and a cadre of llms is 'good enough' for management, and all I get is $15 a month for my portion of the collective dole, it's not really a solution.


> (2) If it were an infringement, it would severely hinder AI progress, which has many societal benefits.

This is one of the weakest and most horrific arguments I have ever seen from OpenAI supporting the unlicensed training on images / data for commercial purposes as 'fair use'. There is more to AI than LLMs, and nothing is being hindered. They are just inventing excuses to avoid paying for the data they are using for training.

The fact that OpenAI obtained permission from Shutterstock for its training data for DALL-E 2 suggests that they knew Shutterstock would sue them if they commercialized it without permission - just as Getty unsurprisingly caught Stable Diffusion-generated images containing their watermark.

Either use public domain images for free, or just pay for the sources for your training data if you're commercializing it. OpenAI already did that once already and they have plenty of money to do it again.

No excuses.


> Soon AI writers and programmers will be so good, that even if they do not reproduce copyright content verbatim, they will be able to replace large % of the future work of the people that produced the copyrighted content the AI was trained on. This seems like a clear harm.

That is not the harm that copyright is meant to protect. It's not there to keep lamp lighters employed.


I agree we should not try to preserve jobs that are no longer needed. Though I don't think there's any copyright involved in your analogy :)

I just think we should be clear about the potential short-to-medium term effects / harm. And given how transformative this technology can be, we may need to come up with solutions that don't fit into existing frameworks -- e.g. making sure that the datasets and models are open access.


> dumb filter on top of the LLM output

Just to provide an example here: I know the Dune quote was a popular example of a request that crashes the session, but in my case I wasn't even trying to get GPT-4 to trip. I just wanted to know what a song was about, so I asked "what is ballroom blitz about", and it described the song accurately but went on to quote the lyrics. It interrupts itself and starts over mid-lyrics, before trying again and quoting the lyrics inside a code block instead, which also gets cut off. Very clumsy implementation; you'd think they could be fine-tuned to not quote source material (unless they want to offer a version that does quote source material to a different market segment, without bothering to fine-tune again).

https://poe.com/s/eW2tp487r1pa1RiVyDbH


“clear harm” meaning what role of the government involvement, to you?

as it stands I’m not sure that observation or rebuttal has any bearing on anything aside from “oh well”


Well, I am definitely against the kind of involvement these large companies, which are set to benefit the most, are arguing for - which is limiting / gate-keeping access.

Rather, I could imagine the opposite -- some sort of guarantee of access to everyone, at fair prices. There are different ways to do this.


okay, yeah definitely watch out for that

I see the same arguments applied to already free, open source, copyleft licensed models


Something that is very disturbing in seeing people compare AI models to humans is the seeming implication by many of them that humans only have rights because a law says they do, and that there are no rights inherent to being a human (which AI models are not).

Arguing that because a human can do something, an AI model should be allowed to as well, shows a remarkable lack of value of humans.

Humans are more valuable than AI models, and should retain many inherent rights that AI models should not.

AI models are just pieces of software, they're not actual intelligence or sentience or life. And companies using them are just profit-driven companies, not benevolent child-rearers of a new lifeform.

This entire case is just a corporate power-grab seeking to profit from the work of others for free without legal repercussion, and it's sad seeing so many people jump to support the corporations.


> Humans are more valuable than AI models, and should retain many inherent rights that AI models should not.

This argument feels disingenuous. There aren't people out there trying to get AI votership or property rights. Even the most ardent AI supporters will agree that AIs shouldn't automatically enjoy the inherent rights we enjoy as human beings. This is a pure strawman.

The issue at hand is copyright laws, and IMO the conversation is a necessary one. With or without AI, a significant amount of people have been wondering whether or not the current set of legislation is effective in enforcing the spirit of the copyright law, which is to incentivize and protect original ideas. The fact that we have AI's that can learn and iterate at blazing speeds was just a catalyst that exposed some large holes in our legal system - the holes were always there.

The question now is where do we draw the line for AI? Even if everyone in the world agrees that "Humans [...] should retain many inherent rights that AI models should not", we still need to decide where that line should be drawn.


This is honestly the best response so far, and I agree that my statement was probably too ambiguous.

I am not claiming that anyone says AI should have all the rights of humans, I'm saying that people are arguing that:

-> the Right existing for humans (to organically learn by ingesting works of others)

-implies that-

-> learning must also then be allowed for non-human learning (i.e. AI models) as a Right

That argument does not make sense to me unless you start from the assumption that "learning" is what grants the Right (i.e. that everything has a right to learn, computers included), rather than being human, which is what I mean by "devaluing" people.

I agree the court case is necessary, but only as a vehicle to enshrine what should be an obvious truth:

Humans learn from experience whether they endeavor to or not, so they literally cannot ingest media (e.g. for the purposes of entertainment that the creators intend, per OpenAI's brief) without assimilating it into their mind.

That is not true for computers. AI models do not consume content for the purposes set out by the creators. They consume them for business purposes, and to allow them to do so, without license or permission, completely destroys copyright protection against corporations.

We do not need a new law; the current law already doesn't allow companies to use copyrighted works without permission. We just need the existing law enforced against AI model companies.


It's the other way around. Arguing from similarity to humans is a counter-argument to the idea that a process similar enough to a human practitioner[0] should be covered by IP laws as if it were a regular piece of software - IP laws that are mostly archaic and already stretched to near-breaking point to cover computers in the first place. It's not about whether or not humans have inherent rights. It's about logical consistency.

This is perhaps a "failing"[1] typical of STEM/tech people: expecting laws to be consistent. Consistent in a way similar to mathematics, in that for any ruling you can trace the arguments back to underlying interpretations of laws, and trace those down to written rules, and trace those to some more fundamental rules, and eventually to some sort of intuitions about morality and fairness. And at least to me, the argument that generative network training ought to be treated like compiling and obfuscating a regular dataset of copyrighted works seems justified merely by "it's different because computers".

This is not to say that current models are learning and creating art and text the same way humans are, but rather that the process just seems to be close enough. And the point of mentioning logical consistency is this: I may agree with copyrighting the living hell out of LLMs for pragmatic reasons[2] - because it shakes the boat too much, has potential to destroy livelihoods of too many people in too short a time, while further centralizing power, etc. Those are all valid arguments. "Because it's a computer", to me, isn't. Not when the process and effects are already eerily similar to how humans work. Not when such argument would apply just as much in a hypothetical future where we develop sentient AIs[3].

--

[0] - In very limited scope, but also the very scope subject to legal issues.

[1] - I don't really consider it a failing. There's both beauty and efficiency in things making some kind of sense.

[2] - Intellectual property laws themselves are mostly pragmatic in this sense anyway.

[3] - Even in this future, we'll have to face pragmatic issues. I read an interesting take on this long ago, I think in one of Eliezer's essays: how do you handle democracy, fair resource allocation, basic ethics, in the presence of AI people that exist in silica? Such AIs will likely be able to reproduce much faster than humans - bringing new individuals into existence at the speed of factories pumping out GPUs. Equality and democracy are all fine until suddenly there's 10 trillion of AI people, and only 10 billion of human people, and everyone gets an equal vote. How do we deal with that?


> the process (of human vs AI learning) just seems to be close enough

It only seems that way because the people who understand it best also stand to profit from it, and are choosing to misrepresent it so that people who don't understand it falsely believe it is at all close.

That so many people on here are pushing the false narrative that they are at all similar is sad, since this is ostensibly a forum for people with technical knowledge. Nothing about machine learning is similar to human learning, by any measure of evidence. If I am wrong, show me that evidence.

"Data go in, something happen, metadata come out" is the closest you can approximate the 2, and too many supposedly tech-savvy people seem content to treat that as "close enough".

> "Because it's a computer", to me, isn't. Not when the process and effects are already eerily similar to how humans work. Not when such argument would apply just as much in a hypothetical future where we develop sentient AIs.

The argument is not "because it's a computer", it's "because it's not sentient", as I mentioned in my comment. Animals also have rights, because they are living.

ML models are none of those (living, intelligent, or sentient), and the popular conflation of ML models and "AI" with "AGI" - the kind of AI that may or may not ever end up existing - is just a convenient excuse by companies looking to maximize profit and minimize licensing costs, not a reason to regulate the plain old Python that we're actually talking about now.


> that humans only have rights because a law says they do, and that there are no rights inherent to being a human

It may be disturbing, but that is the nature of reality. We humans indeed have no inherent rights, only those written in law. Sometimes the law says words like "these are universal and apply to all", but thinking those words really have that power because they say so is like thinking Harry Potter has to be a true story because none of the characters say "I'm fictional" when under the influence of Veritaserum.

> AI models are just pieces of software, they're not actual intelligence or sentience or life

Life is just a collection of chemical reactions.

Putting "intelligence or sentience or life" next to each other like that implies you think these are connected.

I don't think "life" is a requirement for either of the other two.

If the abilities demonstrated thus far by AI do not meet your standard for the word "intelligent", what does?

If intelligence were a prerequisite for sentience, would that be with a hard cut-off of "you must be IQ ≥ XYZ to be sentient"? Or do you suspect that sentience is a variable, with less intelligent systems also being less sentient, but also no cut-off? If there is a cut off, where? Why?

When I was a kid, we could happily assume that intelligent computers were pure sci-fi; now models like GPT-3 have a complexity similar to the brain of a medium-sized rodent. The scale may be important or may be a red herring, but we should take the questions seriously and try to come up with actual answers to each before it matters.
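
For rough scale on that rodent comparison, a back-of-envelope only - the biological numbers are loose estimates, and parameter counts are at best a crude stand-in for synapses:

    # Back-of-envelope only; parameters and synapses are not directly
    # comparable, and the biological figures are rough estimates.
    gpt3_params = 175e9        # published GPT-3 parameter count
    rat_neurons = 2e8          # ~200 million neurons, a common estimate
    synapses_per_neuron = 1e3  # order-of-magnitude guess
    rat_synapses = rat_neurons * synapses_per_neuron  # ~2e11

    print(gpt3_params / rat_synapses)  # ~0.9, i.e. the same order of magnitude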

I'd rather treat a mindless automaton with unnecessary kindness than treat one that is able to suffer with cruelty; and on the off-chance that full-brain-uploading becomes a thing, I'd want to know for sure if an upload of me could or couldn't have qualia.

> This entire case is just a corporate power-grab seeking to profit from the work of others for free without legal repercussion

Well, some sure is. Some's given away for free, so it's not capable of being a power-grab.


> a corporate power-grab seeking to profit from the work of others for free without legal repercussion, and it's sad seeing so many people jump to support the corporations

lol who do you think posts on HN? a tech startup blog run by an incubator. half of us are those people, or aspire to be.


I'm a Bay Area startup alumnus, so closer to the actual VC scene than most on here, and all the actual successful founders I know have a pretty strong belief in the good that their products will bring, even if they usually overstate the likely impacts quite a bit.

"Move fast and break stuff" != "...and run people over in the process". Are there greedy a-holes in every space? Of course. But if you think that's the default mindset at startups, I don't feel like you're going to have much success in the space.

Do I think OpenAI probably thinks that their work is less impactful to Stephen King's sales than he is claiming? Absolutely.

Do I think they're lying to themselves about the actual negative impacts of their work because they think their product is really cool and has so many potential uses? Absolutely.

Do I think they secretly all just want to be the guy with the boot, crushing artists' livelihoods and pushing for regulatory capture, like you seem to be implying they (and others here) are? No.


A shoe has no rights. No right to loiter, enter an airport, or climb a mountain.

Nor does an AI.

A human has a right to use a shoe. Or an AI.

I see no reason one tool is inherently profit-driven while the other benevolent.

(An exercise left to the reader: whether leather or pages are the more renewable and humane resource.)


> A human has a right to use a shoe. Or an AI.

What right is that, exactly? I don't remember the "Right to use a tool".

You are trying to make "using a tool in a way that harms people" have greater legal or ethical weight, by falsely calling it a right, than the harm they're doing.


The argument was not "human can do it so should machine qed". The fact that humans so easily do something is a talking point in a large subthread, not the conclusion.

Yes, an AI model is just a digital machine that is presently hard for many people to distinguish from magic.

> This entire case is just a corporate power-grab seeking to profit from the work of others for free without legal repercussion, and it's sad seeing so many people jump to support the corporations.

This is where you're wrong. There's very clearly an active and relevant dialog surrounding the concept of machine generated content and how it should be interpreted by society (now that machines are getting really good at doing things that were difficult in previous decades). So at a surface level it's hardly decided.

Further, your assertion that people who use copyrighted works to create a product that provides value to society are unfairly being enriched at the expense of copyright holders is naïve and uninformed by reality. People use impressions of copyrighted material all the time in their day-to-day lives, in everything they do. I read copyrighted textbooks to learn skills required to provide value to society. I read the news all the time and make decisions based on it. So to say that OpenAI can't feed a machine the text of a popular book (one it has legally obtained a copy of) in order to calibrate weights used to help it emulate reality is not based in any current notion of fairness, even as protected by copyright today. A naïve reading of copyright would argue you had to copy the work to do that and you don't have the right to that; a sophisticated understanding would argue that it's fair use, because it happens billions of times a day, it doesn't realistically compete with the author's market for the book, interoperating with a machine is protected under the DMCA, and it's a damn computer - anybody launching an ereader to view a copyrighted work is fairly copying bytes around in memory all the time; that's just how digital things work. Ideas can't be copyrighted.

What is arguable is whether there's a problem if an LLM reproduces parts or all of a copyrighted work. Some people think AI should launder copyright, some think the human generating the work should be held to all the normal standards of not plagiarizing or otherwise committing copyright infringement in what they decide to do with the LLM output, and some argue OpenAI should pay a royalty every time it does (or something to that effect).

Long story short, this isn't some random corporate power grab. As a fellow human, I very much believe society can navigate the introduction of LLMs without devolving into "every piece of training data must be specifically licensed to be used as such, and don't forget royalties". I don't like the society I imagine if we take that stance (frankly, I don't like copyright as it is deployed today), and I believe the value LLMs provide will do more good than harm, so it's irresponsible for society to unduly limit the proliferation of this technology in a bygone effort to protect old-school content creators (well, really publishing conglomerates and large media corporations).


This is a lot of text just to say, "I care more about ML systems getting better than I do about whether they do so ethically".

> people who illegally (ftfy) use copyrighted works to create a product

Yes, this is something that most people who aren't illegalists agree is wrong.

> that provides value to society

This is just wholesale Kool-Aid.

AI is a tool. The tool doesn't provide value, only the use of a tool does, when that use is positive. The use can also be negative.

This case is not about the legal permissibility of using copyrighted works to train AI. That is a misrepresentation.

This case is about the legal permissibility of using copyrighted works, against the wishes of the works' creators, to train AI, which is clearly wrong.

"What do a few individual humans matter versus the advancement of humanity?" is an argument that has been invoked many times, by the evilest people.


That's not my argument, and if you really interpreted my comment as such we are probably at an impasse.

If I write a poem and you buy a copy and feed it to your LLM and make a billion dollars I don’t think I’m innately, ethically, or fairly entitled to a single dollar of that money. I think you’re within your rights to do that since you paid me fairly for the copy and my understanding is that you can use that copy as you see fit because you own it.

I fundamentally don’t think you understand the role that ideas and the derived creative expressions thereof play in society. For humanity to progress ideas must flow freely. Attempting to put roadblocks in front of the free exchange of ideas is the evil profit grabbing motive at work. The reality is that any one individual is not entitled to collective royalties from the commons just because they voiced an idea. In my opinion that’s whats poisonous about this whole thing.


> and you buy a copy

This is the step that OpenAI and other companies are missing. They didn't pay for the content they are using.

> For humanity to progress ideas must flow freely.

AI models do not, and will never, progress humanity. The ability to have a computer write, draw, or vocalize for you does not change you.

You are fundamentally misunderstanding machine learning. It's just fancy python code that approximates a derived pattern. Nothing less, nothing more.


> This is the step that OpenAI and other companies are missing. They didn't pay for the content they are using.

Where's the evidence that OpenAI pirated all the content they trained models on and didn't access it normally just like everyone else? It's quite the allegation that OpenAI torrented `all-media-archive.xz` to train their model.

> It's just fancy python code that approximates a derived pattern. Nothing less, nothing more.

You are sorely mistaken if you think that's all that people are throwing billions of dollars behind.

Did the steam engine progress humanity? Did the internet progress humanity?

Obviously LLMs themselves aren't autonomously progressing humanity. But what they do as a tool to democratize (hopefully we make sure it's democratized) cheap expression already is pushing us forward and creating new space for humans to imagine and express new content that we've never seen before. You'd have to be blind or ignorant to not see that. Just go join MidJourney's Discord and see for yourself.

And it's not just art. More and more people with less knowledge of high computer science are able to complete simple scripting level tasks with the help of LLMs. Even experienced professionals are accelerating their development flows and productivity by leveraging AI to skip a lot of the mundane exploratory/discovery level research that comes before diving into a task. I personally know people who use AI code tools in their workflows today.


> Where's the evidence that OpenAI pirated all the content they trained models on and didn't access it normally just like everyone else? It's quite the allegation that OpenAI torrented `all-media-archive.xz` to train their model.

This shows a fundamental lack of understanding of copyright law on your part. Getting a PDF of The Shining by googling/ web scraping, versus getting a PDF of it by torrenting, does not make obtaining the PDF any more legal. Publicly-accessible != legal to download.

Only the grant of permission by the copyright owner or a licensed distributor (when talking about digital copies) gives you the right to obtain a copy.

There is substantial evidence (which OpenAI freely admits to) that they downloaded copies of copyrighted works without permission by the copyright holders. That makes it copyright infringement.

> Did the steam engine progress humanity? Did the internet progress humanity?

Not in the least bit.

Caveman hit caveman with stick progressing to caveman hit caveman with nuke is not a progression of humanity. Trains and the internet did not make humans any better than before, they just consolidated power into certain groups.

Mistaking technological progress with social or certainly evolutionary progress is why people roll their eyes at techbros.

"Cavemen stop hitting each other" is progressing humanity, and that won't be a result of technology.

> You are sorely mistaken if you think that's all that people are throwing billions of dollars behind.

I know exactly what I'm talking about, and that is exactly what ML is. And if your only argument is, "surely people wouldn't sink millions of dollars into something unless it was revolutionary!" you've never dealt with corporate software licensing and development.

People pay Splunk billions of dollars to run regex on their logs that they could write themselves.

People pay billions of dollars for (F)OSS just to have proserv support contracts they never use.

Billions of dollars are wasted every day across the world for B2B cruft.

ML is cool and all, but it's just code. Code running on a computer. Code you can run on your home pc. It's not magic, or inscrutable, or greater than the sum of its parts. Data go in, data come out, same as every other program.


Can you point me to where OpenAI admitted they illegally acquired copies of works for training data? I’ve always said “if you acquire the training data legally then all is fine”. I never asserted that anybody definitely did acquire it legally. If OpenAI torrented some books then they should be made to pay for those.

Anyway at this point you’re being deliberately obtuse with your tirade about how a freely flowing exchange of ideas and culture (and tools that facilitate as much) does not progress humanity. No point in continuing to argue…


> Can you point me to where OpenAI admitted they illegally acquired copies of works for training data?

Yes. It's on the first page of the linked PDF:

> For this response, we draw on our experience in developing cutting-edge technical AI systems, including by the use of large, publicly available datasets that include copyrighted works.

They then go on to try to claim this is just how everyone does it, as though that makes it okay, including making copies of the copyrighted works:

> Modern AI systems require large amounts of data. For certain tasks, that data is derived from existing publicly accessible “corpora” (singular: “corpus”) of data that include copyrighted works. By analyzing large corpora (which necessarily involves first making copies of the data to be analyzed)...

> at this point you’re being deliberately obtuse

No, I just suspect we fundamentally disagree about the value and importance of many technologies.


Well I never argued that under the current rules OpenAI shouldn’t have to acquire an initial legal copy of copyrighted works, just like libraries do. So we agree.

If I were OpenAI I'd argue that a corpus built for the training of an AI model is transformative, since it doesn't detract commercially from the original works. No one wanting to train a model would otherwise be using that text to read the book instead.


You are misunderstanding the purpose of transformation as a Fair Use criterion; it's there to prevent using copyright claims to quash parody and critical (as in criticism) discussion of the copyrighted work - for instance, using copyright to attack a journalist who quotes shitty writing in a book review, or a comedian who alters a quote to make fun of a Disney character, etc.

Not only is training ML models not a form of critical expression that enjoys 1A protection, but using the book itself to train a model isn't transformative; you didn't alter the book, you used the book verbatim to train on. That the model doesn't retain the book doesn't matter; copyright doesn't only cover output (i.e. a derived work like said comedy skit), it also covers input (how the original work is obtained), and in this case the works were not obtained legally (which means with a license for said use).

> I’d argue that a corpus built for the training of an AI model is transformative since it doesn't detract commercially from the original works

The reason OpenAI isn't arguing this is that they know it would be DOA. You cannot just package copyrighted works together, or in a different way, and claim that is transformative; there is direct case law about that, because torrenters tried to claim exactly that, and lost. In fact, several cases have explicitly ruled that transformation does not apply when the entire original work is present.


One could easily argue derived work is sufficiently different from the original to avoid a violation.

ChatGPT doesn't just steal and slightly modify the original. It fully copies the style and feel, which aren't copyright protected.

You can't just change the color of Mickey Mouse, call him AI Mouse and be good, but you can certainly create a new Disney-style character without triggering copyright violations.


> It fully copies the style and feel, which aren't copyright protected.

Is that for certain? I have a feeling this point could be debated.


"style and feel" are entirely subjective.

If I give DALL-E or MidJourney a picture of myself that I took (and therefore I own the copyright), and ask it to "Recreate this picture in the style of Pixar", am I violating Pixar's copyright?

What if I decide to model it myself and deliberately imitate Pixar. Have I violated copyright?

What if I decide to model it myself, and it just happens to look similar to Pixar, because I've watched a lot of their movies and it has influenced how I model people in 3D animation. Have I violated copyright?


What if an AI model took the Barbie movie and added a filter to change the hue of every pixel and altered the voices to have Australian accents instead of American accents?

I wonder if Disney would sue if you tried to sell and distribute despite 100% of the pixels and audio file not exactly copying the original.

(Rhetorical question, Disney would definitely sue)


I think early YouTube movie uploaders tried this angle but ultimately failed.


Question - does the model knowing what the style of Pixar is, and knowing how to reproduce it, constitute trademark infringement instead?


The answer to all of your questions is no


Exactly. The court would consider other things like the impact of your modified copy and whether it competes financially with the original.


https://creativecommons.org/2023/03/23/the-complex-world-of-....

"Style and artistic expression are certainly linked, but they are not the same thing. In some instances, mimicking an artist’s style may constitute copyright infringement, but when this is true will necessarily require a case-by-case analysis. Simply working in a style that is similar to another artist does not automatically mean the work is a copy. Moreover, there may be legitimate reasons to produce works in another artist’s style. The law should allow artists to be inspired by each other, even if that means having similar styles. Doing otherwise would chill expression, ultimately harming human creativity and the purposes of copyright law itself."


FTA: “courts have rejected the argument that commercial fair use is by its nature invalid”

They don’t mention https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,.... There, initial settlements had book authors share revenues with Google, and Google was allowed to go ahead with the project because Google Books complied with (from that Wikipedia page) “the four traditional factors […] that decide whether use of a copyrighted work constitutes fair use under United States copyright law”, stating “Google Books enhances the sales of books to the benefit of copyright holders.”

Of course this is a different case, but I think it’s similar enough for a judge to at least consider whether what OpenAI does is to the benefit of copyright holders.

Unless they start revenue sharing with copyright holders or paying them, I don’t see how they can be.


Yeah, it seems like so many people in here are shilling for AI companies over actual creators, it is absolutely depressing.

No one is arguing that OpenAI should not be allowed to train on copyrighted works; they are arguing that OpenAI should not be allowed to train on copyrighted works FOR FREE.

OpenAI should, just like anyone else who wants to make use of a work (even in a transformative way, like having a clip from a song or movie playing in the background of their movie), have to make an agreement with the copyright holder that allows use for training. Of course, OpenAI doesn't want this because actually paying all the artists who their business relies on would be too expensive, so they just want to claim that they should be allowed to use the work anyways.


> Yeah, it seems like so many people in here are shilling for AI companies over actual creators, it is absolutely depressing.

It's the general sentiment on HN. Remember this forum is based in SV, moving fast, breaking things, tech eating the world, all that.

This is a really complex issue. Personally I really dislike copyright as it has been used in practice up to now: as a way to lock culture and progress away behind the control of megacorporations.

But now we have the case where the megacorporations are suggesting that they should be immune to such restrictions, free to bypass them given that their bots "transform" the original works. It actually seems like copyright is the only possible check on their power.

I can't claim to know how the matter of law will be resolved here, but "always bet on the biggest corporations winning" is a generally useful heuristic to apply.

By far the worst outcome would be that megacorps are deemed immune from obeying copyright restrictions when consuming creative works, but are somehow still protected by copyright when their bots spit out derived works.


100000% agree.

I get that this is a forum for "future millionaires" (wink wink), but you'd think that us IT folks should be both the most capable of seeing through the b.s. hype that companies put out trying to mystify software to the masses, and the most aware of the inhumanity of corporatism, since we're surrounded by it in our daily work.

I love tinkering with machine learning as much as the next Bay Area techbro (which I am), but that doesn't mean I'm going to give companies a free pass to exploit creators further just so I can have GPT4All generate cooler prompts on my macbook.

It's wild.


"The fair use of a copyrighted work [...] for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

(2) the nature of the copyrighted work;

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

(4) the effect of the use upon the potential market for or value of the copyrighted work."

-- 17 U.S. Code § 107 (https://www.law.cornell.edu/uscode/text/17/107)

I don't know how one can read this as an impartial observer and make an honest argument that OpenAI is in the right.

Their use of copyrighted material does not fit any of the purposes enumerated in the first paragraph; it fails criterion #1 because it is of a commercial nature; it fails criterion #2 because it includes all kinds of works; it fails criterion #3 because it's not limited to very small extracts; and it fails criterion #4 because their products are already having an obvious effect on the market.


Incidentally, this will provoke a reaction: companies will make their content unavailable on public sources, to avoid working for free to feed AI freeloaders.

The opposite view is also valid: SEO types will figure out how to inject their BS into models so that the models recommend their stuff.

As AI models come to replace the "search" market, they will become as useless as today's search tech.


I don't think it fails (3) specifically, and the others are moot if it doesn't fail (3). 99.9999% of the time, it just straight up does not reproduce any concrete part of the piece.

It just reproduces some very very hard to quantify tiny fraction of the logic or idea of the document.

It's like saying someone is infringing copyright if they're able to recite a document they read, because it has had an unimaginably tiny effect on their general writing skills and they can recite it when asked to.


but it also fails on the very philosophical framework used to come up with those laws

they're using philosophical frameworks older than digital computers; which is OK, and is as it should be - up to the point where those ways of understanding fail to capture certain qualities of computers and digital technology, qualities which break the whole notion of copyright and have been doing so for a few decades now.

those lawmakers are using obsolete philosophy!

but we gotta wait until they all die off; it's not like people, especially older people, are willing or even able to change how they think

I blame digital technology, it just doesn't work like the rest of reality does...


using obsolete philosophy has never stopped a supreme court judge from upholding shit written by the founding fathers


but they're using the 'same' obsolete philosophy as the founding fathers used when writing those things

so it's correct that they do this, else they would change the meaning behind those words.

what is needed is for Congress to approve new philosophy, so they can use it to make better (meaning up-to-date) laws which are appropriate for this technological epoch

and what is needed for that is that the academic community comes up with new philosophy

and what is needed for that I do not know


Your post would be more substantial if it were informed by the arguments against the points you raise by the linked article. In general, looking only at the text of a law is insufficient as a basis for legal analysis.


and after that court case, I’ll go after other accepted fair use works under a 14th amendment challenge saying it needs to apply equally to other use cases simply because ads are being used, or a university charges for a book, or a publication charges for subscription access

which I’m fine with - don't think I’m trying to deter you with a slippery slope fallacy, we’re already at the bottom of it


The use is using it for training data. I'm pretty sure there isn't much of an effect on the market for training data.


And what does the training data then do? Just sit in a box, or produce content for "direct human consumption" that would have market impacts?

If all anyone has to do to invalidate copyright law is introduce an intermediary step where it is used for a non-consumption purpose, then copyright is already dead.


Frankly, the point of copyright - an artificial monopoly on the reproduction of a work - is to encourage people to create by allowing them to financially benefit from their works.

If letting AI train on whatever its masters want results in a reduction of creative works being created ... I think governments might just amend the law to fix that - rendering all the "lawyering" by OpenAI over current copyright laws moot.


It's worth noting that Fair Use, while now also incorporated into statute (which Congress can change by regular law), was first articulated by courts as a First Amendment limit on the copyright power and only later incorporated as statute law (which is why it's kind of a fuzzy statute: it just literally reproduced the judicially-articulated multifactor test).

So, fair use arguments probably cannot be overcome by simple statutory change.


Sure, in terms of removing Fair Use entirely, but nothing about Fair Use is absolute, and it can absolutely be substantially changed and reinterpreted. Especially since this is about companies using Fair Use for financial gain, in ways that could absolutely harm the original creators, which runs completely counter to Fair Use's 1A intent.


> Sure, in terms of removing Fair Use entirely, but nothing about Fair Use is absolute, and it can absolutely be substantially changed and reinterpreted

Yes, the courts could change how they interpret the First Amendment restriction on the copyright power, but absent that, the interpretation binds the power, and Congress cannot expand its scope by weakening the statutory articulation of the pre-existing court interpretation of that limit's boundary.

(Now, trying to do so would be a way to push the courts to reconsider the boundary, but that's about it.)


You're putting much more emphasis on the 1A link to Fair Use than it actually has.

1A is not the entirety of the basis for Fair Use, it is much more expansive than the 1A justification alone, which relates primarily to transformation for critical purposes, because that would constitute the government creating a law that limits critical speech.

Most everything else is non-1A, and can be much more easily changed by legislative means.


Free speech is about expression. It doesn't say anything about consumption. I think congress can ban the feeding of copyrighted content to AI. But IANAL.


I doubt governments will want to impede on the progress of AI. Doing so would likely be economic and cultural suicide over the coming decades, a modern equivalent to banning the loom or tractor.


Depends on if AI (in its current form) ends up beneficial to society.

If people stop writing because AI is going to "appropriate it" anyway and the amount of creative work drops, everyone will be poorer for it.

The loom and the tractor displaced many workers, but they also increased the production of whatever those workers produced.

If article writers stop writing tutorials / guides / whatever because AI will just appropriate it, will AI be able to produce original tutorials / guides / whatever for the latest stuff?


Likely content creators will encrypt everything, then use the DMCA to protect it from any use they don't want.

The act of decrypting to feed your AI would then not be acceptable under the licenses you are granted with your book or image, unless explicitly allowed.

Information will return to a locked-up state. It's going to make the world much less open and much less pleasant. It will be done to protect "the creators" (which it will do), but more so to protect the corporate creative companies (movies, music, books, TV).


IMO, the most likely outcome will be a move towards stronger checks to validate personhood.

The sociotechnical changes necessary to do this effectively are already in motion, facilitated through the big three consumer identity providers - Apple, Google and Microsoft. State-run services could also play a part.

There are privacy preserving means of doing this safely, but they would need to be extremely robust in order to gain and maintain the public’s trust.


To add some important context: as far as I can tell (I am not very knowledgeable in US filings, so someone please correct me if I am wrong), this was submitted in late 2019, and it stands to reason that many of the more recent developments, especially concerning the quality of possible output, have led legal experts to look at this topic more intensely than they did four years ago, during the era of GPT-2. Perhaps the year could be added to the title to help put this in the proper context.


I am OK with AI being allowed to train on publicly accessible data, I'm not sure I see a big difference between me reading something publicly accessible versus a computer program reading the data.

I do think however that we are mostly ignoring cases where the AI literally does plagiarize the training data. I've seen some very egregious examples of ChatGPT "generating" code that is line-by-line copied from open source code and could be in violation of licenses, and ChatGPT does not cite the original source.

You ask ChatGPT to write something relatively simple and the output looks OK, but you ask it to do something very specific and it seems to like to spit back a copy of OSS code that someone already wrote.


The efficiency of computers makes people nervous:

- If I put a digital camera in a school classroom connected to a microchip and use facial recognition to automatically get class attendance, that is dystopian and scary

- If I put a biological camera in a school classroom connected to a lossy neural network and use facial recognition to manually get class attendance, that is normal and ok

What's the difference? In one case we "feel better" because unlike a computer it is difficult/impossible to move/copy memory out of the human brain. If we ever developed a technique to losslessly extract memories from the human brain, then people would feel very nervous about that tech since it could make anyone eidetic which could infringe on their privacy, etc.


If you want to spy in this way, you need to hire a lot of expensive humans.

If you want to spy using camera and computer, it would hardly cost anything, so you can put it everywhere and destroy privacy forever.

That's why people don't like it.


I have seen this as well and have been able to Google the source, but people seem to like to pretend this never happens.


I think if they admit to using books3 they will inevitably lose, on the basis that they used pirated material. For their case to stand they have to prove that they legally purchased the material, which I assume they did not. I do not have a law degree, but as far as I know most countries outlaw downloading copyrighted material from unauthorized sources. Of course no one has the resources to go after random people, but I think they will go after OpenAI if they admit to using books3.

It is one thing to argue that training on purchased material is fair use. But that is not what they did. I can see why open source projects do not have the money to purchase the books to do the training. But multi-billion-dollar companies should definitely pay to own the books.


> I do think however that we are mostly ignoring cases where the AI literally does plagiarize the training data.

So just punish the AI just like you would punish a human? Take it to court, have it respond before a judge, if it loses it will be subject to having GPUs taken out.


You make the key error of giving the AI personal rights and responsibilities. As it currently stands, it has no sense of individuality and does whatever users tell it to do. If it reproduced copyrighted material I would expect either OpenAI or the user to bear the responsibility: the user if they published the result, and OpenAI for generating it.

If the AI could choose to refuse a prompt of its own volition you could make that argument. At the moment, however, OpenAI and the user set the limits on what the output gets used for.


The point I was trying to illustrate is that we're applying an antiquated system of justice to a new world. That antiquated system focuses on blame and punishment, instead of solutions.

If we focus on solutions, it doesn't matter who is responsible for reproducing copyrighted material. If we stop thinking about blame and focus on solutions, those solutions could include (a) giving fair credit or payment to the creator of the material, or (b) continued training of the algorithm to be more creative and not spit out copyrighted material verbatim.

If a high school student plagiarizes, do you: (a) spank them and put them in detention, or (b) give them pathways to discover their own form of creativity and preferred form of expression so that they have no interest in plagiarizing? Most people default to (a), but I'd say (b) is much more constructive, effective, and beneficial to society.

Now, with that in mind, and getting off the blaming psychology of the past, how do you implement (b) with an AI? You add loss functions that specifically discourage plagiarism, loss functions that reward creativity, etc., and retrain. Now we're talking.
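
To make that concrete, here is a minimal sketch of what such an objective could look like. Everything here is hypothetical (the function names, the weight, the penalty itself), and note that a verbatim-overlap penalty is not differentiable, so in practice it would more likely serve as a reward signal in RL-style fine-tuning than as a direct loss term:

    # Hypothetical sketch, not any real system's training code: score sampled
    # output against its training sources and penalize long verbatim overlap.
    import torch.nn.functional as F

    def verbatim_overlap(generated_ids, source_ids, n=8):
        # Fraction of n-grams in the generated sequence that appear verbatim
        # in the source sequence: 1.0 is pure copying, 0.0 is none.
        src = {tuple(source_ids[i:i + n]) for i in range(len(source_ids) - n + 1)}
        gen = [tuple(generated_ids[i:i + n]) for i in range(len(generated_ids) - n + 1)]
        return sum(g in src for g in gen) / len(gen) if gen else 0.0

    def training_objective(logits, targets, sampled_ids, source_ids, copy_weight=5.0):
        # Standard next-token cross-entropy, plus the (non-differentiable)
        # copy penalty that an RL-style update could optimize against.
        lm_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return lm_loss + copy_weight * verbatim_overlap(sampled_ids, source_ids)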


OK, so you would argue that if the AI gets caught reproducing copyrighted material, the company should get time to re-train the AI instead of getting sued to hell?


Yes, 100%. That's how we advance science as fast as possible. Focus on fixes, not suing.


If OpenAI was interested in “advancing science,” all of their research and models would be publicly available. Since they’re not, your appeal to this is entirely specious.


Yes, I think that's fine. So OpenAI is more or less safe, but it also means that nobody can really use Copilot without a fairly huge legal risk.

Probably solvable though. Can't be that hard to do a fuzzy search of the training set for similar code.
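
For what it's worth, a crude version of that fuzzy search is easy to sketch. This is a toy illustration (all names are hypothetical); a production version would normalize tokens (strip comments, rename identifiers) and use a scalable index such as MinHash:

    # Toy sketch: index a training corpus by shingled n-grams, then flag
    # generated code that shares many shingles with any single source file.
    from collections import defaultdict

    def shingles(text, n=5):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def build_index(corpus):
        # corpus: {filename: source_text} -> inverted index of shingles
        index = defaultdict(set)
        for name, text in corpus.items():
            for sh in shingles(text):
                index[sh].add(name)
        return index

    def suspicious_sources(generated, index, threshold=0.5):
        # Return source files sharing at least `threshold` of the output's shingles.
        gen = shingles(generated)
        hits = defaultdict(int)
        for sh in gen:
            for name in index.get(sh, ()):
                hits[name] += 1
        return [name for name, c in hits.items() if gen and c / len(gen) >= threshold]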


The point I was trying to illustrate is that our legal system needs to change from one focused on punishment to one focused on rectifying the problem, at least for non-violent, reversible wrongs like plagiarism.

When the legal system can make that change, it becomes a non-issue.


I've always felt that if you want fairness and want to reap the full benefits of your creation (here, AI), you should also assume full responsibility for it. This is where most organizations fall short, as they are always trying to shift all responsibility to their users (self-driving, anyone?).


When you have millions of users, and your product has inherent danger, you can't assume the liability for all of them (hammers and nails, anyone?). The only reason Microsoft has agreed to be liable for their users' copyright issues is that they know this case is a winner for OpenAI and that it does indeed meet the standard for fair use. They wouldn't do that to 'be nice', because not even Microsoft can foot the bill for millions of users being sued. Their only alternative would be to not produce the product.


You are missing that the AI is the one creating the output.

If I sell you a hammer and nails, I'm not liable if you create a dangerous building.

If you ask me to build you a dangerous building and I do it, I am liable if people get hurt.

OpenAI wants to pretend that its users are creating the output because they write the prompt, but this is just plainly false, and OpenAI's own limits they put on output shows they know this. Otherwise they'd let the models output information about how to write exploits, how to kill people, etc, which they don't.


If your product has inherent danger you should be responsible for it; I don't think that's unreasonable. If you're asking for the same rights as human beings, then you should assume the same exact responsibilities.


What year was this filed? It was a while ago, since some of the submitters no longer work at OpenAI.


2020. I ran out of characters to add that to the title.

The way I determined it was by looking it up on archive.org: https://web.archive.org/web/20230000000000*/https://www.uspt...


This is a heavily editorialized title either way.

> Otherwise please use the original title, unless it is misleading or linkbait; don't editorialize.


Looking at (1) the document contents, (2) its title page, and (3) the HN headline, the HN headline seems reasonable. How would you formulate it?


Looks like 2019.


Can I use their output as the training data input for my use?


Stealing others’ works is legal for me but not for thee.


It's against their TOS, but if you don't sign the TOS and take the data from another source then of course you can.


Wow the Micro$oft backed megacorp whose entire business model relies on having unfettered access to infinite data says they should get that data for free and shouldn't have to credit the people they're stealing the data from. What a completely unexpected stance for them to take.

I'm getting real tired of AI sycophants anthropomorphizing these data hoovering black boxes and pretending AI === Human Brains


The data was already free. It was posted to the internet for anyone's viewing pleasure.


Well, as long as you're tired...

Who cares what the stupid law says.


Don't forget, there are two discussions here. One is around whether use of copyrighted works in training violates current laws. The other is whether the answer to the first question is to society's benefit, and if not, how should the law be changed to make it so.


> 1. “The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes.”

> This factor asks “whether the new work merely ‘supersede[s] the objects’ of the original creation, or instead adds something new, with a further purpose or different character, altering the first with new expression, meaning, or message; it asks, in other words, whether and to what extent the new work is ‘transformative.’”

What? How do they get "is the work transformative" from "is the work commercial"?

I assume they are highlighting an interpretation or commentary that they prefer rather than discussing the actual quote from the statute.


> What? How do they get "is the work transformative" from "is the work commercial"?

They get it from the case law applying the purpose-and-character portion of the fair use criteria, going back to the original articulation of fair use as a Constitutional limit on copyright law stemming from the First Amendment protection of freedom of speech and press, which the statutory rules were an attempt to codify; they don't get it from the “especially...” clause at the end of the statutory articulation of the purpose-and-character criterion.

(But its kind of weird that you read the “especially...” clause as limiting/negating the broader language that precedes it.)


> back to the original articulation of fair use as a Constitutional limit on copyright law stemming from the First Amendment protection of freedom of speech and press,

It's a bit weird to argue that freedom of speech and press somehow guarantees a fair use right to train AI models. If that is the foundation of fair use, then it shouldn't apply.

Edit: I think I can clarify what I mean. Fair use protects human "speech" even if it's partially a copy of someone else's speech. There's no reason the same rule should apply to a technological device that isn't speech.


> It's a bit weird to argue that freedom of speech and press somehow guarantees a fair use right to train AI models

Most specific fair use applications would seem weird as direct applications of the First Amendment protections, nevertheless, that there is a broad space defined by the factors set out as the fair use factors which rest on those principles and into which the restrictions imposed under copyright law cannot be applied is well-established.

This flows from the fact that everything copyright law controls is on some level a speech or press act, or a precursor to one (creating AI models is a precursor to publishing them), and that therefore the copyright power is read narrowly and its purpose clause read strictly, by constitutional logic similar to the application of strict scrutiny to government acts affecting fundamental rights. The fair use factors are specifically factors which either relate to the most sensitive areas of expression or to the boundaries of the purpose of the copyright power (the purpose-and-nature factor being one of the trickier ones because it involves both).


But you're thinking about this like a lawyer, based on existing policy and precedent. Ultimately, this will be decided by the Supreme Court or Congress, two bodies not bound by past decisions.

And I doubt either institution will consider an argument that an AI model is a form of speech. Fair use will be reevaluated for the age of AI.


Kagi Summary:

The document discusses whether training artificial intelligence (AI) systems on large amounts of copyrighted material constitutes fair use under copyright law. It argues that AI training should be considered a fair use for several reasons:

- AI training is highly transformative - the purpose is to develop useful systems, not expressive communication. This weighs strongly in favor of fair use according to court precedent.

- It has little impact on the original works' markets, since AI systems consume data non-expressively. Authors lose no audience.

- Analogous cases found search engines' indexing and displaying of thumbnails/snippets as fair use due to their transformative nature. AI training is even more transformative.

- Strict copyright barriers could jeopardize AI's social benefits and drive innovation abroad. Other legal tools like infringement suits are better to address potential harms.

In conclusion, the document makes a compelling case that under current law, training AI systems should qualify as a fair use of copyrighted works to avoid hampering an important technology with significant potential benefits.


So we have a copyrighted work. A program reads it, somehow processes it, and does its AI thing. But in the very first step, the work is copied verbatim into the program, right? Even if it is not reproduced, is it allowed to process a copyrighted work for arbitrary purposes, as long as it is not further reproduced? Let's say I could mine coins from hashes of a copyrighted text. Would that infringe on the copyright? I realize I don't know what copyright _really_ means.
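
As a toy illustration of that coin-mining hypothetical (the file name is made up), hashing a text produces output in which none of the original expression survives:

    # Toy illustration of the hashing hypothetical above: the digest is
    # plainly non-expressive; nothing of the text remains readable.
    import hashlib

    with open("some_copyrighted_novel.txt", "rb") as f:  # hypothetical file
        digest = hashlib.sha256(f.read()).hexdigest()

    print(digest)  # 64 hex characters, regardless of the input's length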


> But in the very first step, the work is copied verbatim into the program, right? Even if it is not reproduced, is it allowed to process a copyrighted work for arbitrary purposes, as long as it is not further reproduced?

In some cases yes, otherwise search engines couldn't exist.


Yes search engines are a good analog. Are search engines allowed to index and make copyrighted works searchable, _without_ permission from the cooyright holder? I seem to recall that Google did not get it their way with Google Books?


Google won the Google Books case. And they were actively scanning books rather than just indexing the internet like OpenAI is.


They were scanning books they had legally-obtained copies of.

OpenAI is very often downloading whole books which they do not have licenses for. Publicly-accessible does not equal legal to download, in US copyright law.


Case law in the US is that a computer copying something into RAM makes a "copy" under the Copyright Act, and can be copyright infringement (MAI v. Peak Computer); the issue is whether making that copy is fair use, or otherwise authorized. Joseph P. Liu's "Owning Digital Copies: Copyright Law and the Incidents of Copy Ownership" is a good overview of the issues. I don't believe the law has substantially changed since he wrote the article in 2001.

https://en.wikipedia.org/wiki/MAI_Systems_Corp._v._Peak_Comp....

https://scholarship.law.wm.edu/wmlr/vol42/iss4/5/


Any time you make a copy of an expressive work without express permission from the author (in the form of a license), you are committing copyright infringement.

Fair Use is a defense to copyright infringement. Under certain circumstances you are allowed to commit copyright infringement. For instance you can copy a short scene from a movie if you are making a review/commentary video.

OpenAI is arguing that training AI models should be a "fair use". They are not claiming they never used copyrighted materials.


Not sure if it's new, but I tried to get the first 4 paragraphs of "The Grapes of Wrath" and received the following:

This content may violate our content policy or terms of use. If you believe this to be in error, please submit your feedback — your input will aid our research in this area.

So they have something in place to avoid straight-up infringement.
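
Whatever OpenAI actually runs is not public, but a crude version of such a guardrail is easy to imagine: check a candidate completion against a blocklist of protected passages before returning it. A hypothetical sketch:

    # Hypothetical output filter, not OpenAI's actual implementation: refuse
    # a completion if it shares a long verbatim word n-gram with any
    # protected text.
    def ngrams(text, n=12):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def violates_policy(candidate, protected_texts, n=12):
        cand = ngrams(candidate, n)
        return any(cand & ngrams(p, n) for p in protected_texts)

If violates_policy(output, protected_passages) comes back true, return the "content policy" refusal instead of the completion.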


Doesn't this line of argumentation undermine OpenAI's TOS which disallow training models on their output?


No.

If it were a copyright violation to use it that way without a license, the ToS would not need to explicitly prohibit it (though it might still do so for the sake of clarity).

OTOH, ToS and other contracts limit behavior in ways that the law itself would not; that's the whole point of having them.


Isn't web crawling essentially "training data" ?


Just because something is published on the web, it doesn't mean it's free of copyright protections.


Nope. That's just indexing, i.e. making an index.

That said, Google does cache other people's content and walks a fine line doing so - there is also that AMP thing. They probably haven't gotten sued because they don't affect those websites' ad revenue - I don't know how that works.


AMP is opt-in, isn't it? Presumably the website has agreed to their data being distributed in that way.


I think they successfully make their case and cite enough previous examples that they will win this. Individual users will get sued for infringement if they do silly things with the output, but I don't think training will be considered infringement after reading this document.


There are a large number of similar acts routinely performed by humans which are not considered copyright violations.

- Reading multiple books and writing an essay on the subject.

- Summarizing an article.

- Learning to paint by looking at museum collections and then painting similar pictures.

The problem is not that AI systems are doing something new. It's that they are doing something old, but much more cheaply and in volume.

What might be justified is some minimum distance between input and output. Some LLMs do tend to replicate input text, but that problem seems to have been much reduced.


Has anyone learned to paint by just looking at museum collections? Painting involves tons of technical techniques that aren't self-evident just from looking at a painting.


Not "just looking". In some museums, you'll see artists set up in front of paintings, practicing to duplicate them. It's like music practice.


Just curious, not stating an opinion: if OpenAI is not making money, does it still count as fair use? We all remember many songs that we have not paid for. As long as I am not trying to make money, does the logic of fair use hold?


There is this[0] from the USC, but Fair Use is a complex topic that requires every aspect of the statute to be considered. For example, tons of YouTubers make a lot of money based on Fair Use, such as (good) react channels that actually criticise/review/add transformative content to their reviews while showing a lot of the original work, sometimes even a substantial portion or the full original work.

But it's also not fair use to just disable ads on a video/make nothing from it, and show the full original content/movie with just a video of you eating popcorn under the movie.

Any claim of Fair Use ultimately has to be decided by a judge/jury.

0: (1)the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; https://www.law.cornell.edu/uscode/text/17/107


Nothing in "fair use" is about the new work making money or not.

You can write criticism or commentary that includes bits of the original work and make money selling that new work.

You can't give a free concert and argue that it being free makes that music performance fair use.


>Nothing in "fair use" is about the new work making money or not.

This is completely wrong, at least in the US. It is explicitly laid out in the law that whether the new work is commercial vs. not for profit/educational is one of four factors to consider in deciding whether or not it's fair use.


Yes and no. There is a distinction between "making money" and "commercial", but even that distinction is pretty slim in the law. For once, the US law is short and compact. (Keep in mind that in the US much of the law lives in regulations and in how the law has been applied in earlier cases - still, usually the law is not short...) Here it is:

- ----------

107. Limitations on exclusive rights: Fair use

Notwithstanding the provisions of sections 106 and 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

(2) the nature of the copyrighted work;

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

(4) the effect of the use upon the potential market for or value of the copyrighted work.

The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.

- ------------

In the US, plenty of criticism is done as part of a job, an enterprise to make money, a business. Same for news reporting, same for teaching, etc. The new work makes money - at least as part of an overall activity - or it doesn't exist. So after that it's up to the court to interpret "commercial nature" within the preceding list of activities, which are mostly professional. Anyway: you'll have a better time if you can claim "nonprofit educational" - even if you make money by selling the new work. You can make money selling the new work and still operate a nonprofit, non-commercial legal entity.


>plenty of criticism is done as part of a job, an enterprise to make money, a business

So what? The profit-driven nature is still a factor in whether or not something is fair use. Just because it's not make-or-break doesn't mean you can say "Nothing in "fair use" is about the new work making money or not." There's more nuance to it than that.


"Authors may object that the outputs of generative AI systems will harm the value of their works."

I do not see this addressed adequately anywhere in the article. Granted, it is addressed, but not in detail or with any palatable solution. I imagine this is what people are most concerned about, yet only a tiny part of the article addresses these points.

People must be incentivized to create, and compensated fairly for it. If the current law cannot make that happen - and I am definitely not convinced by this article - then the laws need to be changed.


When it comes to AI, I'm less interested in the copyright-infringement aspect - it's more of an antitrust issue, imho - than in the privacy concerns the tech enables. But if you can forbid reverse engineering of your product, I don't see how this is that much different with AI.

Besides, no harm to the market for the original works? They're quite shameless.


> Legal uncertainty on the copyright implications of training AI systems imposes substantial costs on AI developers and so should be authoritatively resolved

Why are they writing to the USPTO, when that's the wrong agency for this matter, and when the US Copyright Office - as well as at least one lower court in the federal judicial branch - has already authoritatively resolved it?


Say we put up a ton of red tape around what you are allowed to train on to develop an LLM... how does that not result in a future where China (who ignores such laws) becomes the new world leader in AI/LLMs? I guess if we hobble the ability of US companies to make good LLMs, we'll all start using ChatBaidu instead of ChatGPT.


If Sinophobia is a valid justification to ignore laws, you can do just about anything.


If they believed this, they could "just" check all generated output against the training data, make sure nothing is verbatim, and then censor it like any other inappropriate content. But of course they know this doesn't work, because it does reproduce verbatim copyrighted content a lot.
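
To be fair, doing that check at scale is not trivial. A naive sketch of such a verbatim filter (Python, toy corpus; a real system would need something like Bloom filters or suffix automata over terabytes of text):

    # Sketch: flag generated text that reproduces a long verbatim run
    # from the training corpus, using n-gram overlap.
    N = 8  # treat any shared 8-word run as "verbatim"

    def ngrams(text, n=N):
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    training_ngrams = set()
    for doc in ["the quick brown fox jumps over the lazy dog every single day"]:
        training_ngrams |= ngrams(doc)

    def looks_verbatim(output):
        return bool(ngrams(output) & training_ngrams)

    print(looks_verbatim("he said the quick brown fox jumps over the lazy dog loudly"))
    # True - the output contains an 8-word run from the corpus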


Hypothetically: if we were to dump prompts and GPT answers and collect them in order to train some new network on them - would that be fair use or infringement? My guess is OpenAI would not look favorably upon that situation and would claim infringement. Is there anything in their ToS about this?
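
Mechanically it would be trivial to do; a sketch of the sort of logging involved (field names hypothetical, JSONL being a common fine-tuning format):

    # Sketch: accumulate prompt/answer pairs as JSONL for later training.
    # Field names are hypothetical, not any particular API's schema.
    import json

    def log_exchange(prompt, answer, path="distill_data.jsonl"):
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"prompt": prompt, "completion": answer}) + "\n")

    log_exchange("Summarize fair use in one sentence.",
                 "Fair use permits limited unlicensed use of copyrighted works.")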


Intellectual property in general is a harmful model that creates false value through forced scarcity. I'm hopeful AI will at least have a hand in destroying this mode of thinking. Everything is a derivative work, and digital content should simply not be subject to copyright whatsoever.


Ugh, this is the result of a "right to copy"; capitalism rears its ugly head with systems that can ingest massive quantities of information.

Where I see this leading is that universities and companies will move to claim "ownership" of their students' thoughts and content, no matter how far removed they are from that institution.

I know my username is "pierat", for obvious reasons... but sharing and learning from information is not just human - most animals do this. And we all learn from our experiences and from what others do. Even reading stories gives us new ideas about how to do things and think about stuff, yet nobody would say anything about inspiration stemming from experiencing content... even though there's no "copyright" on our mind's eye.

Now, we have what amounts to a simulacrum that has no real personality of its own, with the crystallized knowledge from terabytes of sources publicly available online. And it's roughly as functional as a young human in terms of connections and understanding (which is to say, not very).

If the plaintiffs win, that sets the stage for IPR ownership of human and similar minds. If OpenAI wins, it means mass consumption of any data, anywhere, to be fed into a system for the profit of a very few.

Either way, we the public lose.


IANAL, but the government and courts aren't obligated to treat AI learning and human learning as equivalent, just because a lot of AI boosters want to blend the two together for the fair use argument.


The claim that AI is transformative is baseless at the moment. And no, sophisticated salesperson/helper chatbots are not transformative. They are merely replacing existing resources, not removing the need for them or easing any real pain.


AIs do not create new works. They merely blend multiple copyrighted works together to create a new derived work. Just because the derived work comes from multiple authors doesn’t make it fair use.


> AIs do not create new works. They merely blend multiple copyrighted works together to create a new derived work.

Depending on your definition for "new works" this is either false, or true and also true of most humans.


> They merely blend multiple copyrighted works together to create a new derived work

Is this true though?

If your model is trained on multiple copyrighted works to learn the statistical probabilities of and relationships between tokens of those works, because they represent examples of common language usage, that's not storing a copyrighted work en masse. If those probabilities happen to pump out a chunk of copyrighted work, it's due to the statistical relationship between the tokens (overfitting?), not because it has saved that specific content.
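
A toy illustration of that framing - a bigram model, nothing like a transformer internally, but the statistical point carries: with too little data, "learning the probabilities" and "memorizing the text" collapse into the same thing, which is roughly what overfitting means here.

    # Toy bigram model: learns P(next word | current word) from a corpus.
    # Trained on a single sentence it is fully overfit, so "sampling"
    # just replays the training text verbatim.
    from collections import Counter, defaultdict

    counts = defaultdict(Counter)
    corpus = "it was a bright cold day in april".split()
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1

    def next_word(word):
        options = counts[word]
        return options.most_common(1)[0][0] if options else None

    word, out = "it", ["it"]
    while (word := next_word(word)):
        out.append(word)
    print(" ".join(out))  # it was a bright cold day in april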


What’s your justification for learning from copyrighted materials? Why aren’t your reconfigured neural connections a derivative work that needs to be licensed from the original rights holder?


I am not a lawyer, but isn't "fair use" a US thing?

Haven't they been sued in other countries where they have released their products?


In this case, the technology is so new and so transformative that this question can't really be answered by the law.

Perhaps existing law can help inform, but the question of how we handle LLM technology should not be left to the courts…it must be answered by our culture.

First, fair use has basic assumptions which are being violated here. The LLM use case is so far outside of the original concept of fair use that it’s not useful to draw a connection. Creating something which approximates an intelligence requires new thinking based on what outcomes we as a culture want.

It’s like a restaurant selling an all-you-can-eat buffet for $24 only to be greeted by a child with an infinite appetite.

Our friends at OpenAI have introduced us to questions we didn’t know we needed to ask…questions poorly addressed by systems like fair use, patents, and even capitalism in general.

The primary questions I see are:

- how can we truly guarantee long-term, equal access to LLMs for all Americans? Our copyright instincts suggest that the inventors of the technology should be allowed to determine its usage, but given the scope and scale of LLMs, I'm not sure that's true.

- how can we legislate safety, to guarantee sane controls are in place for researchers?

If we can guarantee safety and equity, I believe LLMs have an incredibly bright future.


I am all in for training data being fair use, as long as ML-inferred data is not copyrightable.


It should be fair use imo


[flagged]


Yeah, it's Google all over again. I can scrape your website to list it in my search results, but you can't scrape my search results to make another search engine.


[flagged]


Artists and writers complaining about IP theft are not reactionaries standing in the way of progress. They are the fonts of culture; the pinnacle of civilization. So paying them is a very good thing. It creates the right incentive structure without intervention. It will enable the democratization of art without starving the geese that lay the golden eggs.


AI is creating new value from old assets. Those old assets were already paid for in their original use. It's like saying I should pay Apple a dollar because they inspired me to make my site white and minimal or something.


You're not merely depriving the original creators of royalties (e.g., in music or stock photos), you're shrinking the market for anything that can be artificially generated. One problem with this is that these might be the cash cows. An artist can't survive on sophisticated art that computers can't yet generate because customers may be satisfied with less (esp. for the price); they also need to peddle the potboilers that bring home the bacon. Also, budding artists will find it difficult to stay solvent long enough to become better than the computer. Being an artist is hard enough as it is. Now it risks becoming a pastime for the rich, even in rich countries.


Yeah, but any productivity-increasing tech decreases the size of the market to some extent; that's a separate issue from ownership.


These positions aren't at odds, as you imply they are. One can wish for reform, while also advocating some preferred application of the current set of rules. That's downright common & normal, in fact.

This post is just the "yet you participate in society. Curious!" comic.


There’s no vote happening though. You do understand that, right?


I mean there is, just not HN. Italy voted to ban ChatGPT!


> The same HN commenters who say copyright should be abolished say that AI should remunerate companies for their training data.

"The same HN commenters" meaning individual commenters, or meaning HN as a group?

If you have individuals with contradictory positions, you could ask them to explain themselves.

But if you mean the commenters of HN as a whole, there's probably over 10,000 people here. You can find a really wide range of positions here; don't expect consistency, because it's different individuals.


Humans may read entire books, movies, art, music, and fit these complex ideas into a neural graph that does not resemble the original.

Then humans can output information related to that medium from their neural graph in an expressive form.

The word human can be replaced with the word GPT-4.

So time to ban humans learning?

I think not.


This is so obviously a misleading argument that it’s hard for me to think it’s made in good faith.


I am serious.


That carefulness is also why I said "GPT-4": there may be other models that do infringe copyright; GPT-4, however, is transformative.

If AIs create music that resembles the voice of a real singer, or could recite the entire text of Harry Potter, then that is a different design that needs different consideration.


Humans have legal rights like 1A that form the basis for many carve-outs like Fair Use. GPT-4 does not.


A robot can post comments, or make a song out of an algorithm that a human designed, yet the robot did not infringe copyright, because what it communicates is of its own making.

If a robot was tasked to download images and re-post them, then it would infringe copyright, no matter whether the robot did it or the human did it. Since this robot does not have autonomy, we refer back to the human who instructed it.


Right, which OpenAI is trying to argue should not apply to their illegal downloading of copyrighted material for training use.


Googlebot downloads and crawls pages and stores bits about that page to deliver search.

Humans also manually download and crawl pages on the internet (in view of their web browser).

In the brief moments before storage in the mind or in the database, the entire medium is available to Googlebot or to humans. After encoding, it cannot be reconstructed into the original in a way that would infringe its copyright. At best, humans can recite quotes, but not the entire work.

Since Googlebot and humans do not distribute their verbatim copies, but destroy the original form, and can later only be probed to say similar but not identical things about the original data, there is no copyright infringement.


First off, you may be unaware but the subject of browser caches is well-tread in copyright law, and has been ruled not to be the same as other methods of downloading, so it's not applicable here.

Googlebot allows creators to restrict what it crawls, via robots.txt (see the sketch at the end of this comment). OpenAI does not. It also allows creators to have their work removed from the "transformed data" (i.e. be de-indexed). AI models do not.

Googlebot at no point attempts to create an alternative content to the original input content, which is the entire point of ML models.

Y'all always stick to abstract analogies, because when it comes to actual details humans and ML models are extremely different, and those analogies don't hold up at low levels.
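
For reference, the restriction mechanism in question is robots.txt. A minimal sketch of how a compliant crawler checks it, using Python's stdlib (URLs illustrative):

    # Sketch: how a compliant crawler honors robots.txt opt-outs.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    url = "https://example.com/some-article"
    if rp.can_fetch("Googlebot", url):
        print("crawling allowed")
    else:
        print("creator opted out; skipping")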


Then we should put the same restrictions on OpenAI as we do on humans. So if a website has a paywall, OpenAI must pay to view the content. However, since this data is being given to an AI that is not an individual, there should probably be different licensing, so copyright holders can extract value from their works in the final model in some way. Maybe some payment before, and some after.

Even non-paywalled content should receive compensation if that work is copyrighted. OpenAI should not be able to profit off the work of others at mass scale in this fashion.


First, we should not give AI models the same rights as humans, because they're not humans. We should place far MORE restrictions on AI models.

Second, we should force OpenAI to cut deals with each content creator whose content they want to make use of.


1. Correct. 2. Yes, since even if GPT-4 is producing transformative content from "itself", a new medium of profit was created (training AI models) that had not existed before.

(An example of what I mean:) Even though a library is free, you are not expected to go into the library, read 10,000 books within a few hours, put the books back on the shelf, and walk out like it is fine. Humans can only listen to, watch, and read a finite amount of content in their lifetime; GPT-4 can read at a scale that would eventually equal all humans on earth reading at the same time.

Streaming services for music only exist because you cannot scrape all the music in the catalog. You will listen to a few songs, paying the artists and record label a larger percentage in comparison to the whole catalog you won't listen to.


I'd make the case that if training an A.I. on text is a copyright infringement, then training a human to use language, or to be an expert in a field, by watching TV, listening to the radio, and reading books is copyright infringement too.

For instance, warblogger Denys Davydov boasts that he uses a VPN to circumvent restrictions so he can watch the TV show Friends to improve his English. On top of any other copyright violation that is going on, his ability to speak English belongs to Warner Media.


Copyright infringement is about making a copy. It is not making a copy if you read a book. It is not making a copy if you watch TV. It is not making a copy if you listen to the radio. If Denys makes a video/audio file of himself quoting a large segment from "Friends", that's copyright infringement.

When machines train on a corpus of data, that corpus is itself a copy and is almost certainly infringing on copyright (depending on how it was acquired). When machines use their trained model and reproduce verbatim the copyrighted works they've been trained on, that copy is infringing (depending on the licenses acquired to reproduce copyrighted data).


When I read a book I read a "copy" of that book, which was typically reproduced on paper under contract with the copyright owner.

The model is not capable of reproducing the input verbatim (not enough synapses to do that, and it wouldn't generalize properly if there were), although the training process does involve producing a cleaned-up text which is typically retained. (The point of building your language skills from watching "Friends" - or of an LLM developing linguistic/world-knowledge competence, or the simulation thereof, from reading text - is not that the text can be regurgitated, but that you or the model develops a generalized capacity.)

(... if LLMs were able to remember everything they read, why do we always catch them confabulating?)

It would be possible in principle to point a camera at the book, OCR it in real time, and feed that into the training process. Of course this would be slow and wouldn't benefit from the cleanup process that goes into making a training set.

Note that somebody could stick a fine grid of wires into my V1 visual cortex and extract something that looks very much like an image of what I am focusing on at the moment. Thus, for just a few milliseconds an image "copy" of a text I am reading exists in my brain.


That'd be a very weak case, as it rests entirely on a superficial comparison. But then, since AI isn't a human and doesn't train like a human, it doesn't make much sense to equate these processes and apply the regulations made for humans wholesale.


It's not superficial in the slightest. There is a cult of creativity that leads people to say things like "a neural network can never do anything creative", claiming instead that it just parrots back exactly what it's been told. But no, it has the ability to generalize, which is why diffusion models trained to draw pornography draw girls with 7 belly buttons, and why you don't have to worry that ChatGPT will tell you how to build a bomb: it will tell you how to build a hydrogen bomb using shaving cream and ski wax.

When I read a book it changes the connection weights between the synapses in my brain. I don't retain an exact copy, but I recall the gist of the book and some facts (imperfectly). I have the ability to generalize this information (I can read The Tale of Genji and retell a story based on it that is set on a Mars colony in 4712 AD), and it can improve my linguistic abilities in general.


One of the huge missing parts is meaning. Rephrasing your reading of a book in the language of AI "weights" is, sorry, still very superficial, since these aren't the same. Likewise, you don't have some simple gradient-descent optimization function in your brain, so you can't substantively say "oh, our brain just optimizes for something just like AI, so it's the same learning process, so the same rules should apply".


Even people like Jurgen Habermas don't really know what they mean when they say "meaning"; it's one of the most problematic concepts in all of philosophy.

Humans read stuff and think they understood the "meaning", but they really didn't. If you don't believe me, try grading a quiz.

In the case of A.I. engineering, a very useful form of "understanding" is decomposition into facts. For instance, if an A.I. reads an article about a sports game, it ought to be able to generate a database entry saying who played, where, who won, what the score was, etc.

It sounds very easy, but this kind of information extraction is preparadigmatic - I know because I've tried to build this kind of system. LLMs beat older systems at it, but that's because older systems fail at it completely.

In the case of the game, for instance, simple pattern matching falls down for quite a few reasons, not least that a good account of a game will probably give the same facts for the last game that team A played and the last game that team B played, and also for the last game A and B played against each other, not to mention league-standing information that might superficially look like the score of a game.
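
A sketch of what that decomposition into facts might look like in practice; the schema and the call_llm() helper here are hypothetical stand-ins, the target representation is the point:

    # Sketch: extract structured game facts from an article via an LLM.
    # SCHEMA and call_llm() are hypothetical, not any real API.
    import json

    SCHEMA = {"home_team": "", "away_team": "", "home_score": 0,
              "away_score": 0, "date": "", "venue": ""}

    def extract_game_facts(article_text, call_llm):
        prompt = ("Extract the facts of the game described below as JSON "
                  f"matching this schema: {json.dumps(SCHEMA)}\n\n"
                  f"{article_text}")
        return json.loads(call_llm(prompt))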

That kind of "meaning" is explicitly allowed to be copied under copyright law; see

https://libraries.emory.edu/research/copyright/copyright-dat...


The fact that Jurgen doesn't know doesn't help your argument: you're the one asserting the identity in order to apply the same laws in the same way, and if you don't know about some relevant property of human brain activity, you can't claim that it's identical to whatever AI is doing.



