OpenAI's justification for why training data is fair use, not infringement [pdf] (uspto.gov)
197 points by sillysaurusx on Oct 5, 2023 | 425 comments


Skimming, this quote by itself seems enough to solidly make their case:

>>> "The “object of the original creation,” in other words, is direct human consumption of the author’s expression. 18 Intermediate copying of works in training AI systems is, by contrast, “non-expressive”:19 the copying helps computer programs learn the patterns inherent in human-generated media. The aim of this process—creation of a useful generative AI system—is quite different than the original object of human consumption.20 The output is different too: nobody looking to read a specific webpage contained in the corpus used to train an AI system can do so by studying the AI system or its outputs. The new purpose and expression are thus both highly transformative.21"

Like, you can make a transformer output some part of its training data - but not an entire work, and more to the point, copying is not the purpose of the model or its training data.


I think we should not stop innovation, but let's not minimize the impact on the producers of the content (the whole of humanity).

First of all, even if the models never reproduce any of the copyrighted content verbatim, they will soon be good enough that they will be able to replace the work of the people [1] that produced the content the AI was trained on, like writers and programmers.

Second of all, the nature of the use of the copyrighted content is purely for-profit, and the ability to squeeze profit out of the best models will stay with the largest corporations -- effectively transferring wealth from the people who produced the content to the corporations.

We need to find a way to make sure that access to the data the models were trained on, and to the resulting models, is open and fair. Who wrote which paper [2], or who supplied the GPUs for the initial training, should not really matter that much in the grand scheme of things.

[1] or at least cause downward pressure on their earning potential

[2] you could apply similar arguments here to the ones they use for the copyrighted content -- each paper (except maybe the Attention Is All You Need paper) contributed only marginally


> We need to find a way to make sure that access to the data the models were trained on, and to the resulting models, is open and fair.

Hear, hear. Unfortunately this is more or less impossible given current copyright law.

Suppose you scrape libgen and turn it into training data, then you release the training data publicly. Since the vast majority of every book appears verbatim in the training data, is this sufficiently transformative?

I think yes, it is, because nobody is going to read those books from the training data. When I made books3, I felt it was important to render each book into high quality text. But it turns out that when you convert Jurassic Park into a text file, there's no good way to read it anymore. Good luck trying to bookmark wherever you left off -- it's all one gigantic file.

But nobody seems to agree. The Danish Rights Alliance (https://rettighedsalliancen.com/) aggressively DMCA'ed anyone that hosted books3, even going so far as to DMCA The Pile from academictorrents: https://academictorrents.com/details/0d366035664fdf51cfbe9f7... with the justification that ~100 copyrighted books appear in the training data, so therefore they have the right to DMCA. Right now most of the world seems to agree with them, but I'm hoping that opinion will shift as the years tick by. Surely no one can believe that a plain text document poses a serious threat of economic harm to the original author. So the question is whether the original author should be allowed to deny everyone else the right to transform their work into a form that machines can read.

For my part, I've been planning a books4 dataset, but this time similar to LAION: it's a script that spiders libgen torrents (https://libgen.gs/torrents/libgen/) and converts all the epubs into text files. That way, if LAION isn't infringing, then books4 can't be infringing either. (Of course, hosting the actual training data anywhere is pretty hard nowadays, but it should only take a few days to convert 38TB of libgen into ~2TB of plain text.)
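(A minimal sketch of that epub-to-text conversion step, assuming the ebooklib and beautifulsoup4 packages; the epub_to_text helper here is hypothetical, not the actual books3/books4 code:)

    import ebooklib
    from ebooklib import epub
    from bs4 import BeautifulSoup

    def epub_to_text(path: str) -> str:
        # Parse the epub container and walk its HTML chapter documents.
        book = epub.read_epub(path)
        chapters = []
        for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
            # Strip the markup, keeping only the visible text.
            soup = BeautifulSoup(item.get_content(), "html.parser")
            chapters.append(soup.get_text(separator="\n"))
        return "\n\n".join(chapters)

Run over a directory of epubs, that produces exactly the kind of one-gigantic-text-file-per-book output described above.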

This is the only way to create an open source competitor to ChatGPT.


I'm actually on your side because I think copyright laws should be radically nerfed and things like books3 are greatly beneficial to society, but I wouldn't buy your argument about "no good way to read plain text files".

The text file contains all the text in the book in an (at the very least) machine-readable way; it is perfectly feasible to write a program to put it back in ebook form or to play it as an audiobook, and then the text file becomes desirable for laypeople to read.


Unfortunately this is true of any encoding scheme, short of scrambling the order of the paragraphs. And although scrambling the order might seem tempting, it destroys the ability to train large context windows — 32k context tokens is enough to fit most of a book into a prompt, and this window will only grow bigger.

If your (quite reasonable) argument holds water, then it sinks our ability to ever share copyrighted training data with large context windows. That’s why it seemed important to justify a plain text training format, since anything on top of it would be equivalent.

Indeed, one alternate training format would be to ship the raw html from every epub file, then process it into text at runtime. But this makes it trivial to reconstruct the original epub file and use it in an actual book reader.

It’s frustrating that we can’t share the epub files, because there are so many advantages: you can scrape the metadata, you can tweak the rendering to plain text, you can get semantic info from the images (and even OCR them — it turns out that lots of coding epubs show code examples as screenshots, because epub html rendering is so primitive, so this would be the only way to let your model learn from those).

All of that is why I’m leaning towards "make a script to spider all of libgen and cache the epub files locally". But I haven’t finished calculating how much disk space this would require.

I’m sad that researchers will have to wait days for their training data instead of downloading it in a few hours from a high speed cache, but it seems like any such cache would be swiftly DMCA’ed, so there’s no alternative.


I don't know if this is tangential or not, but thank you. You've helped me progress my armchair understanding of how we might give the concept of copyright more finesse in the digital age.

The insight is that encoding content is not functionally the same as copying a work verbatim, which is what the concept was originally meant to prevent.

For example: if I have legally obtained a copy of e.g. https://archive.org/details/free_culture, then I am at liberty to encode it in whatever format I need to be able to feed it to a machine/tool for whatever purpose I want. I am not infringing on copyright because I legally obtained the work, and the machine is not infringing because machines can't.

I think at least this much OpenAI has in their favor. If they can prove they legally obtained all the training material (I do think it's fair for them to be required to pay once for it) then I don't think there's any world in which it makes sense to try and allow content creators to further extract royalties from that process alone.

If a user asks an AI for a copyrighted poem, for instance, and the user goes and republishes that poem as their own, I do think it may make sense to grant the original author royalties under current law.

I really hope we can legally pick these two concepts apart and focus on each scenario independently. I see a lot of people here arguing that giving a work that you have legally obtained the rights to view to a machine model is inherently copyright infringement because you had to copy the work and yada yada is it fair use? I really think this is wrong, both in the interpretation of copyright and practically, as ideas can't be owned and it makes no sense to limit which ideas were used to make which commercial product, etc. (we don't do that today and AI doesn't change that).


>If your (quite reasonable) argument holds water, then it sinks our ability to ever share copyrighted training data with large context windows

Only if the training data is under copyright! How about data that is in the public domain, or granted a license specifically for training - Mozilla's Common Voice is an example.

Maybe someday there will be rulings that do for ML training datasets what earlier ones did to legalize cleanroom design.


Sadly, there just isn't enough public domain data. Relying on it alone would make it impossible to catch up to ChatGPT.

One way to see this is to imagine a Midjourney competitor trained solely on public domain images. The visual quality of the model will always be worse.

As for licensing, I agree for commercial entities, but there should be exceptions. If a model is open source, it benefits everyone, and so it shouldn’t need to have been licensed. There are a few reasons why this is pretty important, but the main one is that without it, the open source community has no chance whatsoever of creating cutting edge models.


I think you can reasonably imagine that if content is available in a library then everyone could organize an effort to check out the content and add it to the model. That effort sounds like useless theatre at that point so just let the commons keep an `all-booksN.zip` corpus around for the purpose.

You know, you could probably even argue traditionally that taking all the books in the world and adding them to a corpus for the purpose of creating an LLM would be a transformative work since it doesn't compete with or detract from any of the originals…


Why can't I read or bookmark a huge text file? I can and have. Books3 (though this is the first time I'm hearing of it) shouldn't be fair use because it's inconvenient to read, it should be fair use because reading is not the intent. The intent is to train computers with it.


It does contain copies of copyrighted works.


Well, yes, otherwise you wouldn't need fair use, you'd just use it.


I would like to get in touch with you about books4. Do you happen to have Discord? Or would Twitter be OK?

There are currently multiple attempts at creating what you describe as books4.


Please do! Twitter DM is the most reliable. Or if you put some contact info in your profile I can reach out.


Added, and reached out on Twitter.


I'm okay with the idea of no copyrights at all on works that are fully or partially AI derived. Of course, that introduces even more thorny arguments (what legally constitutes "AI"), but the courts can work that out.

It's clear already that companies are going to want to have their cake and eat it too. When they leverage AI to produce content, they don't want to be forced to pay licensing fees on the inputs to their models. But on the flip side, they absolutely don't want their content being used in models leveraged by other people.

Not allowing copyrights on wholly or partially AI-generated work is probably the lesser evil path to take.


I mean we never seem to care about the downward pressures we put on others, so this seems a contrived place to do it now.

My take on the matter is that facts aren't copyrightable. An LLM is basically recording facts.

I have a background in stylometry, where you basically determine authorship by recording statistics on known pieces of writing to try to find the author of another piece of writing. So, just extracting facts. You could easily now generate random sentences and filter the ones that pass a certain threshold. LLMs are basically this on steroids. So it's just facts and data, not copyright infringement.

Otherwise, you'd have to argue that software reading the text contents of a book is copyright infringement.


> I have a background in stylometry, where you basically determine authorship by recording statistics on known pieces of writing to try to find the author of another piece of writing. So, just extracting facts

The logical jump that happens around this period symbol is spectacular.


I was assuming people were familiar with the process and handwaving the rest.

You measure things like:

* What is the average sentence length?

* What is the ratio of adjectives to nouns?

* What percentage of sentences are in the passive voice?

* What is the distribution of words used?

* Is "the" used more often by the other than in general usage?

* "whence" vs "when".

* "tyre" vs "tire". Etc.

This could all be printed out and sold as a book and each measure would be a very boring fact that is not copyrightable and contains no copyrightable content.

I could give you a word frequency list from the A Song of Ice and Fire series and George RR Martin could do nothing about it. (Technically, he might have the rights to "Arya" and "Lannister", but to say "the word 'Lannister' appears 1,337 times in the series" would clearly fall under fair use).
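To make the "just facts" point concrete, here is a minimal sketch of that kind of measurement (the style_profile helper is hypothetical, purely to illustrate the measures listed above):

    import re
    from collections import Counter

    def style_profile(text: str) -> dict:
        # Rough sentence and word tokenization.
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[a-z']+", text.lower())
        freq = Counter(words)
        # Every entry is a plain statistic; none reproduces the text itself.
        return {
            "avg_sentence_length": len(words) / max(len(sentences), 1),
            "rate_of_the": freq["the"] / max(len(words), 1),
            "whence_vs_when": (freq["whence"], freq["when"]),
            "tyre_vs_tire": (freq["tyre"], freq["tire"]),
            "top_words": freq.most_common(10),
        }

Feed it the text of a novel and the output is a handful of counts and ratios - the sort of thing that could be "printed out and sold as a book" without reproducing any expression.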


Sorry, it's nonsense. You're basically saying:

1. Anything existing is considered a fact.

2. Stating a fact doesn't infringe copyright.

The only conclusion your logic can lead to is that copyright doesn't exist.

For example, I can say "the first sentence of the A Song of Ice and Fire series is <insert the first sentence>, the second sentence is <insert the second sentence>, the third sentence is...". It's still just a list of facts, so no copyright infringement, right?


I think in that case, if you went through the whole book like that, a judge would argue that you are infringing, because you can recover the whole book that way.

But if you went through and grabbed just the very first sentence from each chapter, that is sufficiently in the clear. Especially when you are doing something transformative with it, like analysing what makes for an effective opening sentence.

With copyright you always get into ugly slippery-slope arguments; Tom Scott has a great video on it.

But if you cannot extract something that replaces the original work, you should already be in the clear. I'm pretty sure Coles Notes don't need the original author's permission to print, and they are arguably a substitute for reading the original work. I've skimmed through entire series by reading the individual episode summaries on Wikipedia. These are in the clear.

The fact that a machine speeds up a human process does not change legality.


How would reproduction of fictional works or the creation of derivative works based on fictional training inputs constitute “a recording of facts”?


I think you're conflating two issues. As I understand it, this justification is about whether the training itself is fair-use. The discussion on whether the outputs are derivative or transformative is separate. Analogously, if I publish a copy of copyrighted poems from memory that would be infringement, but there's nothing infringing about me just reading and memorizing them.


LLMs - as in the models, the weights - do not contain "reproduction of fictional works or the creation of derivative works based on fictional training inputs" in any meaningful sense. "Reproduction of fictional works" is what is present in the training data. "Creation of derivative works based on fictional training inputs" is what you might be doing as a user of a LLM. The models themselves are, like GP said, recordings of statistical properties of text, just taken up to 11.


> do not contain "reproduction of fictional works or the creation of derivative works based on fictional training inputs" in any meaningful sense.

Except it does, and it's in a quite literal sense.

You painted a painting. I took a picture of it and compressed it as a .jpeg. Byte-wise, the .jpeg file has no similarity to your original painting. For someone who doesn't know .jpeg, it's just garbage bytes.

So I didn't infringe your copyright by selling this .jpeg file. The user who decodes the .jpeg file and displays it on a monitor does.

Does that sound right? This is how the "weights do not contain training data" argument works.

And before "how about artists who store this information in their brains?" Well human being is a special case for every law in every country. Just like selling a cow's liver is never the same as selling a man's, even they're both organic tissues. A human's brain is always going to be treated differently than a hard drive.


My argument is that the models are somewhere between your JPEG example and the "artists' brains" example. Now legally, it's usually not the bytes themselves that matter, but their colour (provenance). But this doesn't make the case of AI models any more clear - the training process is also somewhere between zipping up a folder full of JPEGs, and a human practicing their art through a fuck ton of repetitive reproductions of existing works until they grok the style.


I agree.

The thing I hate about this whole situation is that it's going to be decided solely on which side has more lobbying power[1], since just as you said, it's something in between.

[1]: One could say all the policies are decided this way...


> [1]: One could say all the policies are decided this way...

The silver lining here is that lobbyists can only afford to care about a finite number of things at a time - so even if they get some laws their way, other factions can push laws mitigating the damage somewhat.

I currently believe (weakly) that the outcome of more and better lobbying isn't laws getting directly worse for society, but rather the regulatory system grinding to a halt under an increasing number of laws that exist to cancel out parts of other laws...


> LLMs - as in the models, the weights - do not contain "reproduction of fictional works or the creation of derivative works based on fictional training inputs" in any meaningful sense.

Since portions of such works can be recovered in inference, they contain at least a lossily-compressed copy of the collection of works used in training. Reproduction isn’t pure coincidence.


Indeed, but learning is effectively[0] a form of lossy compression too. DNN weights are somewhere between zipping up a truckload of JPEGs and deriving facts from first principles. Where exactly they are on this spectrum, and how that affects copyright, is not obvious to me - at least not when trying to argue from fundamental principles. I feel the DNN copyright issues will be ruled on from a purely pragmatic position: what's the legal status that upends existing markets the least, and/or is most favored by lobbyists.

--

[0] - I believe it's actually fundamentally the same thing.


> Indeed, but learning is effectively a form of lossy compression

Yeah, that copies in the brain of a human who has experienced the work, and not reduced to any other media, are neither “copies” nor “phonorecords” as covered by US copyright law is pretty clear, and that this does not apply to data stored by clever algorithms in computer storage is also pretty clear, so I’m not sure what your point is.

If you want to argue LLMs are people to escape this, then, sure, copyright stops being a problem for their training data, but it still is for their output, plus you end up with a whole set of new legal problems with using them as people do now, starting with 13th Amendment problems.


> that this does not apply to data stored by clever algorithms in computer storage is also pretty clear,

I'm sorry, this is far from "clear", otherwise we wouldn't be having this discussion. Fair use is a thing. Does fair use apply in this situation? No one knows at this point.


There's a pretty big jump from "machine learning is not inherently different from human learning" to "machine learning models are people".


There's not a big jump from “human learning isn't treated as a copyright violation because it happens in people and not in external media” plus “machine learning should be legally treated like human learning” to “your argument depends on treating ML models as people”.

The fact that people making the second part of the first argument often either are ignorant of or deliberately ignore why human learning is treated the way it is doesn't change the essence of the situation.


That applies to both sides though: people arguing that machine learning should not be treated like human learning often justify it with "ML models aren't people, they're software on a computer". But that argument is also missing the point, and invites this kind of philosophical discussion. The essence of the matter is that there is no high-level principle involved - humans get special treatment because humans write the law. So we can legally round all machine learning down to "zipping a folder full of JPEGs" for pragmatic reasons, and get away with it until we actually create sentient AIs and recognize them as people.

(Which, knowing history and observing how we treat animals, will happen only when the sentient AIs coerce us to recognize them, through violence or the threat of it...)


> The essence of the matter is that there is no high-level principle involved - humans get special treatment because humans write the law.

“The purpose of law is to advance the collective interests of humans in a society” is, I would argue, a high-level principle, and (with variations as to whether it applies to all humans or some subset, and which interests are considered privileged relative to others) nearly universal. Yes, the fact that humans are writing the laws is a reason that principle is chosen, but the idea that there are no high-level principles is just false.


>“your argument depends on treating ML models as people”

They would be "legal people" in the sense that corporations are "people". Some jurisdictions have even granted personhood to non-sentient objects, such as rivers. There's no reason to get hung up on the exact word "person" here.


>Otherwise, you'd have to argue that software reading the text contents of a book is copyright infringement.

How so?


Because that's all the AI model is doing. It's reading the books.

How an entity (human or not) changes after reading those books is not in the purview of copyright.

It has not shared them (copied). It has read them, learned from them, and changed itself in response.


In order to present a book to an ML algorithm, you need to copy it, either using a camera or using other means (e.g. as a file). In the USA that copying might be fair use; it's almost certainly not allowed under UK law, AIUI.

Copying a webpage into a cache that allows presentation of the page to a user is only allowed because it is part of rendering the page to a user. Even if a computer only copies two words at a time from a source text, if it copies a substantial part overall then it still copied.

Honestly, I don't think slurping data to train ML models is allowed by copyright (though I probably think it should be, as long as any significant reproductions are then prosecuted as infringements [also, we should reduce copyright terms to ~7 years!]).

This is all my own opinion, unrelated to my employment.


> In order to present a book to an ML algorithm, you need to copy it, either using a camera or using other means

Almost all modern books are available as ebooks for the Kindle or otherwise. You aren't doing anything to the content itself that could be a violation.


So, you copy the Kindle book from a server to your local host to feed into your training algo (ie other means). Or you stream it piecewise into a buffer, still copying.


As long as you only train the network and never use it for inference, I suppose that's a reasonable argument. But a person is restricted in action by copyright regardless of the writing implement they choose to use. Whether they use a pencil or an LLM, they cannot freely reproduce copyrighted works in whole or in part, excepting some narrow conditions.


IMHO, an LLM is not "reading" a book in any way, certainly there is no parallel with how people read. It is encoding the book. That's how it can regurgitate chunks of that book later, including, in some contexts, by providing verbatim spans of text from the book.


Why is an AI model allowed to read a book without paying for access to it, but I as a human have to pay for it to read it?


The AI should, no doubt, pay to read the book!

In technical terms, these companies are trying to avoid the need to consider content licensing, which is a major violation of content creators' rights.


You aren't licensing it. You are reading it. A machine reading it is no different than a human. I mean it really shouldn't be. Should a visually impaired person using a screen reader have to "license" the book?


People come up with the most bizarre corner cases to justify freeloading on others' intellectual property.

Content is obviously being relicensed, as the models are not fully open.

On top of that, it is also being exposed as a pay-for-use subscription model without paying anything to the original authors. How is that fair use?


That particular transformation of the book does well on a fair use analysis. If the blind person were to use the screen reader to put the book on Spotify, it would almost certainly fail a fair use analysis.


But not all the works that have been consumed by LLMs are available to read without a license to do so.


You can check the book out at the library and read it. Someone has to buy it, but not necessarily you personally.


You can check out some books at libraries for free, sure.

LLMs are trained on tons of books and papers that are not available to humans for free anywhere.


Can you give an example of something that they are trained on that a human can't somehow read without buying?


Textbooks, for one example.

You are telling me that all the books in books2 and books3 can be acquired and read by myself completely free of charge legally without piracy?

If humans can get these books for free legally, why are so many people paying so much to buy them for their college courses?

I thought it's well known that LLMs are being trained on pirated content.


Textbooks aren't somehow banned from libraries. You'll find lots of textbooks in public libraries, school libraries, and other libraries where they can be read without paying for the book.

Now if you are saying they obtained the book from illegal sources, that is a valid argument. But the issue there is orthogonal to feeding the contents of the books into an LLM. It is legal for me to view and photograph the Mona Lisa. If I break into the Louvre to view the Mona Lisa, the issue isn't that I viewed the artwork. The issue is that I broke into the museum.


Yes but then the issue isn't whether copyrighted works can be used as training data, but how you can obtain those works to use them as training data. I agree that you can't justify copyright infringement based on the fact you are using it as training data. But I do not agree with those that say using a legally borrowed book as training data violates copyright.

But if you have an example of some types of books that it is impossible to borrow from anywhere on the planet please share. Textbooks are readily available from libraries.


No, they aren't banned, but not being able to find one at a library is not a legal justification for me to go and torrent it. It shouldn't be for a corporation training an LLM either.

It's widely known that some of the big LLMs were trained on book datasets that were acquired via torrents. And that those datasets do contain at least some books that are not available anywhere freely via legal channels.


There is nothing illegal about reading a book that you haven't paid for.

What's illegal is making and distributing copies of a book.

But the reader isn't in trouble for reading a book they don't own.


Really?

So I am free to torrent all the books that I want, and there is no legal action that can ever be taken against me?

If so that would be news to me.


If you start distributing torrents and copies to other people?

No of course not.

Creating copies and distributing them to other people is against the law.

That has nothing to do with what I said though.

What I said is that it is not illegal for a reader to read something that they haven't paid for.

Do you see how that is different from distributing copies of something to other people?


That's not what I am asking though...

I am asking about reading content that is not available through any free means, and pirating it in order to read it instead of paying for it from the author.


> I am asking about reading content that is not available through any free means, and having to pirate it in order to read it.

Reading it is legal.

Distributing torrents is not, though. That is the part of piracy that is illegal. The illegal part is when you create and distribute copies.

Copyright law has nothing to do with reading stuff that you have not paid for.

Instead, it is about the illegality of creating and distributing copies.

That's why it's called "copyright" law. Because it is about copies.

It is not "readright law".

No, there is nothing illegal about reading something that you haven't paid for.


I didn't know it was legal to download books from torrents that are normally only available from the author for a cost.

What about movies and tv shows? I can download them via torrents to view without paying as well, completely legally?

Nobody can ever take any legal action against me ever for doing this?

How though would the creator of the content ever get paid for it if everyone can read and view it completely free? Why would they go through all the work of creating it with no gain to themselves?


Torrents are a somewhat special case since, by protocol, downloading and sharing happen at the same time. So by running a torrent of pirated media, you are necessarily infringing copyright (there are hacked clients that report bogus information that allow you to download only without sharing, but they are easily detectable and ~nobody uses them).

As far as I know, nobody in this country has ever been successfully prosecuted for the mere act of downloading, or even having obviously-illicitly-acquired pirated media in their possession.

As to your final question, the actual answer is because people are lazy. It requires more effort and technical sophistication to maintain a movie file collection than it does to run Netflix. Nearly any piece of remotely popular content is available for free with a bit of knowledge, but it's less hassle to just subscribe to services. There is no world in which piracy actually endangers creators.


I should have used an example other than torrenting, one that doesn't automatically distribute back.

Also, I'd say there is a difference between being illegal and whether someone has been successfully prosecuted or not.

I don't understand, though, how piracy doesn't endanger creators. If a small-time creator creates something, say a video game, and everyone pirates it instead of buying it, how has that not endangered him? Simply because you don't think it's plausible that enough people would refrain from paying?


Pretty much. For the simple reason that the "legitimate" way is always easier and the people that pirate are always a minority as a result.


> Nobody can ever take any legal action against me ever for doing this?

They can if you create or distribute copies.

That part is illegal and will get you in trouble, and is why people get in trouble for torrenting.

Just reading or watching other people's content isn't the illegal part.


I think it's safe to assume that AI has to pay for the book as well, sooner or later, as shown in the deals OpenAI has made with content publishers such as AP [0]. The question is how much they should pay.

[0] - https://www.pymnts.com/digital-payments/2023/58percent-of-co...


That's not the argument being made, is it? If the AI is trained on pirated data, it's a completely different case than the underlying principle of AI meaning all generated content is copyright infringement.


But I thought they were being trained on pirated data...

Seems like a problem to me.


They're being trained on copyrighted data that was publicly accessible. The lawsuits focus on whether or not this is copyright violation, not the legality of accessing the material (AFAICT).


So the world is just ignoring the legality of accessing the material in the way that they did?

Man, I wish as a human I could use the defense of "but it was publicly accessible in a torrent" as a valid reason that I acquired and consumed some content.


Sorry, do you have some proof that they pirated content that the scrupulous copyright holders themselves don't? That's not the issue at hand and, AFAICT, nobody is accusing them of doing it.

Nobody is ignoring it, that would be a crime if they did, but since there's no evidence of it... you're being upset at your own hypothetical.


I'm really confused.

I see articles all over that LLMs for example used books3, which the creator himself has admitted came from torrenting from bibliotik, and which contains at least some books that are not otherwise freely available.

The issue that content creators have with LLMs is not getting proper attribution for their creations and source material that went into training these LLMs, and it seems pretty clear to me that some of the content that was used to train LLMs was not legally obtained and licensed to consume.


'we' who?


> We need to find a way to make sure that access to the data the models were trained on, and to the resulting models, is open and fair

Agreed. The irony is that all the current anti-machine-learning lawsuits will make it so only big companies with deep pockets will be able to continue to make & use them.


> they will soon be good enough that they will be able to replace the work of the people that produced the content the AI was trained on

Let's assume this is true. It then follows that if we train an LLM on all books that are out of copyright, the LLM would be able to replace the "work of the people that produced the content" that is just about to go out of copyright. Feed this body of work back into the LLM and you should end up with something just as advanced as what you get by feeding in works that are under copyright.

I'm not sure the assumption is true. But if it is true, using copyrighted material or only non-copyrighted material shouldn't result in a different situation.


So we need new legislation.

Edit: To be more explicit, in my view both the parent and the OP are right, and they don't contradict each other: We don't need new copyright legislation. We need new legal concepts.

I think both people are right: There is an overwhelming need to address the situation of artists and writers ("content creators"). Copyright is not being used because it's the right tool to do so, but because it's the only even vaguely plausible one. It's also clear that it's not sufficient.


We have legal concepts that fit. Ideas cannot be copyrighted. The particular expression of an idea can be copyrighted, but an idea itself is in the public domain. The ideas you or an LLM get from reading a book are in the public domain while the book (the expression of those ideas) is not.

In what way is that not sufficient?


I don't understand your argument. I concur with the GP (https://news.ycombinator.com/item?id=37780772) that there are legitimate issues around LLMs that need addressing. I say that current legal principles seem inadequate to address these issues. Your "counterpoint" seems to merely reiterate the latter. You seem to agree that these concepts cannot be used effectively to address the issues around LLMs pointed out, as what the LLM extracts is not copyrightable.

Do you simply disagree that there are issues worth addressing?

Also, typically the entity that carries ideas cannot be owned by a corporation. So there is a completely new relationship between property and ideas at play.


There is no legal concept for a machine to have ideas. I agree with you that the current legal framework is clear, but I disagree as to the outcome: without new legislation, the creation of an LLM on unlicensed content seems to be the creation of a derived work without permission of the owners whose work you are deriving from, so the creation of the LLM itself is copyright infringement.


But we already have a way to deal with derived works. Reading a bunch of books and then writing with the ideas you learned is perfectly fine. Copying a paragraph without citing it is not. Mimicking the style of authors you want to emulate is fine. Passing off their book as your own is not.

If creating an LLM is copyright infringement, then we could say that a human reading a book is copyright infringement.


But a human reading !== A computer "reading".

Why do people so easily and readily conflate the two? Is it just because it makes the AI hype sound more real, if you compare it 1:1 with a human brain?

Machines and humans are not equal, or even in the ballpark of being similar, in any way other than superficial ones, and should be (and already are) held to different standards, both legal and otherwise.


And thus copyright is not a good framework for LLMs...

It's obvious that this technology was not considered when developing current legal concepts, so if the concepts produce unsatisfactory outcomes, they need to be reevaluated.


Technically, the money comes from people who pay to use AI (customers), or alternatively to buy art from artists.

To the extent they are out-competing artists, it's diverting a money flow from artists to somewhere else. But a lot of that money might be from people who would never have bought that art to begin with?

In any case, the money never belonged to the artists. It never got to them. The people it actually belongs to (the customers) get to make the decision about what to buy.


I think I agree but that never stopped the trade associations from coming after pirates for lost revenue. Then again, that wasn't really a winning strategy since they spent more on legal fees than they ever recovered from broke college kids and single moms.


Just tax the use of the technology and give back to humanity in terms of less work/ubi/etc.


I'm really curious about the political expediency of this - the right wing tends to hate UBI, but they also resent Big Tech and are probably not fans of algorithms taking their jobs, so maybe we can come to a compromise.


"Fair use" doesn't generally include taking the author's entire work (or for that matter, his entire lifetime's corpus).

According to WP, there are four factors to consider:

1. Purpose, including whether it's commercial.

2. Nature of the work (e.g. fiction/non-fiction, published/unpublished)

3. How much was used

4. Whether the use impairs the author's ability to exploit his work

====

1. The use is definitely commercial.

2. This use consumes anything and everything.

3. This use consumes entire libraries.

4. It can diminish the market for the author's own work (observe authors complaining that GPT simulacra of their work are competing with their own publications on Amazon).

Each of these factors is to be weighed by the court; no one factor is decisive. But, for example, if you use the entire work, factor #3 weighs against you. It looks to me that the four factors all weigh against OpenAI.


The models that are trained on these inputs absolutely do not contain all of the text that they're trained on. Even with non-quantized models, the final model is somewhere in the realm of 3% of the size of the training data.

Further, these models are not used to reproduce copyrighted works... they would be one of the least efficient ways to do so.

I also don't think there's a particularly strong argument that training on a book or image decreases the market for the original book or image. To take an example, zero folks that want to read Game of Thrones are asking ChatGPT to generate Game of Thrones for them instead of buying the book.

I think what's really happened here is that artists' work has utility outside of the originally intended audience, that utility largely comes from statistically analyzing the writing, and the debate is about whether or not the utility of those statistics requires additional compensation for the artists. That question seems to hinge on the intricacies of copyright law, but as I've stated before, I think the fair use argument is quite strong.


>I also don't think ... that training on a book or image decreases the market for the original book or image. ... zero folks that want to read Game of Thrones are asking ChatGPT to generate Game of Thrones ...

For now. But when you can generate work in the style of an artist without paying that artist, I would. More importantly, media production companies will, and so will use access to people's past work to put them out of work now.

We need to distribute the benefits; we're drifting towards copyright infringement being allowed for rich corporations (with the resources to generate "AI") as long as they abstract it through a transformer, or whatever.


Crypto was a decentralized ponzi scheme. AI will be decentralized mass copyright infringement.


Copyright laws are fascist and should be abolished


> But when you can generate work in the style of an artist without paying that artist, I would.

You can't copyright a "style".


> To take an example, zero folks that want to read Game of Thrones are asking ChatGPT to generate Game of Thrones for them instead of buying the book.

Why jump straight to one of the longest popular books in English? I'm sure plenty of people who would otherwise buy poetry anthologies are asking ChatGPT to write them a haiku. It's also entirely conceivable that ChatGPT can perfectly reproduce a poem in its training set - it can produce paragraph-sized quotes.


It was a reference to the copyright suit brought by the author.

https://apnews.com/article/openai-lawsuit-authors-grisham-ge...


Where does this 3% figure come from?


These models are generally trained on terabytes of data, but are usually tens of gigabytes large (or much less if quantized). The latest true open source model, Mistral 7B, is only 3GB (0.3% of a TB) when quantized.


I did a very similar analysis with Llama 65B, trained on 1.4T tokens (≈5.6T characters, assuming a token length of 4 characters), comparing with a quantized model size of ~38GB.

The 3% number was a conservative rounding of the same calculation, but retaining fp16 rather than quantizing to 4 bits.

Here's my original back of the napkin analysis:

https://news.ycombinator.com/item?id=36681440
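For reference, a worked version of that estimate, under the assumptions stated above (1.4T training tokens at ~4 characters, roughly 4 bytes, per token; fp16 weights at 2 bytes per parameter):

    # ~5.6 TB of raw training text
    train_bytes = 1.4e12 * 4
    # ~130 GB of fp16 weights for a 65B-parameter model
    fp16_bytes = 65e9 * 2
    # Ratio of model size to training text: ~0.023, i.e. the ~3% figure
    # quoted above; quantizing to 4 bits drops it below 1%.
    print(fp16_bytes / train_bytes)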


Copyright doesn’t apply to consumption of a work, only redistribution. For example, it is not a violation of copyright for you to make many copies of a movie for personal use. Similarly, it is not a violation of copyright for an artist to learn by looking at other works, even if they then compete against the original work. OpenAI’s system is much more similar to calculating statistics from public artwork than copying and redistributing artwork.


> For example, it is not a violation of copyright for you to make many copies of a movie for personal use.

This likely varies by jurisdiction, but in the United States, copyright law covers four exclusive rights:

1. The right to make copies

2. The right to distribute copies

3. The right to create derivative works

4. The right to publicly perform the work

Making copies for personal use absolutely violates the first of these exclusive rights.


Sure if you utterly ignore all of the fair use parts of US copyright law.


The law is not clear, otherwise it wouldn't have had to go to the Supreme Court to contrive the notion of "time-shifting" to allow for making personal copies.

https://en.m.wikipedia.org/wiki/Sony_Corp._of_America_v._Uni....


Sure, which is why saying it's absolutely a violation isn't a correct interpretation. So is saying it's absolutely not a violation. Even if the Supreme Court rules on the issue, it's not absolutely anything, because that ruling only applies until a future Supreme Court potentially changes its mind.


I understand how my use of "absolutely" could be misleading. I'll attempt to clarify.

My understanding of the reality:

* Violations of those four rights give copyright holders standing to sue over duplication.

* Fair Use is a defense against that violation.

What I was trying to get at in my first post above is that it is a violation of copyright law as the law is written, and because of this, you expose yourself to lawsuits when you make copies for personal use (assuming anyone could ever find out...I'm not arguing, and never have argued, that copyright law is correct or good). The fact that you can defend yourself using a Fair Use defense is also important and a good point, but it still requires hiring a lawyer, going to court, and winning. In my book, this is quite a bit different than saying it's not a violation.

IANAL, though I follow this fairly closely, and I'm not aware of any cases that have established that copying for personal use is Fair Use; I'd be very happy to read through a decision that establishes this!


> What I was trying to get at in my first post above is that it is a violation of copyright law as the law is written

Copyright law “as the law is written” says fair use is not a violation. 17 USC § 107: “Notwithstanding the provisions of sections 106 and 106A, the fair use of a copyrighted work [...] is not an infringement of copyright.”


Right, but courts are the only ones that can make a fair use determination.


That's not true other than in the sense that “courts are the only ones that can make a determination that an act would violate copyright law outside of fair use considerations” is also true.

It's true that fair use will often be somewhat harder to predict than other aspects of copyright law, because the statute and precedent in combination are less clear, but that's not a categorical distinction.


In the US you can be sued for anything by anyone so it's a question of degrees of risk rather than absolutes.


Yes, I think we agree.


> "Fair use" doesn't generally include taking the author's entire work (or for that matter, his entire lifetime's corpus).

If someone studies every one of Stephen King's books, studies King's style, studies his plots, etc. and uses the knowledge to learn how to be a writer, there is no need to invoke "fair use" unless they are actually copying King's books. Studying an author's corpus in order to be able to produce works like the author is not copyright infringement. It is just the way we learn...and the way LLMs learn as well.

If we want to say it isn't allowed because LLMs read too quickly, then we could do that. Maybe make them not read any faster than humans. But LLMs aren't doing anything different than humans, other than the speed at which they do it.


If you read the PDF, you'd know that OpenAI admits they ARE actually copying his books, in order to train on them, and they are arguing that this is okay because they don't want to have to obtain a legal copy of the work, because that would be too expensive.

They are literally arguing they should be allowed to illegally download stuff so long as no human reads the downloaded work directly.

To further your analogy, it would be like your prospective writer objecting to being arrested for stealing the books from the bookstore because he just wanted to become a better writer by studying them, not read them for entertainment purposes.


If you are saying that they illegally obtained King's book, then the issue is the fact that they stole the book from somewhere. I don't disagree with your assessment there. But if they checked the book out from the library and let the LLM read it, the fact that it has read it shouldn't be any different than if you had read it as a human.

If I break into the court house to let my child see the artwork in the public areas, the issue isn't whether my child can see, learn from, and copy the style of the artwork. The issue is that I broke into the court house instead of going in the front door when it was open to the public.


> if they checked the book out from the library and let the LLM read it

Did you know it's actually illegal to photocopy books you check out from the library? It's an illegal copy.

That aside, the analogies between humans and ML models fall apart when you actually get into the details, which is why AI companies don't want that.

Human brains don't have RAM and disk storage that keep direct copies of works. Computers do. Human brains are not entirely deterministic calculators. Computers are.

If you have all the state data of an AI model, you can know exactly what it will produce, and the input data is a necessary component of that state data, so it's not transformative in the same way as a human brain in the first place.

2 humans will not create the exact same output given a shared starting state in the way that 2 computers (AI models included) will.

The only reason people falsely believe AI models are different is because no one actually traces through that state data, and there are enough pseudo-random variables involved and state-changes that they create the illusion of non-determinism, but that is, in point of fact, an illusion.


You are assuming that an LLM is saving a copy of the book it reads.


The LLM is the output of the training system; the LLM doesn't do the reading, the trainer does. The LLM doesn't "contain" a copy of the book.

But some LLMs seem to be able to quote on demand from a wide range of literature; these models evidently do have some kind of memory, even if it's been distilled and blended and distilled again.


No, I am actually reading OpenAI's PDF linked here, where they say they make a copy of the online corpora to train on, right at the top of page 2:

> By analyzing large corpora (which necessarily involves first making copies of the data to be analyzed

Whether they retain that later is immaterial to whether copyright infringement occurred. Deleting a movie you illegally downloaded after you watch (train on) it doesn't make it legal.


I agree with you. What I'm trying to differentiate is between the way you obtain a book and whether training an LLM on a book that you own the rights to read should be considered copyright infringement.


Unless the license you have for the book includes rights for commercial usage, it should be illegal to train a model that is being used for business, imo. In terms of individuals training models for their own personal use, I think that becomes much more tough to answer, but that's not really cropping up in the legal fights.


LLMs don't read, they are trained. Yes, training involves saving a copy of the training corpus.


This reminds me of the Google Books lawsuit back in the 2010s. Google didn't ask permission from every copyright holder but was still allowed to scan books due to fair use.


The big difference is that Google legally obtained the copies of the books they scanned. OpenAI did not. Publicly accessible != legal to download/copy.

The copyright infringement in Google's case was not about illegally obtaining the original work, it was about the snippets they provided.

In OpenAI's case, it is also about the illegal obtainment (downloading) of the original work, which they admit to in their brief, but argue shouldn't be illegal if they intend only to transform it. That element was not present in Authors Guild v. Google.


This very lawsuit is referenced in OpenAI's memo as precedent

> Perhaps the most compelling case on point is Authors Guild v. Google. There, search engine Google digitally scanned “tens of millions of books” without their authors’ permission for inclusion in a searchable digital database.


Yeah, I said that.

You're not the only commenter that has pointed out to me that the model doesn't contain the training corpus; that the model is analogous to a human writer who learns to write by becoming a voracious reader.

I acknowledge that, I think the analogy is fair. The problem is that the memory capacity of the model appears to be humungous; I've never played with ChatGPT, but I understand that it can quote accurately and at some length from a wide range of literature.

By the model, I mean the model behind ChatGPT. I suspect this "memory" amounts to having unimaginable numbers of parameters, i.e. something like overfitting. I realize that's just one LLM, and there are other models that are neither generative nor language models, and that have to be much more restrained in what they consider a "parameter".


> Studying an author's corpus in order to be able to produce works like the author is not copyright infringement.

You didn't finish the analogy.

Not just produce "works like the author" but being able to regenerate those original works word-for-word.

Is that copyright infringement, if I have the entire Dark Tower series memorized word-for-word, and reproduce passages or entire chapters on demand for others? It's just the way some people learn.


Well yes. If you reproduce passages from a book (from memory or otherwise) and present it as your own, that is copyright infringement. We have lots of well established law about how much things can be similar without infringing.

LLMs that are just spitting out full copies of what they were trained on aren't functioning correctly or haven't been trained on enough content.


> LLMs that are just spitting out full copies of what they were trained on aren't functioning correctly or haven't been trained on enough content.

So all of them? None of the LLM's are "functioning correctly"? Because I can get every LLM to reproduce passages from a book.


The copyright violation, if there is one, is committed by the person who extracts data from the model and uses or distributes it in ways not permitted by law.

Not by the person who feeds data into the model... and not by the model itself, for merely retaining an unrecognizable, unusable, highly-compressed and -transformed representation internally.

Seems simple enough. Any other judicial outcome will either halt the process of human intellectual evolution, or transfer a staggering amount of power to people who DGAF about US copyright law. So be careful what you wish for.


>Like, you can make a transformer output some part of its training data - but not an entire work

There have been many examples of getting large verbatim passages of text, or large image regions, etc., out of AI tools.

I don't think you need to use the entire work to infringe. Certainly lawsuits claiming use of part of the work have had the rights holder prevail.


You can also get “the whole work” by asking another human to recite the lyrics to a song or draw the Finder logo from memory.

What’s really happening is that AI models have much better memory than humans and are more precise in their output. It would be stupid to try and “dumb down” AI models because they’re better at remembering some licensed content.

In spirit, it’s still fair use.


> What’s really happening is that AI models have much better memory than humans and are more precise in their output.

And yet, presumably we agree that a simple file server that serves up exact copies of copyrighted work does constitute copyright infringement. What's the difference? You could also say "what's really happening is that the file server has much better memory than humans." Duh!

It sounds like you're saying that, because an AI model is a very convoluted and sometimes inaccurate way to implement a computer system that sometimes serves up exact copies of copyrighted works, it's not copyright infringement when that computer system does serve up an exact copy of a copyrighted work. I'm not quite understanding the argument.


Two things:

1) I’m generally copyleft and would argue that copyright as we know it is nonsensical in the digital era and needs to be entirely rebuilt to make any sense whatsoever. And that as is it harms the commons more than it protects creators. So yeah a fileserver has always been a game changer just like the printing press was and we’re far behind as a society, legally.

2) When a computer system does reproduce copyrighted content verbatim, it’s infringement the same as if a human did it from memory. That wasn’t my point. My point was that use of copyrighted content to train the model is fair use because it’s no different from a human consuming the content and committing it to memory.


Copyleft licenses are a form of copyright. Unless it is some other meaning I am not aware of with left-right word play.


On one hand, copyleft licenses are a creative form of copyright to enforce the wishes of the author to allow derivative works under the condition they are also distributed freely. Let's call it weak copyleft, the pragmatic variety.

OTOH there are the copyright abolitionists, who are offended by the notion that their freedom to copy and modify code on their own harddisk could be restricted by a mere "license"; to them the notion of intellectual property is poppycock. The Strong Copylefties consider the GPL a necessary evil, a way to use their enemies' tools against them, to spread their ideals of free culture amidst a corporatist hellscape.

Or it could merely be someone taking the naïve reading of "anti-copyright"


But the human that consumed the content paid for it, didn't they?


What if they heard it on the radio? Or read it in a library?


The model didn't hear it on the radio or have to access a library; it was directly fed raw copyrighted media. There are no such nuances here.

The act of getting heaps of copyrighted media for commercial use without paying clearly breaks copyright.


So if I hear a song on the radio and it inspires my commercial purposes, then what?

Point being: whether a work is used commercially is not relevant. It's common that we think it is, but it's not. I first read about an LRU cache in my operating systems textbook and later used the concept in a commercial work. I have not committed copyright infringement.


Then nothing - you didn't gather massive amounts of raw copyrighted goods and consume them for a commercial goal without paying.

These LLMs are not trained on the radio, especially not exclusively.

Edit0: Did you get access to this book that inspired you legally, as in at a library, during a class, or having bought it yourself? Was it fair use?

Because none of those look like getting heaps of copyrighted stuff for free and claiming you didn't.

If instead you stole the book - then yes, this is similar. I don't care what you do with it - you can't steal it and claim you are respecting copyright.


I am a product of the impressions left by massive heaps of copyrighted content. One song on the radio is just a rhetorical device.

If OpenAI rented all humanity’s media from a library and used them to train an AI model then that seems 100% ethical to me.

Now if you ask the model to recite the script to Breaking Bad and it does so perfectly and I think that grants me copyright authority over it then we’re going to have problems. It’s just not the model or tool’s problem.


Did OpenAi rent it? Or did they get it without respect for copyright?

How about you?

This is the point


You’re lost in the weeds. I know that’s the point; it’s why the whole song-on-the-radio thought experiment got brought up. The question was: if an AI model trains on public radio waves and hears a copyrighted song, is that infringement? My position is no, it’s not, because the radio station had a license to broadcast that song on the radio.

Similarly, if all the books used to train a model are available in the library, so long as someone rents the books, they can be used to train a model.

The question was directed at you. I don’t know why you’re repeating it back to me like I didn't know what I was asking…


Feel free to chat with a lawyer on copyright


The file server is only infringing when it serves those files. Photoshop itself isn't infringing just because someone recreates a famous art piece in it; it's the end user that is infringing. The difference between server-driven vs user-driven creation shifts responsibility.


Photoshop doesn’t provide you with copyrighted materials to work with when prompted or require ingesting other copyrighted works to work optimally. Language models do both.

The courts are going to rule in favor of these authors if they have a basic understanding of what’s happening.


> Photoshop doesn’t provide you with copyrighted materials to work with when prompted or require ingesting other copyrighted works to work optimally. Language models do both.

They actually are not that different from Photoshop. As for providing you copyrighted materials: if you instruct it to generate something someone else has already copyrighted, perhaps by using a feature meant to reproduce an existing art style, you will generate an infringing work.

As for "ingesting of other copyrighted works to work optimally", you don't know what goes into designing and building Photoshop - how many third-party datasets or copyrighted assets, which get embedded deep into the application in a form that the end-user cannot discover or consume. You don't know, and it doesn't matter, because Adobe using copyrighted materials in building Photoshop does not propagate copyright claims to you the user/renter of Photoshop. Same can be argued about LLMs - copyrighted inputs from training set get sufficiently blended when turned into weights that you, the end user, should be shielded from any IP claims related to the training data.

(Yes, the last point enables "copyright laundering", but I'm not convinced this is a problem - not compared to regulatory environment trying to prevent it.)


My point is simply that these models should not ingest copyrighted materials without paying the authors or publishers. Litigating end-users who are using LLMs that have ingested pirated copyrighted content would be so complex as to not be worth it except for large businesses. If I want to ingest/read a book I need to pay money, but if an LLM does it they’re free to pirate the book? Why?

And why is it that I pay OpenAI to generate data based off books it stole for free, when I had to pay money for the same book?

I know what I am actually paying for is the model obviously but it just feels extra wrong to be paying a company for a service it’s built using pirated content.


>If I want to ingest/read a book I need to pay money

Or you just go to the local library. Or you borrow it from a friend. Or you stand in B&N and read it in the store.

Your arguments are not great.


> Photoshop doesn’t provide you with copyrighted materials to work with when prompted or require ingesting other copyrighted works to work optimally. Language models do both.

Photoshop now has generative AI features that leverage language models as well as training on imagery, so this is literally false when discussing Photoshop as it currently exists.


What about a file server which hosts only encrypted files, which are unusable garbage on their own, that happen to turn into exact copies of copyrighted work when supplied with the right decryption key? That's user-driven creation, right?
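
For concreteness, a minimal sketch of that setup (a one-time-pad XOR; the names and key handling here are purely illustrative assumptions, not anyone's actual system):

    import os

    plaintext = b"Call me Ishmael."   # stand-in for a copyrighted work
    key = os.urandom(len(plaintext))  # held by the user, not the server

    # What the server hosts: unusable garbage on its own.
    stored = bytes(p ^ k for p, k in zip(plaintext, key))

    # "User-driven creation": supply the right key, recover an exact copy.
    recovered = bytes(s ^ k for s, k in zip(stored, key))
    assert recovered == plaintext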


I would say it’s reproduction of an exact copy that represents copyright infringement, not dissemination of the digital brain that remembered it.

If and when someone tries to profit off an AI work that would be copyright infringement if a human had made it, it should be copyright infringement if an AI does it also.


I'm not disagreeing with AI model training fair use, but this isn't the argument for it.

"(new tech) does the same thing as humans, just better" has never been a valid defense. It's like saying a human could explain the plot of a movie and draw the scenes, therefore it's okay to bring a camera into a theatre and record a movie and distribute it. Or that a human can hear a conversation and remember what was said, so there's no distinction between that and recording the conversation using a phone.


But an AI model doesn't record the original work verbatim with the goal of directly reproducing the original.

Aside: you can bring a camera to a theater and record a movie and use it in a transformative work. And a human could still be liable for damages if their hand-drawn performance of Star Wars detracted from Disney’s revenue. I’m not saying I agree, just stating the status quo.

Training a model uses the work only to calibrate weights that govern entirely independent output. The fact that it can recall exactly in some cases is a secondary effect of the technology.

Anyway my argument is that “ability to reproduce verbatim a copyrighted work is not a valid characteristic when determining whether something consumed the work fairly”.


> Anyway my argument is that “ability to reproduce verbatim a copyrighted work is not a valid characteristic when determining whether something consumed the work fairly”.

I agree with this. I only disagree with the assertion that AI or $newtech "doing something humans already do but better" has any legal importance. There are many existing laws which apply only when using technology. It's legal to drink and run, but not drink and drive, even though they both get you from point A to point B and cars just do it faster.


> You can also get “the whole work” by asking another human to recite the lyrics

Is that human earning money by reciting the lyrics? If so, that's already not allowed. It's called a performance.


Being paid is not what makes it into a performance. Having an audience, and the purpose of the recitation, are what make it performance.

If I pay a babysitter to look after my kid and they sing the child a song to get them to sleep, it’s not an infringing performance.

Even if you are paying ChatGPT to answer your questions, if you ask it to tell you the lyrics of a song and it does so, that is not necessarily infringing.

If I am preparing a legal brief for a copyright case, and I pay a paralegal to transcribe the lyrics of a song, and they do so and send them to me in an email… is that copyright infringement? It seems very unlikely.

I just can’t come to any position on LLMs other than that the users of the LLM have to be held responsible for how they choose to use the output, not the LLM provider.

LLMs need to be aware of the content of copyrighted works in order for them to be able to fully and comprehensively communicate with humans who are immersed in and aware of the content of copyrighted works.


If the LLM is recognized as infringing, could any book publisher sue HP when someone prints a copy of their book through HP Smart?


That's only half of it - the half that's been litigated via Xerox and Betamax: no, the manufacturer is not liable for what end users do with their product.

But what Xerox and Sony didn't do to build their machines is pirate everything they could get their hands on as a part of the manufacturing process.


Who says OpenAI pirated it? Unless the content was pirated in the first place simply showing it to an LLM is just like letting your friend borrow your book.

When Google crawls websites to build a search index, we don’t expect Google to pay royalties… all these analogies at least demonstrate that copyright is impossible to apply consistently and our notions about what’s fair are wholly subjective.


> Is that human earning money by reciting the lyrics? If so, that's already not allowed.

This is a common copyright misconception. Whether or not you earn money from it doesn't determine whether or not it's a performance of the work.


True, my point was that the LLM is performing for the public. Thanks for the clarification.


Covers are generally transformative, right?

Edit: I guess even if they are, SCOTUS recently decided that even transformative works can infringe if they compete commercially. So the question is not “did the reciter make money performing”, it’s “did the reciter’s performance detract financially from the original artist”.


Song covers are also a special case with something called a "compulsory license", where the copyright owner is required to license it to you - it can't be denied. You just do it and pay them preset royalty rates.


That is only if it is recorded. You can't get a "compulsory license" if you are going to perform that cover in front of an audience. That you have to secure from the rights holder.


No, covers aren’t considered transformative. I think a great deal of this discussion comes from people not understanding copyright.

https://attorneyatlawmagazine.com/public-articles/intellectu...


I personally think our law regarding covers has been heavily influenced by record labels and is wrong about them, in practice. I’ve already heard the original enough times that if I want to hear a string quartet perform Viva La Vida it’s because it’s a new refreshing piece of art. The whole compulsory license thing indicates the law got it wrong. Anyway…


The song is distinct from an artists performance of that song, but let’s move outside of music to books etc.

Suppose A creates an epic poem, B does a poetry reading aka a performance of A’s poem. C records it, then plays that recording back in public. D reinterprets the poem making a new public performance. B can successfully sue C but not necessarily D if it’s sufficiently distinct. A however can potentially sue D, C, and B if none of them got the rights. [Substitute A making a painting and B making some needlepoint copy or whatever and the same principle applies.]

This is why J.K. Rowling got paid by the people making Harry Potter movies; she could sue if they didn’t pay. Trademarks may also be involved, but even without that, if you want to make an MMORPG based on The Dresden Files or whatever, be prepared to fork over cash. Unless you follow the Disney approach and use public domain works.


'Covers' require the payment of ASCAP fees -- usually done by the facility, not the performer, but if the facility does not pay, the performer can be liable. You even need a license to put a jukebox in your bar, so that argument doesn't hold up.


The AI tools are often representing the output as something their customers can use without restriction. I'm pretty sure that wouldn't work in your analogy. If I'm an agency and a customer asks for jingles, can I recite large parts of lyrics of copyrighted songs for them to use...as if I made them up?


This is not a new phenomenon.

Google image search produces copyrighted and restricted-use images. On clicking an image it includes a little caveat warning “Images may be subject to copyright. Learn More” - but no specific attribution or copyright claim. It’s possible that if you go to the source where Google found it you’ll find the attribution there, but it’s also very likely you won’t.

If an AI tool just says ‘this might be subject to copyright’, is it all good?

Words to that effect appear, for example, in the GitHub copilot terms and conditions.


There are people claiming everything coming out of these tools is transformed and wholly unencumbered.


Yeah I’m understanding the nuance much more now. There is a difference between: is it okay to use copyrighted content to produce OpenAI’s product, and is a verbatim reproduction of a poem fair use.


That’s a naive ‘what color are my bits’[1] mistake - classic software developer mindset.

Level 1 programmer naïveté is just ‘bits are bits, it doesn’t matter where they come from. Bitwise identical things are indistinguishable’.

Level 2 naïveté is when you accept that bits have color depending on how they came to be arranged thus, and that there are processes that get rid of the old color on some bits, and replace it with a new one. But then you figure - like a programmer - that if you compose that process with some other process you can get rid of the colors you don’t like.

Enlightenment is realizing that the law cares not one jot for the specific processes you apply to bits or their colors but criminalizes (or at least proscribes) particular actions and cares about things like intent.

[1] https://ansuz.sooke.bc.ca/entry/23


How is this any different than something like Photoshop? You can recreate (and therefore copy) a piece of art and it's infringement, but not on the part of Photoshop. Yet, Adobe is still well within the right to say you can use what you create with Photoshop. Why can't AI tool makers have the same claim?


AI tools themselves may be the infringement. Photoshop is not.


"You can also get “the whole work” by asking another human to recite the lyrics to a song or draw the Finder logo from memory."

You seem to be speaking as if this somehow would cleanse the copyright status of the work in question, but it wouldn't. If you memorize a book, or a friend does, and you or your friend recite it to someone who transcribes it, the result is still copyrighted by the original entity, and if you try to sell the result, you'll be on the hook for copyright violation. This would do nothing to the copyright status whatsoever, so whatever argument you're trying to imply doesn't hold.


The only difference between the two cases is which human violated copyright. If you ask a musician to play a cover of a famous song without the requisite royalties they violated copyright. If you instruct a machine to do it you violated copyright. Machine has no thoughts, head empty, does not know what copyright is, does not know abc's.

Which is also why machines can't create copyrighted works either. The standard example is that making a machine to generate random images doesn't copyright them, but if an artist chooses some of them because they look good then he may copyright them.


> In spirit, it’s still fair use.

Spirit crafted by people who couldn't even begin to imagine LLMs. The proper answer here is new laws clarifying the stance of copyright and LLMs/generative models in general, not trying to reason like a 19th century person about 21st century tech.

The EU's approach is much more sensible - there is this new thing with vast ramifications, let's sit down and see what legal framework is needed for it.


People know when they are reciting something they have memorized. Do language models not remember having memorized what they're writing?


How large? Copying some verbatim is still fine under fair use.


Copying entire works is fine under fair use in certain circumstances, too. (See, e.g., time-shifting.)


For written text? Not aware of any examples, can you share any?

(To be clear by copying I mean repeating the text in your derived work, perhaps I should have been more explicit. The crux of this case is whether “copying into training set” counts as copying the whole work. )


The first example that comes to mind is reproducing an entire copyright-protected article, either as a single copy for personal use by a researcher or as multiple copies for classroom use.


That's an interesting example because it illustrates how fair use is in part defined by its effect on the market.

A teacher copying a New York Times article verbatim and distributing it to a classroom for the purpose of analysis or criticism is likely to be held by the courts as fair use. On the other hand a teacher photocopying an entire textbook (or distributing a PDF) would probably not be considered fair use.

Both are copying educational material in full, but they have a different impact on the market. Distributing copies of an old news article to 30 students realistically isn't a direct substitute for 30 subscriptions to The New York Times. But a textbook publisher's business model specifically revolves around selling its textbooks to students - the teacher is distributing their copyrighted material to their target market and removing any incentive for the students to pay the publisher.


Yes, the damage to the market of the original is a key factor in a fair use analysis.

It makes you wonder why "Open"AI started getting into legal trouble only when they closed everything down and became a commercial project. When they were a research project, stepping on copyrights was a lot more acceptable (both socially and legally).


Right - fair use was found in Williams & Wilkins Co. v. United States (decided by the Court of Claims and affirmed by an equally divided Supreme Court). You can walk into a library and make a wholesale copy for research purposes. The case involved a medical journal publisher suing medical doctors, who made a strong argument that their ability to make copies was a matter of life and death. Winner: Xerox.

https://en.m.wikipedia.org/wiki/Williams_%26_Wilkins_Co._v._...


Have you heard of 'quotations'?

Plenty of works quote and cite other copyrighted works, and it's considered transformative because usually the work is building upon or commenting on the quote.


The context was copying entire works in this particular leaf of the main thread.


Yes, and plenty of people write criticisms of entire song lyrics or poems, effectively republishing the entire thing.

This is actually a huge area of debate on YouTube, as Google often errs on your side of the argument (they can choose to demonetize whatever they want), even if the work is technically not infringing.


Copying verbatim passages can be fair use if it is transformative.


Outputting unattributed copies of large passages of text to end customers, with the implied okay to use it any way they want, though... that's what these tools do.

The end user often has no idea they just received something with potential IP issues.


Does the model even have enough information to be able to _know_ that though? If it's simply using numerical probability to tie a series of tokens together into text, that doesn't store enough information to understand that _this specific sequence of tokens_ represents some specific copyrighted work. Storing that information would _actually_ seem to fall afoul of copyright.

The fact that it can spit out chunks of copyrighted works is driven by the input token sequence and the model weights pointing to a specific path that has an ever so slightly higher probability of being the expected output, right? It's not like the model stores the copyrighted work directly.

(Yes, I know the algorithms are more complex than what I expressed, but the general idea holds in my understanding)
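
As a toy sketch of that idea (a bigram counter, nothing like a real transformer, and purely illustrative): a model that stores only next-token statistics can still emit a training sentence verbatim when decoding follows the highest-probability path.

    from collections import Counter, defaultdict

    corpus = "call me ishmael some years ago never mind how long".split()

    # "Training": count how often each token follows each other token.
    transitions = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        transitions[prev][nxt] += 1

    # Greedy decoding: always pick the most likely next token.
    token, output = corpus[0], [corpus[0]]
    while token in transitions:
        token = transitions[token].most_common(1)[0][0]
        output.append(token)

    print(" ".join(output))  # reproduces the training text verbatim

No string of tokens representing the text is stored as such; the verbatim output falls out of the per-transition statistics.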


The model doesn't know anything - people personify LLMs too much. It is a mathematical text predictor that has almost certainly ingested the text it is copying verbatim to string together the words it is reproducing. The fact that it is a highly compressed representation of its training corpus (and thus doesn't "know" that it is copying something) is not an excuse.

I think you could make a good argument about this if you could prove that the text being spit out verbatim is _not_ contained in the training corpus, but that is not the situation we have today.


> The model doesn't know anything - people personify LLMs too much.

Perhaps reading people's posts a little less literally would help the conversation. I obviously know that the model weights don't 'know' something in the sense that a human knows something, but the model does store information. That information happens to (primarily?) be the statistical likelihood of one token following another token. What it doesn't store is a string of tokens that represents Sarah Silverman's latest book.

From what I can tell, all this angst comes from 3 or 4 related, but different, issues.

1. Did companies break copyright laws when assembling and using training for these models?

2. Does the model represent some form of copyright infringement in and of itself?

3. Does a model's ability to output chunks of copyrighted work have some implication of the legality of the model itself? (using said copyrighted chunks is already a solved issue)

4. Do we as a society owe it to humans benefitting from copyright the continued ability to create copyrightable content without competition from ML models?

I think comingling all of those points is doing everyone a disservice.

My assertion was only about #2 and none of the others. I feel like it's a clearly demonstrable situation that these models _don't_ infringe copyright directly. That being said, I am obviously not a lawyer, and my opinion is just that, an opinion.

FWIW, my general feeling on all of the points is: 1. Quite Likely (but fair-use is a fickle thing), 2. No, 3. It shouldn't, and 4. No, but we need to think through the long-term societal implications of ML decreasing the amount of human labor needed across all markets and come up with a plan that doesn't involve our fingers in our ears.


I would assume that #2 is actually "yes" given the way derived works work and the fair use tests, and that #1 and #2 are actually very much linked. Fair use is a defense to copyright infringement, and it's a relatively complex balance of factors. It's relatively inarguable that the model is a derivative work of its training set (even OpenAI isn't arguing otherwise): they are just arguing that they have fair use rights to the contents of the training set. Only one of the factors in a fair use analysis is how transformative the use is, and I think it's hard to argue that training an LLM isn't a huge transformation. However, the other factors weigh pretty heavily against LLMs here, and the Authors Guild lawsuit is a pretty good set of arguments as to why. It's up to a court to decide whether the transformative nature outweighs the other factors.

If you're lumping the fair use question into #1 and the "is it a derived work" question into #2, I'm pretty sure that nobody on any side of this is arguing that it isn't a derived work. Once you have a derived work, you can either ask whether they were licensed to produce that work (no) or whether it was fair use (possibly).


On a purely technical level, I feel very strongly that the model is not a derivative work any more than a dictionary is a derivative work. The model is a recording of token stream probabilities. The training data is, in my mind, where the conversation about derivative work and fair use comes into play. Again, I'm not a lawyer and have little exposure to the legal profession, so my opinions are my own.


There's a big difference between a model and almost every other form of compilation of human knowledge, which is when the human intervention occurs. A dictionary is a collection of definitions from humans that is derived from their experience reading texts. Human knowledge and experience moderates the flow from the text the human reads to the text in the dictionary. LLM training has no such moderation, and that is a significant factor in whether something is a derived work or not.

This is really all up to the courts, and I don't think anyone is confident about how it will shake out. However, the fact that OpenAI isn't even trying to make the argument that an LLM isn't a derived work of the training set (they are going straight to fair use, which is an acknowledgement of infringement) suggests that this is not actually contentious.


We might also consider that the LLMs will take traffic away from original sources, much like Google Search showing the relevant passages that answer your question. It's not always about outputting the exact passage, though this point may be tangential to the post & case at hand.

My primary usage of LLMs fits this pattern. I use Bard for simple programming questions, typically something I would have previously searched for via docs/reference/SO. Now I don't have to filter the results or skim the pages for the right section; it's just given to me, and I can ask/make follow-up questions or corrections. This is what makes them much better than vanilla search: the ability to iterate and refine.


I think in some ways LLMs have shown people want simple answers to their questions when they pose it to a search engine.

Stack Overflow was good because it had answers (one chosen as right) and a little bit of discussion about why and how and other considerations.

Much of the web is verbose to an extreme to get the search engines to rank it highly. People don't want that, they just want answers.


This is exactly right, and hopefully the LLMs can get better at adding in the

> discussion about why and how and other considerations

If (when) they know when/how to ask better follow-up questions, that will take it to the next level.


> We might also consider that the LLMs will take traffic away from original sources, much like Google Search showing the relevant passages that answer your question.

That said, Google has motivation not to overdo it, as millions of websites are part of their ad network. If people stop visiting the sites, they lose money too.


Paradigms have changed; chat will become a primary modality for search in time. This is why Bing and Goog are both finding ways to get ads into our chat experiences. We might be forced away from ad dollars being the primary way to pay for the internet. With data privacy, ideally content-based ad placement over individual targeting (think magazine-style ads), and the move to paid access to sites or content... there are trends that may combine to break the current foundations.


I'm all about getting rid of copyright.

This is...bullshit.

> patterns inherent in human-generated media

That's what you hope your model is, not what your model literally is!

> The aim of this process

That's a lovely thought. Now let's get back to talking about the process itself.

I'm not interested in treating LLM tech like a black box. That's intentional stupidity.

OpenAI's entire argument hangs on not understanding how its tech works! I can't see any difference between that and fraud.


>OpenAI's entire argument hangs on not understanding how its tech works! I can't see any difference between that and fraud.

This is how I see it as well.

If you can't explain, in detail, how your mechanism is not violating copyright then you have no defense.


That's not how law works in the US. You are presumed innocent and there must be a preponderance of evidence of guilt.

If they can't prove the AI model contains or reproduces exact copies of their copyrighted work, there's no case to be had.


No one said otherwise. There is already a case against them, their defense is that they don't know how it works, so it can't be true. This was clearly laid out in the comment I replied to, and the comment that reply was made to in turn.


But we can. There are countless documented instances.

OpenAI describes these as "limitations". A more accurate word would be "features".


Is this true? As far as I am aware there are zero pixel-for-pixel recreations by an image-generating AI of any pre-existing images, and no written work has ever been word-for-word reproduced in full. The closest I've seen are images highly similar to the Mona Lisa, which is already copied virtually everywhere without AI, and brief passages from commonly copied written works that veer into hallucinations after a paragraph or two.


Why does it need to be exactly the same? You have shifted the goalposts to absurdity.


Because the parent comment says "But we can" in response to the claim that AI can't generate "exact copies." Generating exact copies would be a lot more legally notable than outputs that are only similar or partially copied. As far as I know exact copies haven't happened, and I was legitimately asking for cases where it has.


> That's not how law works in the US. You are presumed innocent and there must be a preponderance of evidence of guilt.

True for criminal cases, but this is a civil case, no?

So it's balance of probabilities, not beyond reasonable doubt.


Civil cases use a preponderance of the evidence standard.

Criminal cases require proof beyond a reasonable doubt.

In any light, yes, civil cases are about whether you’re more likely than not to be liable.


Isn't the point of transformer training for it to learn to imitate the distribution of the training data? While concepts of "imitating the distribution" and "copying verbatim" are different, they are not too far off each other either.
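
For reference, the textbook next-token objective in miniature (toy numbers only, nothing from any specific model):

    import math

    # Toy "model" output: a predicted distribution over a 4-token vocabulary.
    predicted = [0.1, 0.6, 0.2, 0.1]
    target = 1  # the token that actually came next in the training text

    # Per-position cross-entropy loss: -log p(target). Training adjusts the
    # weights to shrink this, i.e. to raise p(target), which pulls the
    # model's predictive distribution toward the training distribution.
    loss = -math.log(predicted[target])
    print(loss)  # ~0.51

Minimizing this over the whole corpus is exactly "imitating the distribution of the training data".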


I expect it is very much the entire point: to copy the human intellect producing individual arts (a.k.a. the patterns inherent in human-generated media), so they may be replicated by other individuals lacking the specific intellect or training to produce their own.


You are reproducing the abstract content, not the work itself. If your argument was valid, summaries of books and offerings like Blinkist should be illegal as well.


When "abstract content" has sufficient cover to generate indistinguishable replicates of the original is it still abstract?

In contrast I could read Blinkist for years and it would never make me write like Hemingway.

Note: I am arguing for sport here and not from some deep resentment of reality.


See, for me this section also jumped out, but for the opposite reason.

They are intentionally being deceptive about the aim of AI training. The aim is not to create a model, for the sake of having a model, the aim is to create a model that produces content for "direct human consumption".

I could use their argument to say it's fine for me to illegally download movies because I intend to simply create a corpus of works, and whether that corpus will be consumed by humans afterwards is (by their logic) irrelevant. Or that I should be able to download them because I only intend to transform them into AMVs.

You still have to legally obtain the original work for Fair Use to apply, and in many cases they are not legally obtaining them. "Publicly-accessible" has been ruled repeatedly in copyright law not to inherently imply "legal to copy", but that is exactly what they're claiming should be applied here.


But the creation of a useful generative AI system is just an intermediate step toward the ultimate goal of producing output for direct human consumption.


With Chain-of-Thought and Mixture of Experts, it may be that, in the long-term, most LLM output is primarily consumed by other LLMs [1]

[1] (meme, the future of communication) https://www.reddit.com/r/ChatGPT/comments/123tyge/the_future...


Yes but there is transformative work involved in the process, which is what the model does. It's not a non-transformative simple display of the work (say when I show a movie), but deeply transformative.


But it's still illegal for me to download a movie without a license, even if I just intend to make AMVs with/ "transform" it.

OpenAI's entire argument is that they think this shouldn't apply to them, because they don't want to have to make deals with the original content authors in order to get legal copies to use in their training, because that would be expensive.


It wouldn't just be expensive, it would basically be impossible to figure out the content owners of random works on the internet at the scale needed for the LLMs to output the high quality content that they do output. Even if they strike deals with the large content owners, they need way way more content than that.


Too bad. No one has a right to mass data collection.


Musical genres are a weird parallel to me: humans "training" on a giant dataset and creating an output that is distinct from any of the source material. We've decided that using that "training data" is fair use.


Humans are not AI models


All art is a variation of other art and the first art was a variation of nature.


Makes sense that a transformer is transformative lol


I think that even summaries are not fair use when they get to a certain length. The CliffsNotes company has to license the material that it summarizes; why not OpenAI?


Do they have to, or do they choose to, to avoid a costly court case?


They have to, it's a "derivative work".


The point isn't the point.

The reason a work was created is immaterial to copyright law. It would be like saying, I'm not consuming this media for enjoyment, but to learn how to create media, therefore it's ok that I'm not compensating the creators for that media.

That would not fly.

They want to be able to flagrantly ignore copyright because the whole enterprise becomes prohibitively expensive otherwise. You'd have to go and track down the source of every item fed into the model and request permission. Even without actually paying anyone, that's a time investment no one has.


> The reason a work was created is immaterial to copyright law.

Quite to the contrary, it is central and not at all immaterial to fair use, which is both a part of statutory copyright law now and, more importantly and even before it was incorporated as such, a constitutional limit grounded in freedom of expression on the power of the Federal government to have and enforce copyright law.


Give me a case where the fact an original "was created for human consumption" was used as a successful fair use defense.


The linked paper describes the conditions for fair use as given in 17 U.S.C. § 107, of which the first seven words are "the purpose and character of the use".

Law is a field where context is essential: to analyze a particular set of facts you need not solely the text of the law, but the full historical use and interpretation of that text in regards to other sets of facts. Speculation based on what one imagines the law to be is rarely useful in understanding law or legal arguments.


The purpose and character of their use, the infringing use, not the purpose and character of the original.

The rest of that segment is "including whether such use is of a commercial nature or is for nonprofit educational purposes;"

And then you go on to say "context is essential" while using the text of the law to justify your opinion. The "full historical use and interpretation of that text in regards to other sets of facts" has shown that original purpose has little to no bearing on trying to claim fair use.

Also, the list is not exhaustive. Those factors are considered, but they're not the entire test. Even the purposes themselves are examples, not an exhaustive list.

So why the original was created has no bearing on whether or not you've infringed copyright.

Their main thrust, however, seems to be that they're metaphorically making papier-mâché whales from shredded copies of whale books. Yes, the entire text of Moby Dick is on that whale, but it is not Moby Dick; you can't extract Moby Dick from it, and you couldn't even tell Moby Dick was used in its entirety.

Which is a much better argument than "bruh, we gave it to a computer".


Neither of us is purporting to discuss the original intended use. Your quote:

> It would be like saying, I'm not consuming this media for enjoyment, but to learn how to create media, therefore it's ok that I'm not compensating the creators for that media.

> That would not fly.

The copyright statute explicitly contradicts you, by requiring courts to consider the purpose and character of the infringing use. Perhaps you meant to make some other argument?


The implication is that the creator made the media for enjoyment and not for study, therefore if you're not enjoying it you can't be infringing on copyright. I can see how that was not clear in my statement.

Because part of OpenAI's argument seems to be based on the intent of the creator on how the media is to be consumed. Which doesn't matter.


If you liked reading this, you might also like reading through the Authors Guild class action lawsuit against OpenAI, filed a couple weeks ago: https://authorsguild.org/app/uploads/2023/09/Authors-Guild-O...

(Notably, John Grisham and George R.R. Martin are plaintiffs.)

It cites OpenAI's justification (which is how I found it) and tries to refute it. Whether the court will agree remains to be seen.

Other lawsuits: Silverman vs OpenAI https://www.courtlistener.com/docket/67569254/silverman-v-op... (complaint PDF: https://storage.courtlistener.com/recap/gov.uscourts.cand.41...)


> It cites OpenAI's justification (which is how I found it) and tries to refute it.

I actually didn't see any refutation in either complaint. I'm not sure that I would have expected to, given that fair use is an affirmative defense, but I would certainly be interested in an attempt to address it.


Partly. One of the refutations is that they say there will be substantial economic impact on authors from generative AI, which is one of the four criteria in OpenAI’s fair use claims (point #4).

You’re right that the complaint seems to go “all in” on this one particular point, rather than listing reasons to refute points 1 through 3. That seems surprising, but maybe it’s a good legal strategy since there’s less attack surface to raise objections.


The economic impact discussion to my reading is actually intended to establish standing, by showing a concrete and particularized harm to the plaintiffs. Generally, a well-formed complaint is trying to establish that there exists a 'case or controversy' which the court is able to redress, not necessarily to answer all questions of law, and generally not to make prophylactic arguments about your opponents' hypothetical defenses. The USPTO document is being used as an example of OpenAI admitting copyright infringement, which they seem to have been happy to do -- one hopes that they fully anticipated this lawsuit.


Just wanted to say, thanks for co-reading it with me. This was fun, and your commentary is insightful. I hope you have a nice week(end).


One thing to note from the USPTO doc is that it describes the reproduction of substantial parts of the source corpus as an error condition. This bears on part of the Authors Guild complaint, that ChatGPT was at one point able to reproduce chunks of copyrighted text. OpenAI would like to persuade the USPTO that this would be an erroneous output of ChatGPT. This seems like a persuasive position, but not necessarily dispositive.


Summary of their arguments for why training on copyrighted datasets is OK:

(1) Training AI systems on large copyrighted datasets constitutes fair use under current copyright law. They believe it is highly transformative, does not harm the market for the original works, and serves a different purpose than the original works.

(2) If it were an infringement, it would severely hinder AI progress, which has many societal benefits. Copyright law should promote creativity and innovation.

(3) Other legal tools like infringement suits on AI outputs, anti-scraping laws, etc can address valid concerns instead of rigid copyright laws.

Expanding on (1), they say:

(1.a) The corpora used for training are not made publicly accessible - they are only consumed by machines during training. So authors do not lose any potential audience.

(1.b) The outputs of AI systems do not regenerate significant portions of the original works verbatim. The systems learn general patterns, not the specific expression of any particular work.

(1.c) Well-constructed AI systems do not overfit on any particular work. Each work contributes very minimally to the overall model.

---

I think we should not be stopping progress, but this document 100% minimizes the impact/harm.

Soon AI writers and programmers will be so good that, even if they do not reproduce copyrighted content verbatim, they will be able to replace a large % of the future work of the people that produced the copyrighted content the AI was trained on. This seems like a clear harm.

Given that the datasets are the product of the whole population, the whole population should be able to reap the benefits. As things are heading, only the largest corporations will be able to squeeze profit from the most powerful models.

This goes against (1.a), if one company can use the data for training, it should be made easily accessible to everyone.

Side note for (1.b): many companies implement this as a dumb filter on top of the LLM output, making sure that no N-character substring of the output appears in some copyrighted text.
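
A minimal sketch of what such a filter might look like (the window size and the exact-match index are assumptions on my part; real implementations presumably hash the windows to work at scale):

    N = 50  # assumed window size, in characters

    def build_index(protected_texts, n=N):
        # Collect every n-character window from the protected corpus.
        return {text[i:i + n]
                for text in protected_texts
                for i in range(len(text) - n + 1)}

    def blocked(output, index, n=N):
        # True if any n-character window of the output appears verbatim.
        return any(output[i:i + n] in index
                   for i in range(len(output) - n + 1))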


> Soon AI writers and programmers will be so good, that even if they do not reproduce copyright content verbatim, they will be able to replace large % of the future work of the people that produced the copyrighted content the AI was trained on. This seems like a clear harm.

Note that "harm the market for the original works" is only a part of Fair Use and harming the original work does not necessarily make your use of it 'not fair use'. Movie critics using movie clips can absolutely destroy the market for the original work while still being fair use criticism.

Now whatever the USPTO says/does is important for now, but it's a backdrop to the legal challenges that will arise. Because the USPTO has to adhere to new laws and must follow whatever the courts rule regarding how copyright is defined, if the USPTO tries to say "ai is not fair use" then the courts can disagree with them and issue a Court Order to recognize certain copyrights, whenever someone challenges the USPTO's actions.


Movie critics harm the original work / producers of the work in different way: they dissuade the consumption of the work, rather than providing a 1:1 replacement of it.


I'd argue that a lot of people love to watch a movie only to find its flaws, and CinemaSins might remove their need to go buy/watch the film legally. Although that's probably 1/100th of the market for a movie, if that.


I cannot abide the argument that obviating work is a harm. That's luddism, which has proven every single time to be more harmful in the long term than embracing new technology, which may have harms but only in the short term.

I also think you're overestimating the short term harms. I believe human writing will stay valuable, especially the premium creative variety. It's the un-creative, mechanical writing that's becoming somewhat obsolete.


> I cannot abide the argument that obviating work is a harm.

That's a misrepresentation of the argument. The actual argument is that taking someone's work sans-permission and building on it in order to obviate the original work is a harm, because you would not have been able to obviate their work without their work as a starting point.

In other words, you didn't do most of the work, you just altered it, and then want to claim credit (and legal ownership) for it wholesale.


I'm not sure whether the courts will determine if this is fair use or not. It's a difficult concept to define rigorously. If it's found to be fair, I'm arguing that its effect on work cannot be considered harmful to society in the long term.

"you didn't do most of the work, you just altered it, and then want to claim credit (and legal ownership) for it wholesale" is a good description of all scientific research that has ever occurred.


> "you didn't do most of the work, you just altered it, and then want to claim credit (and legal ownership) for it wholesale" is a good description of all scientific research that has ever occurred.

And this is why we shouldn't allow copyrights or patents for broad concepts, only for specific methodologies/ implementations/ designs (patent) and exact works (copyright). But in this case they are not starting with a broad scientific principle like e=mc2, they are starting with an exact work/ book.

Learning the concepts taught in a textbook and then making your own textbook from the generally-derived knowledge is OK.

Taking the textbook and creating a list of notes, and then making a textbook from those notes, is copyright infringement and plagiarism.

ML models are not learning anything about the concepts behind the works they analyze. They are creating patterns of metadata about the content of the work itself, not the concepts therein, and creating an approximation of the content from those patterns in turn. This is directly akin to the metadata notes approach.

A more direct analogy would be reverse-engineering of software.

Legal RE is entirely possible, but requires a strict non-analysis of the software content itself, only the inputs and outputs can be analyzed. Never the software itself.

If you examine the content of the software, even if you do not actually copy any code, you are (potentially) committing copyright infringement or even IP theft.

All ML models do the latter; they look at the book/ song/ art content itself, and then attempt to produce something with similar effects.

Is the case a sure win against OpenAI? Not at all; courts are notoriously non-technical, and prosecutors may also not have the technical knowledge to properly contextualize the actions of OpenAI in a way the judge(s) can understand.

But do I think that based on all evidence it should be a sure loss for OpenAI? Absolutely.


That's a fine ethical argument for progress, and copyright law is indeed about encouraging the arts and sciences alike, but OpenAI is trying to specifically make an argument about "Fair Use", which does consider whether the new work obviates the old one and prevents those original authors from being compensated since people will just buy the new one instead.

What OpenAI is offering is a commercial product that obsoletes an entire industry of creative output - why pay any author or songwriter a dime from here on out? Just pay $19.99/month to OpenAI.

If that's progress in technology, great, but they're arguing that the training data and subsequent transformation is "fair use" - that the original authors get no say and no compensation.

I'd be all for commercializing the collective works of humanity if we got to redistribute the profits to those producers who actually made the superintelligence possible (ie, everyone) but whats not cool is a single commercial entity taking the worlds' copyrighted work and turning a profit from it without a single dime in due compensation.


This notion that a "single commercial entity" stands to benefit is very wrong [1]. There will be fierce commercial competition, and many open-source non-commercial options, for everyone to benefit from this technology.

I'm all for redistributing the gains from automation and economic concentration generally. I see the ideal future as one where so much of the economy is automated that everyone is free to spend their time as they choose, like doing creative or social work. But I don't mean for this to become an argument about UBI. I just don't see any difference between your arguments and the classic Luddite arguments, and nobody seriously thinks that was a good idea anymore. People can still make money weaving and knitting things by hand; it's non-commodified and therefore considered premium. I'll still write software even in a future where 90% of today's widget-churning jobs have been automated; it will just be more interesting.

[1]: https://news.ycombinator.com/item?id=35813322


I think we should not try to forcefully preserve jobs that are no longer needed, AND I think it could cause harm in the short term, and lead to further transfers of wealth.


I'm curious as to your solution to providing comparable (better?) options for those people who have been displaced? Agreed this will cause further transfers of wealth, but with the caveat that this will probably be upwards (it always seems to be) and in denser concentrations leading to long term harm.


(not the parent poster):

Treat data like oil, a public resource, and redistribute the profits to everyone equally.

I don't love the idea of taxing profits and letting the government send checks out, so being creative, how about we force dilution of the stockholders of whoever wins the AI wars and make every citizen a beneficiary of dividends.


I see no problem with this as long as the dividends are equal to the potential of anyone displaced or otherwise having their marketshare impacted.

EDIT: otherwise, if I lose a 120k+ (or whatever) job a year programming because one engineer and a cadre of llms is 'good enough' for management, and all I get is $15 a month for my portion of the collective dole, it's not really a solution.


> (2) If it were an infringement, it would severely hinder AI progress, which has many societal benefits.

This is one of the weakest and most horrific arguments I have ever seen from OpenAI supporting the unlicensed training on images / data for commercial purposes as 'fair use'. There is more to AI than LLMs, and nothing is being hindered. They are just inventing excuses to avoid paying for the data they are using for training.

The fact that OpenAI obtained permission from Shutterstock for its training data for DALL-E 2 suggests that they knew Shutterstock would sue them if they commercialized it without permission - just as Getty unsurprisingly caught Stable Diffusion-generated images containing their watermark.

Either use public domain images for free, or just pay for the sources for your training data if you're commercializing it. OpenAI already did that once already and they have plenty of money to do it again.

No excuses.


> Soon AI writers and programmers will be so good, that even if they do not reproduce copyright content verbatim, they will be able to replace large % of the future work of the people that produced the copyrighted content the AI was trained on. This seems like a clear harm.

That is not the harm that copyright is meant to protect. It's not there to keep lamp lighters employed.


I agree we should not try to preserve jobs that are no longer needed. Though I don't think there's any copyright involved in your analogy :)

I just think we should be clear about the potential short-to-medium term effects / harm. And given how transformative this technology can be, we may need to come up with solutions that don't fit into existing frameworks -- e.g. making sure that the datasets and models are open access.


> dumb filter on top of the LLM output

Just to provide an example here: I know the Dune quote was a popular example of a request that crashes the session, but in my case I wasn't even trying to get GPT-4 to trip. I just wanted to know what a song was about, so I asked "what is ballroom blitz about", and it described the song accurately but went on to quote the lyrics. It interrupts itself and starts over mid-lyrics, before trying again and quoting the lyrics inside a code block instead, which also gets cut off. Very clumsy implementation; you'd think they could be fine-tuned to not quote source material (unless they want to offer a version that does quote source material to a different market segment, without bothering to fine-tune again).

https://poe.com/s/eW2tp487r1pa1RiVyDbH


“clear harm” meaning what role of the government involvement, to you?

as it stands I’m not sure that observation or rebuttal has any bearing on anything aside from “oh well”


Well, I am definitely against the kind of involvement these large companies, which are set to benefit the most, are arguing for - which is limiting / gate-keeping access.

Rather, I could imagine the opposite -- some sort of guarantee of access to everyone, at fair prices. There are different ways to do this.


okay, yeah definitely watch out for that

I see the same arguments applied to already free, open source, copyleft licensed models


Something that is very disturbing in seeing people compare AI models to humans is the seeming implication by many of them that humans only have rights because a law says they do, and that there are no rights inherent to being a human (which AI models are not).

Arguing that because a human can do something, an AI model should be allowed to as well, shows a remarkable lack of value of humans.

Humans are more valuable than AI models, and should retain many inherent rights that AI models should not.

AI models are just pieces of software, they're not actual intelligence or sentience or life. And companies using them are just profit-driven companies, not benevolent child-rearers of a new lifeform.

This entire case is just a corporate power-grab seeking to profit from the work of others for free without legal repercussion, and it's sad seeing so many people jump to support the corporations.


> Humans are more valuable than AI models, and should retain many inherent rights that AI models should not.

This argument feels disingenuous. There aren't people out there trying to get AI votership or property rights. Even the most ardent AI supporters will agree that AIs shouldn't automatically enjoy the inherent rights we enjoy as human beings. This is a pure strawman.

The issue at hand is copyright laws, and IMO the conversation is a necessary one. With or without AI, a significant amount of people have been wondering whether or not the current set of legislation is effective in enforcing the spirit of the copyright law, which is to incentivize and protect original ideas. The fact that we have AI's that can learn and iterate at blazing speeds was just a catalyst that exposed some large holes in our legal system - the holes were always there.

The question now is where do we draw the line for AI? Even if everyone in the world agrees that "Humans [...] should retain many inherent rights that AI models should not", we still need to decide where that line should be drawn.


This is honestly the best response so far, and I agree that my statement was probably too ambiguous.

I am not claiming that anyone says AI should have all the rights of humans, I'm saying that people are arguing that:

-> the Right existing for humans (to organically learn by ingesting works of others)

-implies that-

-> learning must also then be allowed for non-human learning (i.e. AI models) as a Right

That argument does not make sense to me unless you start from the assumption that "learning" is what grants the Right (i.e. that everything has a right to learn, computers included), rather than being human, which is what I mean by "devaluing" people.

I agree the court case is necessary, but only as a vehicle to enshrine what should be an obvious truth:

Humans learn from experience whether they endeavor to or not, so they literally cannot ingest media (e.g. for the purposes of entertainment that the creators intend, per OpenAI's brief) without assimilating it into their mind.

That is not true for computers. AI models do not consume content for the purposes set out by the creators. They consume them for business purposes, and to allow them to do so, without license or permission, completely destroys copyright protection against corporations.

We do not need a new law; the current law already doesn't allow companies to use copyrighted works without permission. We just need the existing law enforced against AI model companies.


It's the other way around. Arguing from similarity to humans is a counter-argument to the idea that a process similar enough to a human practitioner[0] should be covered by IP laws as if it were a regular piece of software - IP laws that are mostly archaic and already stretched to near-breaking point to cover computers in the first place. It's not about whether or not humans have inherent rights. It's about logical consistency.

This is perhaps a "failing"[1] typical of STEM/tech people: expecting laws to be consistent. Consistent in a way similar to mathematics, in that for any ruling you can trace the arguments back to underlying interpretations of laws, and trace those down to written rules, and trace those to some more fundamental rules, and eventually to some sort of intuitions about morality and fairness. And at least to me, the argument that generative network training ought to be treated like compiling and obfuscating a regular dataset of copyrighted works seems justified merely by "it's different because computers".

This is not to say that current models are learning and creating art and text the same way humans are, but rather that the process just seems to be close enough. And the point of mentioning logical consistency is this: I may agree with copyrighting the living hell out of LLMs for pragmatic reasons[2] - because it shakes the boat too much, has potential to destroy livelihoods of too many people in too short a time, while further centralizing power, etc. Those are all valid arguments. "Because it's a computer", to me, isn't. Not when the process and effects are already eerily similar to how humans work. Not when such argument would apply just as much in a hypothetical future where we develop sentient AIs[3].

--

[0] - In very limited scope, but also the very scope subject to legal issues.

[1] - I don't really consider it a failing. There's both beauty and efficiency in things making some kind of sense.

[2] - Intellectual property laws themselves are mostly pragmatic in this sense anyway.

[3] - Even in this future, we'll have to face pragmatic issues. I read an interesting take on this long ago, I think in one of Eliezer's essays: how do you handle democracy, fair resource allocation, basic ethics, in the presence of AI people that exist in silica? Such AIs will likely be able to reproduce much faster than humans - bringing new individuals into existence at the speed of factories pumping out GPUs. Equality and democracy are all fine until suddenly there's 10 trillion of AI people, and only 10 billion of human people, and everyone gets an equal vote. How do we deal with that?


> the process (of human vs AI learning) just seems to be close enough

It only seems that way because the people who understand it best also stand to profit from it, and are choosing to misrepresent it so that people who don't understand it falsely believe it is at all close.

That so many people on here are pushing the false narrative that they are at all similar is sad, since this is ostensibly a forum for people with technical knowledge. Nothing about machine learning is similar to human learning, by any measure of evidence. If I am wrong, show me that evidence.

"Data go in, something happen, metadata come out" is the closest you can approximate the 2, and too many supposedly tech-savvy people seem content to treat that as "close enough".

> "Because it's a computer", to me, isn't. Not when the process and effects are already eerily similar to how humans work. Not when such argument would apply just as much in a hypothetical future where we develop sentient AIs.

The argument is not "because it's a computer", it's "because it's not sentient", as I mentioned in my comment. Animals also have rights, because they are living.

ML models are none of those (living, intelligent, or sentient), and the popular conflation of ML models and "AI" with "AGI" - the kind of AI that may or may not ever end up existing - is just a convenient excuse by companies looking to maximize profit and minimize licensing costs, not a reason to regulate the plain old Python that we're actually talking about now.


> that humans only have rights because a law says they do, and that there are no rights inherent to being a human

It may be disturbing, but that is the nature of reality. We humans indeed have no inherent rights, only those written in law. Sometimes the law says words like "these are universal and apply to all", but thinking those words really have that power because they say so is like thinking Harry Potter has to be a true story because none of the characters say "I'm fictional" when under the influence of Veritaserum.

> AI models are just pieces of software, they're not actual intelligence or sentience or life

Life is just a collection of chemical reactions.

Putting "intelligence or sentience or life" next to each other like that implies you think these are connected.

I don't think "life" is a requirement for either of the other two.

If the abilities demonstrated thus far by AI do not meet your standard for the word "intelligent", what does?

If intelligence were a prerequisite for sentience, would that be with a hard cut-off of "you must be IQ ≥ XYZ to be sentient"? Or do you suspect that sentience is a variable, with less intelligent systems also being less sentient, but also no cut-off? If there is a cut off, where? Why?

When I was a kid, we could happily assume that intelligent computers were pure sci-fi; now models like GPT-3 have a complexity similar to the brain of a medium-sized rodent. The scale may be important or may be a red herring, but we should take the questions seriously and try to come up with actual answers to each before it matters.
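
For rough scale on that rodent comparison, a back-of-envelope only - the biological numbers are loose estimates, and parameter counts are at best a crude stand-in for synapses:

    # Back-of-envelope only; parameters and synapses are not directly
    # comparable, and the biological figures are rough estimates.
    gpt3_params = 175e9        # published GPT-3 parameter count
    rat_neurons = 2e8          # ~200 million neurons, a common estimate
    synapses_per_neuron = 1e3  # order-of-magnitude guess
    rat_synapses = rat_neurons * synapses_per_neuron  # ~2e11

    print(gpt3_params / rat_synapses)  # ~0.9, i.e. the same order of magnitude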

I'd rather treat a mindless automaton with unnecessary kindness than treat one that is able to suffer with cruelty; and on the off-chance that full-brain-uploading becomes a thing, I'd want to know for sure if an upload of me could or couldn't have qualia.

> This entire case is just a corporate power-grab seeking to profit from the work of others for free without legal repercussion

Well, some sure is. Some's given away for free, so it's not capable of being a power-grab.


> a corporate power-grab seeking to profit from the work of others for free without legal repercussion, and it's sad seeing so many people jump to support the corporations

lol who do you think posts on HN? a tech startup blog run by an incubator. half of us are those people, or aspire to be.


I'm a Bay Area startup alumnus, so closer to the actual VC scene than most on here, and all the actual successful founders I know have a pretty strong belief in the good that their products will bring, even if they usually overstate the likely impacts quite a bit.

"Move fast and break stuff" != "...and run people over in the process". Are there greedy a-holes in every space? Of course. But if you think that's the default mindset at startups, I don't feel like you're going to have much success in the space.

Do I think OpenAI probably thinks that their work is less impactful to Stephen King's sales than he is claiming? Absolutely.

Do I think they're lying to themselves about the actual negative impacts of their work because they think their product is really cool and has so many potential uses? Absolutely.

Do I think they secretly all just want to be the guy with the boot, crushing artists' livelihoods and pushing for regulatory capture, like you seem to be implying they (and others here) are? No.


A shoe has no rights. No right to loiter, enter an airport, or climb a mountain.

Nor does an AI.

A human has a right to use a shoe. Or an AI.

I see no reason one tool is inherently profit-driven while the other benevolent.

(An exercise left to the reader: whether leather or pages are the more renewable and humane resource.)


> A human has a right to use a shoe. Or an AI.

What right is that, exactly? I don't remember the "Right to use a tool".

You are trying to make "using a tool in a way that harms people" have greater legal or ethical weight, by falsely calling it a right, than the harm they're doing.


The argument was not "human can do it so should machine qed". The fact that humans so easily do something is a talking point in a large subthread, not the conclusion.

Yes, an AI model is just a digital machine that is presently hard for many people to distinguish from magic.

> This entire case is just a corporate power-grab seeking to profit from the work of others for free without legal repercussion, and it's sad seeing so many people jump to support the corporations.

This is where you're wrong. There's very clearly an active and relevant dialog surrounding the concept of machine generated content and how it should be interpreted by society (now that machines are getting really good at doing things that were difficult in previous decades). So at a surface level it's hardly decided.

Further, your assertion that people who use copyrighted works to create a product that provides value to society are unfairly being enriched at the expense of copyright holders is naïve and uninformed by reality. People use impressions of copyrighted material all the time in their day-to-day lives, in everything they do. I read copyrighted textbooks to learn skills required to provide value to society. I read the news all the time and make decisions based on it. So to say that OpenAI can't feed a machine the text of a popular book (one it has legally obtained a copy of) in order to calibrate weights used to help it emulate reality is not based in any current notion of fairness, even as protected by copyright today. A naïve reading of copyright would argue you had to copy the work to do that and you don't have the right to that; a sophisticated understanding would argue that it's fair use, because it happens billions of times a day, it doesn't realistically compete with the author's market for the book, interoperating with a machine is protected under the DMCA, and it's a damn computer - anybody launching an ereader to view a copyrighted work is fairly copying bytes around in memory all the time; that's just how digital things work. Ideas can't be copyrighted.

What is arguable is whether there's a problem if an LLM reproduces parts or all of a copyrighted work. Some people think AI should launder copyright, some think the human generating the work should be held to all the normal standards of not plagiarizing or otherwise committing copyright infringement in what they decide to do with the LLM output, and some argue OpenAI should pay a royalty every time it does (or something to that effect).

Long story short, this isn't some random corporate power grab. As a fellow human, I very much believe society can navigate the introduction of LLMs without devolving into "every piece of training data must be specifically licensed to be used as such, and don't forget royalties". I don't like the society I imagine if we take that stance (frankly, I don't like copyright as it is deployed today), and I believe the value LLMs provide will do more good than harm, so it's irresponsible for society to unduly limit the proliferation of this technology in a bygone effort to protect old-school content creators (well, really publishing conglomerates and large media corporations).


This is a lot of text just to say, "I care more about ML systems getting better than I do about whether they do so ethically".

> people who illegally (ftfy) use copyrighted works to create a product

Yes, this is something that most people who aren't illegalists agree is wrong.

> that provides value to society

This is just wholesale Kool-Aid.

AI is a tool. The tool doesn't provide value, only the use of a tool does, when that use is positive. The use can also be negative.

This case is not about the legal permissibility of using copyrighted works to train AI. That is a misrepresentation.

This case is about the legal permissibility of using copyrighted works, against the wishes of the works' creators, to train AI, which is clearly wrong.

"What do a few individual humans matter versus the advancement of humanity?" is an argument that has been invoked many times, by the evilest people.


That's not my argument, and if you really interpreted my comment as such we are probably at an impasse.

If I write a poem and you buy a copy and feed it to your LLM and make a billion dollars I don’t think I’m innately, ethically, or fairly entitled to a single dollar of that money. I think you’re within your rights to do that since you paid me fairly for the copy and my understanding is that you can use that copy as you see fit because you own it.

I fundamentally don’t think you understand the role that ideas and the derived creative expressions thereof play in society. For humanity to progress ideas must flow freely. Attempting to put roadblocks in front of the free exchange of ideas is the evil profit grabbing motive at work. The reality is that any one individual is not entitled to collective royalties from the commons just because they voiced an idea. In my opinion that’s whats poisonous about this whole thing.


> and you buy a copy

This is the step that OpenAI and other companies are missing. They didn't pay for the content they are using.

> For humanity to progress ideas must flow freely.

AI models do not, and will never, progress humanity. The ability to have a computer write, draw, or vocalize for you does not change you.

You are fundamentally misunderstanding machine learning. It's just fancy python code that approximates a derived pattern. Nothing less, nothing more.


> This is the step that OpenAI and other companies are missing. They didn't pay for the content they are using.

Where's the evidence that OpenAI pirated all the content they trained models on and didn't access it normally just like everyone else? It's quite the allegation that OpenAI torrented `all-media-archive.xz` to train their model.

> It's just fancy python code that approximates a derived pattern. Nothing less, nothing more.

You are sorely mistaken if you think that's all that people are throwing billions of dollars behind.

Did the steam engine progress humanity? Did the internet progress humanity?

Obviously LLMs themselves aren't autonomously progressing humanity. But what they do as a tool to democratize (hopefully we make sure it's democratized) cheap expression already is pushing us forward and creating new space for humans to imagine and express new content that we've never seen before. You'd have to be blind or ignorant to not see that. Just go join MidJourney's Discord and see for yourself.

And it's not just art. More and more people with less knowledge of high computer science are able to complete simple scripting level tasks with the help of LLMs. Even experienced professionals are accelerating their development flows and productivity by leveraging AI to skip a lot of the mundane exploratory/discovery level research that comes before diving into a task. I personally know people who use AI code tools in their workflows today.


> Where's the evidence that OpenAI pirated all the content they trained models on and didn't access it normally just like everyone else? It's quite the allegation that OpenAI torrented `all-media-archive.xz` to train their model.

This shows a fundamental lack of understanding of copyright law on your part. Getting a PDF of The Shining by googling/ web scraping, versus getting a PDF of it by torrenting, does not make obtaining the PDF any more legal. Publicly-accessible != legal to download.

Only the grant of permission by the copyright owner or a licensed distributor (when talking about digital copies) gives you the right to obtain a copy.

There is substantial evidence (which OpenAI freely admits to) that they downloaded copies of copyrighted works without permission by the copyright holders. That makes it copyright infringement.

> Did the steam engine progress humanity? Did the internet progress humanity?

Not in the least bit.

Caveman hit caveman with stick progressing to caveman hit caveman with nuke is not a progression of humanity. Trains and the internet did not make humans any better than before, they just consolidated power into certain groups.

Mistaking technological progress with social or certainly evolutionary progress is why people roll their eyes at techbros.

"Cavemen stop hitting each other" is progressing humanity, and that won't be a result of technology.

> You are sorely mistaken if you think that's all that people are throwing billions of dollars behind.

I know exactly what I'm talking about, and that is exactly what ML is. And if your only argument is, "surely people wouldn't sink millions of dollars into something unless it was revolutionary!" you've never dealt with corporate software licensing and development.

People pay Splunk billions of dollars to run regex on their logs that they could write themselves.

People pay billions of dollars for (F)OSS just to have proserv support contracts they never use.

Billions of dollars are wasted every day across the world for B2B cruft.

ML is cool and all, but it's just code. Code running on a computer. Code you can run on your home pc. It's not magic, or inscrutable, or greater than the sum of its parts. Data go in, data come out, same as every other program.


Can you point me to where OpenAI admitted they illegally acquired copies of works for training data? I’ve always said “if you acquire the training data legally then all is fine”. I never asserted that anybody definitely did acquire it legally. If OpenAI torrented some books then they should be made to pay for those.

Anyway at this point you’re being deliberately obtuse with your tirade about how a freely flowing exchange of ideas and culture (and tools that facilitate as much) does not progress humanity. No point in continuing to argue…


> Can you point me to where OpenAI admitted they illegally acquired copies of works for training data?

Yes. It's on the first page of the linked PDF:

> For this response, we draw on our experience in developing cutting-edge technical AI systems, including by the use of large, publicly available datasets that include copyrighted works.

They then go on to try to claim this is just how everyone does it, as though that makes it okay, including making copies of the copyrighted works:

> Modern AI systems require large amounts of data. For certain tasks, that data is derived from existing publicly accessible “corpora” (singular: “corpus”) of data that include copyrighted works. By analyzing large corpora (which necessarily involves first making copies of the data to be analyzed)...

> at this point you’re being deliberately obtuse

No, I just suspect we fundamentally disagree about the value and importance of many technologies.


Well I never argued that under the current rules OpenAI shouldn’t have to acquire an initial legal copy of copyrighted works, just like libraries do. So we agree.

If I were OpenAI I'd argue that a corpus built for the training of an AI model is transformative, since it doesn't detract commercially from the original works. No one wanting to train a model would otherwise be using that text to read the book instead.


You are misunderstanding the purpose of transformation as a Fair Use criterion; it's there to prevent using copyright claims to quash parody and critical (as in criticism) discussion of the copyrighted work - for instance, using copyright to attack a journalist who quotes shitty writing in a book review, or a comedian who alters a quote to make fun of a Disney character, etc.

Not only is training ML models not a form of critical expression that enjoys 1A protection, but using the book itself to train a model isn't transformative; you didn't alter the book, you used the book verbatim to train on. That the model doesn't retain the book doesn't matter; copyright doesn't only cover output (i.e. a derived work like said comedy skit), it also covers input (how the original work is obtained), and in this case the works were not obtained legally (which means with a license for said use).

> I’d argue that a corpus built for the training of an AI model is transformative since it doesn't detract commercially from the original works

The reason OpenAI isn't arguing this is that they know it would be DOA. You cannot just package copyrighted works together, or in a different way, and claim that is transformative; there is direct case law about that, because torrenters tried to claim exactly that, and lost. In fact, several cases have explicitly ruled that transformation does not apply when the entire original work is present.


One could easily argue derived work is sufficiently different from the original to avoid a violation.

ChatGPT doesn't just steal and slightly modify the original. It fully copies the style and feel, which aren't copyright protected.

You can't just change the color of Mickey Mouse, call him AI Mouse and be good, but you can certainly create a new Disney-style character without triggering copyright violations.


> It fully copies the style and feel, which aren't copyright protected.

Is that for certain? I have a feeling this point could be debated.


"style and feel" are entirely subjective.

If I give DALL-E or MidJourney a picture of myself that I took (and therefore I own the copyright), and ask it to "Recreate this picture in the style of Pixar", am I violating Pixar's copyright?

What if I decide to model it myself and deliberately imitate Pixar. Have I violated copyright?

What if I decide to model it myself, and it just happens to look similar to Pixar, because I've watched a lot of their movies and it has influenced how I model people in 3D animation. Have I violated copyright?


What if an AI model took the Barbie movie and added a filter to change the hue of every pixel and altered the voices to have Australian accents instead of American accents?

I wonder if Disney would sue if you tried to sell and distribute despite 100% of the pixels and audio file not exactly copying the original.

(Rhetorical question, Disney would definitely sue)


I think early YouTube movie uploaders tried this angle but ultimately failed.


Question - does the model knowing what the style of Pixar is, and knowing how to reproduce it, constitute trademark infringement instead?


The answer to all of your questions is no


Exactly. The court would consider other things like the impact of your modified copy and whether it competes financially with the original.


https://creativecommons.org/2023/03/23/the-complex-world-of-....

"Style and artistic expression are certainly linked, but they are not the same thing. In some instances, mimicking an artist’s style may constitute copyright infringement, but when this is true will necessarily require a case-by-case analysis. Simply working in a style that is similar to another artist does not automatically mean the work is a copy. Moreover, there may be legitimate reasons to produce works in another artist’s style. The law should allow artists to be inspired by each other, even if that means having similar styles. Doing otherwise would chill expression, ultimately harming human creativity and the purposes of copyright law itself."


FTA: “courts have rejected the argument that commercial fair use is by its nature invalid”

They don’t mention https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,.... There, initial settlements had book authors share revenues with Google, and Google was allowed to go ahead with the project because Google Books complied with (from that Wikipedia page) “the four traditional factors […] that decide whether use of a copyrighted work constitutes fair use under United States copyright law”, stating “Google Books enhances the sales of books to the benefit of copyright holders.”

Of course this is a different case, but I think it’s similar enough for a judge to at least consider whether what OpenAI does is to the benefit of copyright holders.

Unless they start revenue sharing with copyright holders or paying them, I don’t see how they can be.


Yeah, it seems like so many people in here are shilling for AI companies over actual creators, it is absolutely depressing.

No one is arguing that OpenAI should not be allowed to train on copyrighted works; they are arguing that OpenAI should not be allowed to train on copyrighted works FOR FREE.

OpenAI should, just like anyone else who wants to make use of a work (even in a transformative way, like having a clip from a song or movie playing in the background of their movie), have to make an agreement with the copyright holder that allows use for training. Of course, OpenAI doesn't want this because actually paying all the artists who their business relies on would be too expensive, so they just want to claim that they should be allowed to use the work anyways.


> Yeah, it seems like so many people in here are shilling for AI companies over actual creators, it is absolutely depressing.

It's the general sentiment on HN. Remember this forum is based in SV, moving fast, breaking things, tech eating the world, all that.

This is a really complex issue. Personally I really dislike copyright as it has been used in practice up to now: as a way to lock culture and progress away behind the control of megacorporations.

But now we have the case where the megacorporations are suggesting that they should be immune to such restrictions, free to bypass them given that their bots "transform" the original works. It actually seems like copyright is the only possible check on their power.

I can't claim to know how the matter of law will be resolved here, but "always bet on the biggest corporations winning" is a generally useful heuristic to apply.

By far the worst outcome would be that megacorps are deemed immune from obeying copyright restrictions when consuming creative works, but are somehow still protected by copyright when their bots spit out derived works.


100000% agree.

I get that this is a forum for "future millionaires" (wink wink), but you'd think that us IT folks should be both the most capable of seeing through the b.s. hype that companies put out trying to mystify software to the masses, and the most aware of the inhumanity of corporatism, since we're surrounded by it in our daily work.

I love tinkering with machine learning as much as the next Bay Area techbro (which I am), but that doesn't mean I'm going to give companies a free pass to exploit creators further just so I can have GPT4All generate cooler prompts on my macbook.

It's wild.


"The fair use of a copyrighted work [...] for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

(2) the nature of the copyrighted work;

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

(4) the effect of the use upon the potential market for or value of the copyrighted work."

-- 17 U.S. Code § 107 (https://www.law.cornell.edu/uscode/text/17/107)

I don't know how one can read this as an impartial observer and make an honest argument that OpenAI is in the right.

Their use of copyrighted material does not fit any of the purposes enumerated in the first paragraph; it fails criterion #1 because it is of a commercial nature; it fails criterion #2 because it includes all kinds of works; it fails criterion #3 because it's not limited to very small extracts; and it fails criterion #4 because their products are already having an obvious effect on the market.


Incidentally, this will provoke a reaction: companies will make their content unavailable on public sources, to avoid working for free to feed AI freeloaders.

The opposite view is also valid: SEO types will figure out how to inject their BS into models so that the models recommend their stuff.

As AI models come to replace the "search" market, they will become as useless as today's search tech.


I don't think it fails (3) specifically, and the others are moot if it doesn't fail (3). 99.9999% of the time, it just straight up does not reproduce any concrete part of the piece.

It just reproduces some very very hard to quantify tiny fraction of the logic or idea of the document.

It's like saying someone is infringing copyright if they're able to recite a document they read, because it has had an unimaginably tiny effect on their general writing skills and they can recite it when asked to.


but it also fails on the very philosophical framework used to come up with those laws

they're using philosophical frameworks older than digital computers; which is OK, and is as it should be - up to the point where those ways of understanding fail to capture certain qualities of computers and digital technology, qualities which break the whole notion of copyright and have been doing so for a few decades now.

those lawmakers are using obsolete philosophy!

but we gotta wait until they all die off; it's not like people, especially older people, are willing or even able to change how they think

I blame digital technology, it just doesn't work like the rest of reality does...


using obsolete philosophy has never stopped a supreme court judge from upholding shit written by the founding fathers


but they're using the 'same' obsolete philosophy as the founding fathers used when writing those things

so it's correct that they do this, else they would change the meaning behind those words.

what is needed is for Congress to approve new philosophy, so they can use it to make better (meaning up-to-date) laws which are appropriate for this technological epoch

and what is needed for that is that the academic community comes up with new philosophy

and what is needed for that I do not know


Your post would be more substantial if it were informed by the arguments against the points you raise by the linked article. In general, looking only at the text of a law is insufficient as a basis for legal analysis.


and after that court case, I’ll go after other accepted fair use works under a 14th amendment challenge saying it needs to apply equally to other use cases simply because ads are being used, or a university charges for a book, or a publication charges for subscription access

which I’m fine with - don't think I’m trying to deter you with a slippery slope fallacy, we’re already at the bottom of it


The use is using it for training data. I'm pretty sure there isn't much of an effect on the market for training data.


And what does the training data then do? Just sit in a box, or produce content for "direct human consumption" that would have market impacts?

If all anyone has to do to invalidate copyright law is introduce an intermediary step where it is used for a non-consumption purpose, then copyright is already dead.


Frankly, the point of copyright - an artificial monopoly on the reproduction of a work - is to encourage people to create by allowing them to financially benefit from their works.

If letting AI train on whatever its masters want results in a reduction of creative works being created ... I think governments might just amend the law to fix that - rendering all the "lawyering" by OpenAI over current copyright laws moot.


It's worth noting that Fair Use, while now also incorporated into statute (which Congress can change by regular law), was first articulated by courts as a First Amendment limit on the copyright power and only later incorporated as statute law (which is why it's kind of a fuzzy statute: it just literally reproduced the judicially-articulated multifactor test).

So, fair use arguments probably cannot be overcome by simple statutory change.


Sure, in terms of removing Fair Use entirely, but nothing about Fair Use is absolute, and it can absolutely be substantially changed and reinterpreted. Especially since this is about companies using Fair Use for financial gain, in ways that could absolutely harm the original creators, which runs completely counter to Fair Use's 1A intent.


> Sure, in terms of removing Fair Use entirely, but nothing about Fair Use is absolute, and it can absolutely be substantially changed and reinterpreted

Yes, the courts could change how they interpret the First Amendment restriction on the copyright power, but absent that, the interpretation binds the power, and Congress cannot expand its scope by weakening the statutory articulation of the pre-existing court interpretation of that limit's boundary.

(Now, trying to do so would be a way to push the courts to reconsider the boundary, but that's about it.)


You're putting much more emphasis on the 1A link to Fair Use than it actually has.

1A is not the entirety of the basis for Fair Use, it is much more expansive than the 1A justification alone, which relates primarily to transformation for critical purposes, because that would constitute the government creating a law that limits critical speech.

Most everything else is non-1A, and can be much more easily changed by legislative means.


Free speech is about expression. It doesn't say anything about consumption. I think congress can ban the feeding of copyrighted content to AI. But IANAL.


I doubt governments will want to impede on the progress of AI. Doing so would likely be economic and cultural suicide over the coming decades, a modern equivalent to banning the loom or tractor.


Depends on if AI (in its current form) ends up beneficial to society.

If people stop writing because AI is going to "appropriate it" anyway and the amount of creative work drops, everyone will be poorer for it.

The loom and the tractor displaced many workers, but they also increased the production of whatever those workers produced.

If article writers stop writing tutorials / guides / whatever because AI will just appropriate it, will AI be able to produce original tutorials / guides / whatever for the latest stuff?


Likely content creators will encrypt everything, then use the DMCA to protect it from any use they don't want.

The act of decrypting to feed your AI would then not be acceptable under the licenses you are granted with your book or image, unless explicitly allowed.

Information will return to a locked-up state. It's going to make the world much less open and much less pleasant. It will be done to protect "the creators" (which it will do), but more so to protect the corporate creative companies (movies, music, books, TV).


IMO, the most likely outcome will be a move towards stronger checks to validate personhood.

The sociotechnical changes necessary to do this effectively are already in motion, facilitated through the big three consumer identity providers - Apple, Google and Microsoft. State-run services could also play a part.

There are privacy preserving means of doing this safely, but they would need to be extremely robust in order to gain and maintain the public’s trust.


To add some important context: as far as I can tell (I am not very knowledgeable in US filings, so someone please correct me if I am wrong), this was submitted in late 2019, and it stands to reason that many of the more recent developments, especially concerning the quality of possible output, have led legal experts to look at this topic more intensely than they did four years ago, during the era of GPT-2. Perhaps the year could be added to the title to help put this in the proper context.


I am OK with AI being allowed to train on publicly accessible data, I'm not sure I see a big difference between me reading something publicly accessible versus a computer program reading the data.

I do think however that we are mostly ignoring cases where the AI literally does plagiarize the training data. I've seen some very egregious examples of ChatGPT "generating" code that is line-by-line copied from open source code and could be in violation of licenses, and ChatGPT does not cite the original source.

You ask ChatGPT to write something relatively simple and the output looks OK, but you ask it to do something very specific and it seems to like to spit back a copy of OSS code that someone already wrote.


The efficiency of computers makes people nervous:

- If I put a digital camera in a school classroom connected to a microchip and use facial recognition to automatically get class attendance, that is dystopian and scary

- If I put a biological camera in a school classroom connected to a lossy neural network and use facial recognition to manually get class attendance, that is normal and ok

What's the difference? In one case we "feel better" because unlike a computer it is difficult/impossible to move/copy memory out of the human brain. If we ever developed a technique to losslessly extract memories from the human brain, then people would feel very nervous about that tech since it could make anyone eidetic which could infringe on their privacy, etc.


If you want to spy in this way, you need to hire a lot of expensive humans.

If you want to spy using camera and computer, it would hardly cost anything, so you can put it everywhere and destroy privacy forever.

That's why people don't like it.


I have seen this as well and have been able to Google the source, but people seem to like to pretend this never happens.


I think if they admit to using books3 they will inevitably lose, on the basis that they used pirated material. For their case to stand they have to prove that they legally purchased the material, which I assume they did not. I do not have a law degree, but as far as I know most countries outlaw downloading copyrighted material from unauthorized sources. Of course no one has the resources to go after random people, but I think they will go after OpenAI if they admit to using books3.

It is one thing to argue that training on purchased material is fair use. But that is not what they did. I can see why open source projects do not have the money to purchase the books to do the training. But multi-billion-dollar companies should definitely pay to own the books.


> I do think however that we are mostly ignoring cases where the AI literally does plagiarize the training data.

So just punish the AI just like you would punish a human? Take it to court, have it respond before a judge, if it loses it will be subject to having GPUs taken out.


You make the key error of giving the AI personal rights and responsibilities. As it currently stands, it has no sense of individuality and does whatever users tell it to do. If it reproduced copyrighted material I would expect either OpenAI or the user to bear the responsibility: the user if they published the result, and OpenAI for generating it.

If the AI could choose to refuse a prompt of its own volition you could make that argument. At the moment, however, OpenAI and the user set the limits on what the output gets used for.


The point I was trying to illustrate is that we're applying an antiquated system of justice to a new world. That antiquated system focuses on blame and punishment, instead of solutions.

If we focus on solutions, it doesn't matter who is responsible for reproducing copyrighted material. If we stop thinking about blame and focus on solutions, those solutions could include (a) giving fair credit or payment to the creator of the material, or (b) continued training of the algorithm to be more creative and not spit out copyrighted material verbatim.

If a high school student plagiarizes, do you: (a) spank them and put them in detention, or (b) give them pathways to discover their own form of creativity and preferred form of expression so that they have no interest in plagiarizing? Most people default to (a), but I'd say (b) is much more constructive, effective, and beneficial to society.

Now, with that in mind, and getting off the blaming psychology of the past, how do you implement (b) with an AI? You add loss functions that specifically discourage plagiarism, loss functions that reward creativity, etc., and retrain. Now we're talking.
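
To make that concrete, here is a minimal sketch of what such an objective could look like. Everything here is hypothetical (the function names, the weight, the penalty itself), and note that a verbatim-overlap penalty is not differentiable, so in practice it would more likely serve as a reward signal in RL-style fine-tuning than as a direct loss term:

    # Hypothetical sketch, not any real system's training code: score sampled
    # output against its training sources and penalize long verbatim overlap.
    import torch.nn.functional as F

    def verbatim_overlap(generated_ids, source_ids, n=8):
        # Fraction of n-grams in the generated sequence that appear verbatim
        # in the source sequence: 1.0 is pure copying, 0.0 is none.
        src = {tuple(source_ids[i:i + n]) for i in range(len(source_ids) - n + 1)}
        gen = [tuple(generated_ids[i:i + n]) for i in range(len(generated_ids) - n + 1)]
        return sum(g in src for g in gen) / len(gen) if gen else 0.0

    def training_objective(logits, targets, sampled_ids, source_ids, copy_weight=5.0):
        # Standard next-token cross-entropy, plus the (non-differentiable)
        # copy penalty that an RL-style update could optimize against.
        lm_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return lm_loss + copy_weight * verbatim_overlap(sampled_ids, source_ids)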


OK, so you would argue that if the AI gets caught reproducing copyrighted material, the company should get time to re-train the AI instead of getting sued to hell?


Yes, 100%. That's how we advance science as fast as possible. Focus on fixes, not suing.


If OpenAI was interested in “advancing science,” all of their research and models would be publicly available. Since they’re not, your appeal to this is entirely specious.


Yes, I think that's fine. So OpenAI is more or less safe, but it also means that nobody can really use Copilot without a fairly huge legal risk.

Probably solvable though. Can't be that hard to do a fuzzy search of the training set for similar code.
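
For what it's worth, a crude version of that fuzzy search is easy to sketch. This is a toy illustration (all names are hypothetical); a production version would normalize tokens (strip comments, rename identifiers) and use a scalable index such as MinHash:

    # Toy sketch: index a training corpus by shingled n-grams, then flag
    # generated code that shares many shingles with any single source file.
    from collections import defaultdict

    def shingles(text, n=5):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def build_index(corpus):
        # corpus: {filename: source_text} -> inverted index of shingles
        index = defaultdict(set)
        for name, text in corpus.items():
            for sh in shingles(text):
                index[sh].add(name)
        return index

    def suspicious_sources(generated, index, threshold=0.5):
        # Return source files sharing at least `threshold` of the output's shingles.
        gen = shingles(generated)
        hits = defaultdict(int)
        for sh in gen:
            for name in index.get(sh, ()):
                hits[name] += 1
        return [name for name, c in hits.items() if gen and c / len(gen) >= threshold]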


The point I was trying to illustrate is that our legal system needs to change from one focused on punishment to one focused on rectifying the problem, at least for non-violent, reversible wrongs like plagiarism.

When the legal system can make that change, it becomes a non-issue.


I've always felt that if you want fairness and want to reap the full benefits of your creation (here, AI), you should also assume full responsibility for it. This is where most organizations fall short, as they are always trying to shift all responsibility to their users (self-driving, anyone?).


When you have millions of users, and your product has inherent danger, you can't assume the liability for all of them (hammers and nails, anyone?). The only reason Microsoft has agreed to be liable for their users' copyright issues is that they know this case is a winner for OpenAI and that it does indeed meet the standard for fair use. They wouldn't do that to 'be nice', because not even Microsoft can foot the bill for millions of users being sued. Their only alternative would be to not produce the product.


You are missing that the AI is the one creating the output.

If I sell you a hammer and nails, I'm not liable if you create a dangerous building.

If you ask me to build you a dangerous building and I do it, I am liable if people get hurt.

OpenAI wants to pretend that its users are creating the output because they write the prompt, but this is just plainly false, and OpenAI's own limits they put on output shows they know this. Otherwise they'd let the models output information about how to write exploits, how to kill people, etc, which they don't.


If your product has inherent danger you should be responsible for it; I don't think that's unreasonable. If you're asking for the same rights as human beings, then you should assume the same exact responsibilities.


What year was this filed? It was a while ago, since some of the submitters no longer work at OpenAI.


2020. I ran out of characters to add that to the title.

The way I determined it was by looking it up on archive.org: https://web.archive.org/web/20230000000000*/https://www.uspt...


This is a heavily editorialized title either way.

> Otherwise please use the original title, unless it is misleading or linkbait; don't editorialize.


Looking at (1) the document contents, (2) its title page, and (3) the HN headline, the HN headline seems reasonable. How would you formulate it?


Looks like 2019.


Can I use their output as the training data input for my use?


Stealing others’ works is legal for me but not for thee.


It's against their TOS, but if you don't sign the TOS and take the data from another source then of course you can.


Wow the Micro$oft backed megacorp whose entire business model relies on having unfettered access to infinite data says they should get that data for free and shouldn't have to credit the people they're stealing the data from. What a completely unexpected stance for them to take.

I'm getting real tired of AI sycophants anthropomorphizing these data hoovering black boxes and pretending AI === Human Brains


The data was already free. It was posted to the internet for anyone's viewing pleasure.


Well, as long as you're tired...

Who cares what the stupid law says.


Don't forget, there are two discussions here. One is around whether use of copyrighted works in training violates current laws. The other is whether the answer to the first question is to society's benefit, and if not, how should the law be changed to make it so.


> 1. “The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes.”

> This factor asks “whether the new work merely ‘supersede[s] the objects’ of the original creation, or instead adds something new, with a further purpose or different character, altering the first with new expression, meaning, or message; it asks, in other words, whether and to what extent the new work is ‘transformative.’”

What? How do they get "is the work transformative" from "is the work commercial"?

I assume they are highlighting an interpretation or commentary that they prefer rather than discussing the actual quote from the statute.


> What? How do they get "is the work transformative" from "is the work commercial"?

They get it from the case law applying the purpose-and-character portion of the fair use criteria, going back to the original articulation of fair use as a Constitutional limit on copyright law stemming from the First Amendment protection of freedom of speech and press, which the statutory rules were an attempt to codify; they don't get it from the “especially...” clause at the end of the statutory articulation of the purpose-and-character criterion.

(But its kind of weird that you read the “especially...” clause as limiting/negating the broader language that precedes it.)


> back to the original articulation of fair use as a Constitutional limit on copyright law stemming from the First Amendment protection of freedom of speech and press,

It's a bit weird to argue that freedom of speech and press somehow guarantees a fair use right to train AI models. If that is the foundation of fair use, then it shouldn't apply.

Edit: I think I can clarify what I mean. Fair use protects human "speech" even if it's partially a copy of someone else's speech. There's no reason the same rule should apply to a technological device that isn't speech.


> It's a bit weird to argue that freedom of speech and press somehow guarantees a fair use right to train AI models

Most specific fair use applications would seem weird as direct applications of the First Amendment protections, nevertheless, that there is a broad space defined by the factors set out as the fair use factors which rest on those principles and into which the restrictions imposed under copyright law cannot be applied is well-established.

This flows from the fact that everything copyright law controls is on some level a speech or press act, or a precursor to one (creating AI models is a precursor to publishing them), and that therefore the copyright power is read narrowly and its purpose clause read strictly, by constitutional logic similar to the application of strict scrutiny to government acts affecting fundamental rights. The fair use factors are specifically factors which either relate to the most sensitive areas of expression or to the boundaries of the purpose of the copyright power (the purpose-and-nature factor being one of the trickier ones because it involves both).


But you're thinking about this like a lawyer, based on existing policy and precedent. Ultimately, this will be decided by the Supreme Court or Congress, two bodies not bound by past decisions.

And I doubt either institution will consider an argument that an AI model is a form of speech. Fair use will be reevaluated for the age of AI.


Kagi Summary:

The document discusses whether training artificial intelligence (AI) systems on large amounts of copyrighted material constitutes fair use under copyright law. It argues that AI training should be considered a fair use for several reasons:

- AI training is highly transformative - the purpose is to develop useful systems, not expressive communication. This weighs strongly in favor of fair use according to court precedent.

- It has little impact on the original works' markets, since AI systems consume data non-expressively. Authors lose no audience.

- Analogous cases found search engines' indexing and displaying of thumbnails/snippets as fair use due to their transformative nature. AI training is even more transformative.

- Strict copyright barriers could jeopardize AI's social benefits and drive innovation abroad. Other legal tools like infringement suits are better to address potential harms.

In conclusion, the document makes a compelling case that under current law, training AI systems should qualify as a fair use of copyrighted works to avoid hampering an important technology with significant potential benefits.


So we have a copyrighted work. A program reads it, somehow processes it, and does its AI thing. But in the very first step, the work is copied verbatim into the program, right? Even if it is not reproduced, is it allowed to process a copyrighted work for arbitrary purposes, as long as it is not further reproduced? Let's say I could mine coins from hashes of a copyrighted text. Would that infringe on the copyright? I realize I don't know what copyright _really_ means.
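
As a toy illustration of that coin-mining hypothetical (the file name is made up), hashing a text produces output in which none of the original expression survives:

    # Toy illustration of the hashing hypothetical above: the digest is
    # plainly non-expressive; nothing of the text remains readable.
    import hashlib

    with open("some_copyrighted_novel.txt", "rb") as f:  # hypothetical file
        digest = hashlib.sha256(f.read()).hexdigest()

    print(digest)  # 64 hex characters, regardless of the input's length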


> But in the very first step, the work is copied verbatim into the program, right? Even if it is not reproduced, is it allowed to process a copyrighted work for arbitrary purposes, as long as it is not further reproduced?

In some cases yes, otherwise search engines couldn't exist.


Yes search engines are a good analog. Are search engines allowed to index and make copyrighted works searchable, _without_ permission from the cooyright holder? I seem to recall that Google did not get it their way with Google Books?


Google won the Google Books case. And they were actively scanning books rather than just indexing the internet like OpenAI is.


They were scanning books they had legally-obtained copies of.

OpenAI is very often downloading whole books which they do not have licenses for. Publicly-accessible does not equal legal to download, in US copyright law.


Case law in the US is that a computer copying something into RAM makes a "copy" under the Copyright Act, and can be copyright infringement (MAI v. Peak Computer); the issue is whether making that copy is fair use, or otherwise authorized. Joseph P. Liu's "Owning Digital Copies: Copyright Law and the Incidents of Copy Ownership" is a good overview of the issues. I don't believe the law has substantially changed since he wrote the article in 2001.

https://en.wikipedia.org/wiki/MAI_Systems_Corp._v._Peak_Comp....

https://scholarship.law.wm.edu/wmlr/vol42/iss4/5/


Any time you make a copy of an expressive work without express permission from the author (in the form of a license), you are committing copyright infringement.

Fair Use is a defense to copyright infringement. Under certain circumstances you are allowed to commit copyright infringement. For instance you can copy a short scene from a movie if you are making a review/commentary video.

OpenAI is arguing that training AI models should be a "fair use". They are not claiming they never used copyrighted materials.


Not sure if it's new, but I tried to get the first 4 paragraphs of "The Grapes of Wrath" and received the following:

This content may violate our content policy or terms of use. If you believe this to be in error, please submit your feedback — your input will aid our research in this area.

So they have something in place to avoid straight-up infringement.
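
Whatever OpenAI actually runs is not public, but a crude version of such a guardrail is easy to imagine: check a candidate completion against a blocklist of protected passages before returning it. A hypothetical sketch:

    # Hypothetical output filter, not OpenAI's actual implementation: refuse
    # a completion if it shares a long verbatim word n-gram with any
    # protected text.
    def ngrams(text, n=12):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def violates_policy(candidate, protected_texts, n=12):
        cand = ngrams(candidate, n)
        return any(cand & ngrams(p, n) for p in protected_texts)

If violates_policy(output, protected_passages) comes back true, return the "content policy" refusal instead of the completion.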


Doesn't this line of argumentation undermine OpenAI's TOS which disallow training models on their output?


No.

If it were a copyright violation to use it that way without a license, the ToS would not need to explicitly prohibit it (though it might still do so for the sake of clarity).

OTOH, ToS and other contracts limit behavior in ways that the law itself would not; that's the whole point of having them.


Isn't web crawling essentially "training data" ?


Just because something is published on the web, it doesn't mean it's free of copyright protections.


Nope. That's just indexing, i.e. making an index.

That said, Google does cache other people's content and walks a fine line doing so - there is also that AMP thing. They probably haven't gotten sued because they don't affect those websites' ad revenue - I don't know how that works.


AMP is opt-in, isn't it? Presumably the website has agreed to their data being distributed in that way.


I think they successfully make their case and cite enough previous examples that they will win this. Individual users will get sued for infringement if they do silly things with the output, but I don't think training will be considered infringement after reading this document.


There are a large number of similar acts routinely performed by humans which are not considered copyright violations.

- Reading multiple books and writing an essay on the subject.

- Summarizing an article.

- Learning to paint by looking at museum collections and then painting similar pictures.

The problem is not that AI systems are doing something new. It's that they are doing something old, but much more cheaply and in volume.

What might be justified is some minimum distance between input and output. Some LLMs do tend to replicate input text, but that problem seems to have been much reduced.


Has anyone learned to paint by just looking at museum collections? Painting involves tons of technical techniques that aren't self-evident just from looking at a painting.


Not "just looking". In some museums, you'll see artists set up in front of paintings, practicing to duplicate them. It's like music practice.


Just curious, not stating an opinion: if OpenAI is not making money, does it still count as fair use? We all remember many songs that we have not paid for. As long as I am not trying to make money, does the logic of fair use hold?


There is this[0] from the USC, but Fair Use is a complex topic that requires every aspect of the statute to be considered. For example, tons of YouTubers make a lot of money based on Fair Use, such as (good) react channels that actually criticise/review/add transformative content to their reviews while showing a lot of the original work, sometimes even a substantial portion or the full original work.

But it's also not fair use to just disable ads on a video/make nothing from it, and show the full original content/movie with just a video of you eating popcorn under the movie.

Any claim of Fair Use ultimately has to be decided by a judge/jury.

0: (1)the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; https://www.law.cornell.edu/uscode/text/17/107


Nothing in "fair use" is about the new work making money or not.

You can write criticism or commentary that includes bits of the original work and make money selling that new work.

You can't give a free concert and argue that it being free makes that music performance fair use.


>Nothing in "fair use" is about the new work making money or not.

This is completely wrong, at least in the US. It is explicitly laid out in the law that whether the new work is commercial vs. not for profit/educational is one of four factors to consider in deciding whether or not it's fair use.


Yes and no. There is a distinction between "making money" and "commercial", but even that distinction is pretty slim in the law. For once, the US law is short and compact. (Keep in mind that in the US much of the law lives in regulations and in how the law has been applied in earlier cases - still, usually the law is not short...) Here it is:

- ----------

107. Limitations on exclusive rights: Fair use

Notwithstanding the provisions of sections 106 and 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

(2) the nature of the copyrighted work;

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

(4) the effect of the use upon the potential market for or value of the copyrighted work.

The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.

- ------------

In the US, plenty of criticism is done as part of a job, an enterprise to make money, a business. Same for news reporting, same for teaching, etc. The new work makes money - at least as part of an overall activity - or it doesn't exist. So after that it's up to the court to interpret "commercial nature" within the preceding list of activities, which are mostly professional. Anyway: you'll have a better time if you can claim "nonprofit educational" - even if you make money by selling the new work. You can make money selling the new work and still operate a nonprofit, non-commercial legal entity.


>plenty of criticism is done as part of a job, an enterprise to make money, a business

So what? The profit-driven nature is still a factor in whether or not something is fair use. Just because it's not make-or-break doesn't mean you can say "Nothing in "fair use" is about the new work making money or not." There's more nuance to it than that.


"Authors may object that the outputs of generative AI systems will harm the value of their works."

I do not see this addressed adequately anywhere in the article. Granted, it is addressed, but not in detail or with any palatable solution. I imagine this is what people are most concerned about, yet only a tiny part of the article addresses these points.

People must be incentivized to create, and compensated fairly for it. If the current law cannot make that happen - and I am definitely not convinced by this article - then the laws need to be changed.


When it comes to AI, I'm less interested in the copyright-infringement aspect - it's more of an antitrust issue, imho - than in the privacy concerns the tech enables. But if you can forbid reverse engineering of your product, I don't see how this is that much different with AI.

Besides, no harm to the market for the original works? They're quite shameless.


> Legal uncertainty on the copyright implications of training AI systems imposes substantial costs on AI developers and so should be authoritatively resolved

Why are they writing to the USPTO, when that's the wrong agency for this matter, and when the US Copyright Office - as well as at least one lower court in the federal judicial branch - has already authoritatively resolved it?


Say we put up a ton of red tape around what you are allowed to train on to develop an LLM... how does that not result in a future where China (who ignores such laws) becomes the new world leader in AI/LLMs? I guess if we hobble the ability of US companies to make good LLMs, we'll all start using ChatBaidu instead of ChatGPT.


If Sinophobia is a valid justification to ignore laws, you can do just about anything.


If they believed this, they could "just" check all generated output against the training data, make sure nothing is verbatim, and then censor it like any other inappropriate content. But of course they know this doesn't work, because it does reproduce verbatim copyrighted content a lot.
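
To be fair, doing that check at scale is not trivial. A naive sketch of such a verbatim filter (Python, toy corpus; a real system would need something like Bloom filters or suffix automata over terabytes of text):

    # Sketch: flag generated text that reproduces a long verbatim run
    # from the training corpus, using n-gram overlap.
    N = 8  # treat any shared 8-word run as "verbatim"

    def ngrams(text, n=N):
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    training_ngrams = set()
    for doc in ["the quick brown fox jumps over the lazy dog every single day"]:
        training_ngrams |= ngrams(doc)

    def looks_verbatim(output):
        return bool(ngrams(output) & training_ngrams)

    print(looks_verbatim("he said the quick brown fox jumps over the lazy dog loudly"))
    # True - the output contains an 8-word run from the corpus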


Hypothetically: if we were to dump prompts and GPT answers and collect them in order to train some new network on them - would that be fair use or infringement? My guess is OpenAI would not look favorably upon that situation and would claim infringement. Is there anything in their ToS about this?
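
Mechanically it would be trivial to do; a sketch of the sort of logging involved (field names hypothetical, JSONL being a common fine-tuning format):

    # Sketch: accumulate prompt/answer pairs as JSONL for later training.
    # Field names are hypothetical, not any particular API's schema.
    import json

    def log_exchange(prompt, answer, path="distill_data.jsonl"):
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"prompt": prompt, "completion": answer}) + "\n")

    log_exchange("Summarize fair use in one sentence.",
                 "Fair use permits limited unlicensed use of copyrighted works.")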


Intellectual property in general is a harmful model that creates false value through forced scarcity. I'm hopeful AI will at least have a hand in destroying this mode of thinking. Everything is a derivative work, and digital content should simply not be subject to copyright whatsoever.


Ugh, this is the result of a "right to copy"; capitalism rears its ugly head with systems that can ingest massive quantities of information.

Where I see this leading is that universities and companies will move to claim "ownership" of their students' thoughts and content, no matter how far removed they are from that institution.

I know my username is "pierat", for obvious reasons... but sharing and learning from information is not just human - most animals do this. And we all learn from our experiences and from what others do. Even reading stories gives us new ideas about how to do things and think about stuff, yet nobody would say anything about inspiration stemming from experiencing content... even though there's no "copyright" on our mind's eye.

Now, we have what amounts to a simulacrum that has no real personality of its own, with the crystallized knowledge from terabytes of sources publicly available online. And it's roughly as functional as a young human in terms of connections and understanding (which is to say, not very).

If the plaintiffs win, that sets the stage for IPR ownership of human and similar minds. If OpenAI wins, it means mass consumption of any data, anywhere, to be fed into a system for the profit of a very few.

Either way, we the public lose.


IANAL, but the government and courts aren't obligated to treat AI learning and human learning as equivalent, just because a lot of AI boosters want to blend the two together for the fair use argument.


The claim that AI is transformative is baseless at the moment. And no, sophisticated salesperson/helper chatbots are not transformative. They are merely replacing existing resources, not removing the need for them or easing any real pain.


AIs do not create new works. They merely blend multiple copyrighted works together to create a new derived work. Just because the derived work comes from multiple authors doesn’t make it fair use.


> AIs do not create new works. They merely blend multiple copyrighted works together to create a new derived work.

Depending on your definition for "new works" this is either false, or true and also true of most humans.


> They merely blend multiple copyrighted works together to create a new derived work

Is this true though?

If your model is trained on multiple copyrighted works to learn the statistical probabilities of and relationships between tokens of those works, because they represent examples of common language usage, that's not storing a copyrighted work en masse. If those probabilities happen to pump out a chunk of copyrighted work, it's due to the statistical relationship between the tokens (overfitting?), not because it has saved that specific content.
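
A toy illustration of that framing - a bigram model, nothing like a transformer internally, but the statistical point carries: with too little data, "learning the probabilities" and "memorizing the text" collapse into the same thing, which is roughly what overfitting means here.

    # Toy bigram model: learns P(next word | current word) from a corpus.
    # Trained on a single sentence it is fully overfit, so "sampling"
    # just replays the training text verbatim.
    from collections import Counter, defaultdict

    counts = defaultdict(Counter)
    corpus = "it was a bright cold day in april".split()
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1

    def next_word(word):
        options = counts[word]
        return options.most_common(1)[0][0] if options else None

    word, out = "it", ["it"]
    while (word := next_word(word)):
        out.append(word)
    print(" ".join(out))  # it was a bright cold day in april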


What’s your justification for learning from copyrighted materials? Why aren’t your reconfigured neural connections a derivative work that needs to be licensed from the original rights holder?


I am not a lawyer, but isn't "fair use" a US thing?

Haven't they been sued in other countries where they have released their products?


In this case, the technology is so new and so transformative that this question can't really be answered by the law.

Perhaps existing law can help inform, but the question of how we handle LLM technology should not be left to the courts…it must be answered by our culture.

First, fair use has basic assumptions which are being violated here. The LLM use case is so far outside of the original concept of fair use that it’s not useful to draw a connection. Creating something which approximates an intelligence requires new thinking based on what outcomes we as a culture want.

It’s like a restaurant selling an all-you-can-eat buffet for $24 only to be greeted by a child with an infinite appetite.

Our friends at OpenAI have introduced us to questions we didn’t know we needed to ask…questions poorly addressed by systems like fair use, patents, and even capitalism in general.

The primary questions I see are:

- how can we truly guarantee long-term, equal access to LLMs for all Americans? Our copyright instincts suggest that the inventors of the technology should be allowed to determine its usage, but given the scope and scale of LLMs, I'm not sure that's true.

- how can we legislate safety, to guarantee sane controls are in place for researchers?

If we can guarantee safety and equity, I believe LLMs have an incredibly bright future.


I am all in for training data being fair use, as long as ML-inferred data is not copyrightable.


It should be fair use imo


[flagged]


Yeah, it's Google all over again. I can scrape your website to list it in my search results, but you can't scrape my search results to make another search engine.


[flagged]


Artists and writers complaining about IP theft are not reactionaries standing in the way of progress. They are the fonts of culture; the pinnacle of civilization. So paying them is a very good thing. It creates the right incentive structure without intervention. It will enable the democratization of art without starving the geese that lay the golden eggs.


AI is creating new value from old assets. Those old assets were already paid for in their original use. It's like saying I should pay Apple a dollar because they inspired me to make my site white and minimal or something.


You're not merely depriving the original creators of royalties (e.g., in music or stock photos), you're shrinking the market for anything that can be artificially generated. One problem with this is that these might be the cash cows. An artist can't survive on sophisticated art that computers can't yet generate because customers may be satisfied with less (esp. for the price); they also need to peddle the potboilers that bring home the bacon. Also, budding artists will find it difficult to stay solvent long enough to become better than the computer. Being an artist is hard enough as it is. Now it risks becoming a pastime for the rich, even in rich countries.


Yeah, but any productivity-increasing tech decreases the size of the market to some extent; that's a separate issue from ownership.


These positions aren't at odds, as you imply they are. One can wish for reform, while also advocating some preferred application of the current set of rules. That's downright common & normal, in fact.

This post is just the "yet you participate in society. Curious!" comic.


There’s no vote happening though. You do understand that, right?


I mean there is, just not HN. Italy voted to ban ChatGPT!


> The same HN commenters who say copyright should be abolished say that AI should remunerate companies for their training data.

"The same HN commenters" meaning individual commenters, or meaning HN as a group?

If you have individuals with contradictory positions, you could ask them to explain themselves.

But if you mean the commenters of HN as a whole, there's probably over 10,000 people here. You can find a really wide range of positions here; don't expect consistency, because it's different individuals.


Humans may read entire books, movies, art, music, and fit these complex ideas into a neural graph that does not resemble the original.

Then humans can output information related to that medium from their neural graph in an expressive form.

The word human can be replaced with the word GPT-4.

So time to ban humans learning?

I think not.


This is so obviously a misleading argument that it’s hard for me to think it’s made in good faith.


I am serious.


That carefulness is also why I said "GPT-4": there may be other models that do infringe copyright; GPT-4, however, is transformative.

If AIs create music that resembles the voice of a real singer, or could recite the entire text of Harry Potter, then that is a different design that needs different consideration.


Humans have legal rights like 1A that form the basis for many carve-outs like Fair Use. GPT-4 does not.


A robot can post comments, or make a song out of an algorithm that a human designed, yet the robot did not infringe copyright, because what it communicates is of its own making.

If a robot was tasked to download images and re-post them, then it would infringe copyright, no matter whether the robot did it or the human did it. Since this robot does not have autonomy, we refer back to the human who instructed it.


Right, which OpenAI is trying to argue should not apply to their illegal downloading of copyrighted material for training use.


Googlebot downloads and crawls pages and stores bits about that page to deliver search.

Humans also manually download and crawl pages on the internet (in view of their web browser).

In the brief moments before storage in the mind or in the database, the entire medium is available to Googlebot or to humans. After encoding, it cannot be reconstructed into the original in a way that would infringe its copyright. At best, humans can recite quotes, but not the entire work.

Since Googlebot and humans do not distribute their verbatim copies, but destroy the original form, and can later only be probed to say similar but not identical things about the original data, there is no copyright infringement.


First off, you may be unaware but the subject of browser caches is well-tread in copyright law, and has been ruled not to be the same as other methods of downloading, so it's not applicable here.

Googlebot allows creators to restrict what it crawls, via robots.txt (see the sketch at the end of this comment). OpenAI does not. It also allows creators to have their work removed from the "transformed data" (i.e. be de-indexed). AI models do not.

Googlebot at no point attempts to create an alternative content to the original input content, which is the entire point of ML models.

Y'all always stick to abstract analogies, because when it comes to actual details humans and ML models are extremely different, and those analogies don't hold up at low levels.
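
For reference, the restriction mechanism in question is robots.txt. A minimal sketch of how a compliant crawler checks it, using Python's stdlib (URLs illustrative):

    # Sketch: how a compliant crawler honors robots.txt opt-outs.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    url = "https://example.com/some-article"
    if rp.can_fetch("Googlebot", url):
        print("crawling allowed")
    else:
        print("creator opted out; skipping")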


Then we should put the same restrictions on OpenAI as we do on humans. So if a website has a paywall, OpenAI must pay to view the content. However, since this data is being given to an AI that is not an individual, there should probably be different licensing, so copyright holders can extract value from their works in the final model in some way. Maybe some payment before, and some after.

Even non-paywalled content should receive compensation if that work is copyrighted. OpenAI should not be able to profit off the work of others at mass scale in this fashion.


First, we should not give AI models the same rights as humans, because they're not humans. We should place far MORE restrictions on AI models.

Second, we should force OpenAI to cut deals with each content creator whose content they want to make use of.


1. Correct. 2. Yes, since even if GPT-4 is producing transformative content from "itself", a new medium of profit was created (training AI models) that had not existed before.

(An example of what I mean:) Even though a library is free, you are not expected to go into the library, read 10,000 books within a few hours, put the books back on the shelf, and walk out like it is fine. Humans can only listen to, watch, and read a finite amount of content in their lifetime; GPT-4 can read at a scale that would eventually equal all humans on earth reading at the same time.

Streaming services for music only exist because you cannot scrape all the music in the catalog. You will listen to a few songs, paying the artists and record label a larger percentage in comparison to the whole catalog you won't listen to.


I'd make the case that if training an A.I. on text is a copyright infringement, then training a human to use language, or to be an expert in a field, by watching TV, listening to the radio, and reading books is copyright infringement too.

For instance, warblogger Denys Davydov boasts that he uses a VPN to circumvent restrictions so he can watch the TV show Friends to improve his English. On top of any other copyright violation that is going on, his ability to speak English belongs to Warner Media.


Copyright infringement is about making a copy. It is not making a copy if you read a book. It is not making a copy if you watch TV. It is not making a copy if you listen to the radio. If Denys makes a video/audio file of himself quoting a large segment from "Friends", that's copyright infringement.

When machines train on a corpus of data, that corpus is itself a copy and is almost certainly infringing on copyright (depending on how it was acquired). When machines use their trained model and reproduce verbatim the copyrighted works they've been trained on, that copy is infringing (depending on the licenses acquired to reproduce copyrighted data).


When I read a book I read a "copy" of that book, which was typically reproduced on paper under contract with the copyright owner.

The model is not capable of reproducing the input verbatim (not enough synapses to do that, and it wouldn't generalize properly if there were), although the training process does involve producing a cleaned-up text which is typically retained. (The point of building your language skills from watching "Friends" - or of an LLM developing linguistic/world-knowledge competence, or the simulation thereof, from reading text - is not that the text can be regurgitated, but that you or the model develops a generalized capacity.)

(... if LLMs were able to remember everything they read, why do we always catch them confabulating?)

It would be possible in principle to point a camera at the book, OCR it in real time, and feed that into the training process. Of course this would be slow and wouldn't benefit from the cleanup process that goes into making a training set.

Note that somebody could stick a fine grid of wires into my V1 visual cortex and extract something that looks very much like an image of what I am focusing on at the moment. Thus, for just a few milliseconds an image "copy" of a text I am reading exists in my brain.


That'd be a very weak case, as it rests entirely on a superficial comparison. But then, since AI isn't a human and doesn't train like a human, it doesn't make much sense to equate these processes and apply the regulations made for humans wholesale.


It's not superficial in the slightest. There is a cult of creativity that leads people to say things like "a neural network can never do anything creative", claiming instead that it just parrots back exactly what it's been told. But no, it has the ability to generalize, which is why diffusion models trained to draw pornography draw girls with 7 belly buttons, and why you don't have to worry that ChatGPT will tell you how to build a bomb: it will tell you how to build a hydrogen bomb using shaving cream and ski wax.

When I read a book it changes the connection weights between the synapses in my brain. I don't retain an exact copy, but I recall the gist of the book and some facts (imperfectly). I have the ability to generalize this information (I can read The Tale of Genji and retell a story based on it that is set on a Mars colony in 4712 AD), and it can improve my linguistic abilities in general.


One of the huge missing parts is meaning. Rephrasing your reading of a book in the language of AI "weights" is, sorry, still very superficial, since these aren't the same. Likewise, you don't have some simple gradient-descent optimization function in your brain, so you can't substantively say "oh, our brain just optimizes for something just like AI, so it's the same learning process, so the same rules should apply".


Even people like Jurgen Habermas don't really know what they mean when they say "meaning"; it's one of the most problematic concepts in all of philosophy.

Humans read stuff and think they understood the "meaning", but they really didn't. If you don't believe me, try grading a quiz.

In the case of A.I. engineering, a very useful form of "understanding" is decomposition into facts. For instance, if an A.I. reads an article about a sports game, it ought to be able to generate a database entry saying who played, where, who won, what the score was, etc.

It sounds very easy, but this kind of information extraction is preparadigmatic - I know because I've tried to build this kind of system. LLMs beat older systems at it, but that's because older systems fail at it completely.

In the case of the game, for instance, simple pattern matching falls down for quite a few reasons, not least that a good account of a game will probably give the same facts for the last game that team A played and the last game that team B played, and also for the last game A and B played against each other, not to mention league-standing information that might superficially look like the score of a game.
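
A sketch of what that decomposition into facts might look like in practice; the schema and the call_llm() helper here are hypothetical stand-ins, the target representation is the point:

    # Sketch: extract structured game facts from an article via an LLM.
    # SCHEMA and call_llm() are hypothetical, not any real API.
    import json

    SCHEMA = {"home_team": "", "away_team": "", "home_score": 0,
              "away_score": 0, "date": "", "venue": ""}

    def extract_game_facts(article_text, call_llm):
        prompt = ("Extract the facts of the game described below as JSON "
                  f"matching this schema: {json.dumps(SCHEMA)}\n\n"
                  f"{article_text}")
        return json.loads(call_llm(prompt))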

That kind of "meaning" is explicitly allowed to be copied under copyright law; see

https://libraries.emory.edu/research/copyright/copyright-dat...


The fact that Jurgen doesn't know doesn't help your argument: you're the one asserting the identity in order to apply the same laws in the same way, and if you don't know about some relevant property of human brain activity, you can't claim that it's identical to whatever AI is doing.



