"Fair use" doesn't generally include taking the author's entire work (or for that matter, his entire lifetime's corpus).
According to WP, there are four factors to consider:
1. Purpose, including whether it's commercial.
2. Nature of the work (e.g. fiction/non-fiction, published/unpublished)
3. How much was used
4. Whether the use impairs the author's ability to exploit his work
====
1. The use is definitely commercial.
2. This use consumes anything and everything.
3. This use consumes entire libraries.
4. It can diminish the market for the author's own work (observe authors complaining that GPT simulacra of their work are competing with their own publications on Amazon).
Each of these factors is to be weighed by the court; no one factor is decisive. But, for example, if you use the entire work, factor #3 weighs against you. It looks to me that all four factors weigh against OpenAI.
The models that are trained on these inputs absolutely do not contain all of the text that they're trained on. Even with non-quantized models, the final model is somewhere in the realm of 3% of the size of its training data.
Further, these models are not used to reproduce copyrighted works... they would be one of the least efficient ways to do so.
I also don't think there's a particularly strong argument that training on a book or image decreases the market for the original book or image. To take an example, zero folks that want to read Game of Thrones are asking ChatGPT to generate Game of Thrones for them instead of buying the book.
I think what's really happened here is that artists' work has utility outside of the originally intended audience, that utility largely comes from statistically analyzing the writing, and the debate is about whether or not the utility of those statistics requires additional compensation for the artists. That question seems to hinge on the intricacies of copyright law, but as I've stated before, I think the fair use argument is quite strong.
>I also don't think ... that training on a book or image decreases the market for the original book or image. ... zero folks that want to read Game of Thrones are asking ChatGPT to generate Game of Thrones ...
For now. But once you can generate work in the style of an artist without paying that artist, I would. More importantly, media production companies will, and they will use access to people's past work to put them out of work now.
We need to distribute the benefits; we're drifting towards copyright infringement being allowed for rich corporations (those with the resources to build "AI") as long as they abstract it through a transformer, or whatever.
> To take an example, zero folks that want to read Game of Thrones are asking ChatGPT to generate Game of Thrones for them instead of buying the book.
Why jump straight to one of the longest popular books in English? I'm sure plenty of people who would otherwise buy poetry anthologies are asking ChatGPT to write them a haiku. It's also entirely conceivable that ChatGPT can perfectly reproduce a poem in its training set - it can produce paragraph-sized quotes.
These models are generally trained on terabytes of data, but are usually tens of gigabytes in size (or much less if quantized). The latest true open-source model, Mistral 7B, is only 3GB (0.3% of a TB) when quantized.
I did a very similar analysis with Llama 65B: roughly 1.4T training tokens at ~4 characters per token, or about 5.6TB of text, compared against a quantized model size of ~38GB.
The 3% number was a conservative rounding of the same calculation, but retaining fp16 rather than quantizing to 4 bits.
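For concreteness, here is that back-of-the-envelope arithmetic as a sketch; the token count, characters-per-token, and bytes-per-parameter figures are rough public numbers, not exact values:

    # Rough comparison of training-corpus size vs. model size (approximate figures).
    TRAINING_TOKENS = 1.4e12   # Llama 65B: ~1.4T training tokens
    CHARS_PER_TOKEN = 4        # rough average for English text
    BYTES_PER_CHAR = 1         # ASCII-ish assumption

    training_bytes = TRAINING_TOKENS * CHARS_PER_TOKEN * BYTES_PER_CHAR  # ~5.6 TB

    PARAMS = 65e9              # 65B parameters
    fp16_bytes = PARAMS * 2    # 2 bytes per parameter -> ~130 GB
    int4_bytes = PARAMS * 0.5  # ~0.5 bytes per parameter when 4-bit quantized

    print(f"training text: {training_bytes / 1e12:.1f} TB")
    print(f"fp16 model:  {fp16_bytes / 1e9:.0f} GB = {fp16_bytes / training_bytes:.1%} of the text")
    print(f"4-bit model: {int4_bytes / 1e9:.0f} GB = {int4_bytes / training_bytes:.1%} of the text")

The fp16 ratio comes out to roughly 2.3%, which rounds up to the ~3% quoted above; the 4-bit ratio is well under 1%.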
Copyright doesn’t apply to consumption of a work, only redistribution. For example, it is not a violation of copyright for you to make many copies of a movie for personal use. Similarly, it is not a violation of copyright for an artist to learn by looking at other works, even if they then compete against the original work. OpenAI’s system is much more similar to calculating statistics from public artwork than copying and redistributing artwork.
The law is not clear; otherwise it wouldn't have had to go to the Supreme Court, which contrived the notion of "time-shifting" to allow for making personal copies.
Sure, which is why saying it's absolutely a violation isn't a correct interpretation. Neither is saying it's absolutely not a violation. Even if the Supreme Court rules on the issue, it's not absolutely anything, because that ruling only holds until a future Supreme Court potentially changes its mind.
I understand how my use of "absolutely" could be misleading. I'll attempt to clarify.
My understanding of the reality:
* Violations of those four rights give copyright holders standing to sue over duplication.
* Fair Use is a defense against that violation.
What I was trying to get at in my first post above is that it is a violation of copyright law as the law is written, and because of this, you expose yourself to lawsuits when you make copies for personal use (assuming anyone could ever find out... I'm not arguing, and never have argued, that copyright law is correct or good). The fact that you can defend yourself using a Fair Use defense is also important and a good point, but it still requires hiring a lawyer, going to court, and winning. In my book, this is quite a bit different from saying it's not a violation.
IANAL, though I follow this fairly closely, and I'm not aware of any cases that have established that copying for personal use is Fair Use; I'd be very happy to read through a decision that establishes this!
> What I was trying to get at in my first post above is that it is a violation of copyright law as the law is written
Copyright law “as the law is written” says fair use is not a violation. 17 USC § 107: “Notwithstanding the provisions of sections 106 and 106A, the fair use of a copyrighted work [...] is not an infringement of copyright.”
That's not true, except in the sense that “courts are the only ones that can determine that an act violates copyright law outside of fair use considerations” is also true.
It's true that fair use is often harder to predict than other aspects of copyright law, because the statute and precedent in combination are less clear, but that's not a categorical distinction.
> "Fair use" doesn't generally include taking the author's entire work (or for that matter, his entire lifetime's corpus).
If someone studies every one of Stephen King's books, studies King's style, studies his plots, etc. and uses the knowledge to learn how to be a writer, there is no need to invoke "fair use" unless they are actually copying King's books. Studying an author's corpus in order to be able to produce works like the author is not copyright infringement. It is just the way we learn...and the way LLMs learn as well.
If we want to say it isn't allowed because LLMs read too quickly, then we could do that. Maybe make them not read any faster than humans. But LLMs aren't doing anything different from humans other than the speed at which they do it.
If you read the PDF, you'd know that OpenAI admits they ARE actually copying his books in order to train on them, and they are arguing that this is okay because they don't want to have to obtain legal copies of the works, since that would be too expensive.
They are literally arguing they should be allowed to illegally download stuff so long as no human reads the downloaded work directly.
To further your analogy, it would be like your prospective writer objecting to being arrested for stealing the books from the bookstore because he just wanted to become a better writer by studying them, not read them for entertainment purposes.
If you are saying that they illegally obtained King's book, then the issue is that they stole the book from somewhere. I don't disagree with your assessment there. But if they checked the book out from the library and let the LLM read it, the fact that it has read it shouldn't be any different than if you had read it as a human.
If I break into the courthouse to let my child see the artwork in the public areas, the issue isn't whether my child can see, learn from, and copy the style of the artwork. The issue is that I broke into the courthouse instead of going in the front door when it was open to the public.
> if they checked the book out from the library and let the LLM read it
Did you know it's actually illegal to photocopy books you check out from the library? It's an illegal copy.
That aside, the analogies between humans and ML models fall apart when you actually get into the details, which is why AI companies don't want that.
Human brains don't have RAM and disk storage that keep direct copies of works. Computers do. Human brains are not entirely deterministic calculators. Computers are.
If you have all the state data of an AI model, you can know exactly what it will produce, and the input data is a necessary component of that state data, so it's not transformative in the same way as a human brain in the first place.
Two humans will not create the exact same output given a shared starting state in the way that two computers (AI models included) will.
The only reason people falsely believe AI models are different is that no one actually traces through that state data; there are enough pseudo-random variables and state changes involved to create the illusion of non-determinism, but it is, in point of fact, an illusion.
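To make that concrete, here is a minimal sketch (a toy stand-in, not any particular model's code) of why sampled "randomness" is reproducible: generation is driven by a pseudo-random number generator, and fixing its seed fixes the output.

    import random

    # Toy next-token distribution standing in for a model's softmax output.
    # The vocabulary and probabilities are made up for illustration.
    VOCAB = ["the", "cat", "sat", "on", "mat"]
    PROBS = [0.4, 0.2, 0.2, 0.1, 0.1]

    def generate(seed, length=8):
        """Sample a 'generation'; the only randomness is the seeded PRNG."""
        rng = random.Random(seed)
        return [rng.choices(VOCAB, weights=PROBS)[0] for _ in range(length)]

    # Identical state (seed) -> identical output, every time.
    assert generate(seed=42) == generate(seed=42)
    print(generate(seed=42))
    print(generate(seed=7))  # a different seed merely picks a different sample

Production inference stacks layer on more pseudo-randomness (temperature, top-k, sometimes non-deterministic GPU kernels), but given identical state and seeds the computation is reproducible in principle.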
The LLM is the output of the training system; the LLM doesn't do the reading, the trainer does. The LLM doesn't "contain" a copy of the book.
But some LLMs seem to be able to quote on demand from a wide range of literature; these models evidently do have some kind of memory, even if it's been distilled and blended and distilled again.
No, I am actually reading OpenAI's PDF linked here, where they say they make a copy of the online corpora to train on, right at the top of page 2:
> By analyzing large corpora (which necessarily involves first making copies of the data to be analyzed
Whether they retain that later is immaterial to whether copyright infringement occurred. Deleting a movie you illegally downloaded after you watch (train on) it doesn't make it legal.
I agree with you. What I'm trying to differentiate is between the way you obtain a book and whether training an LLM on a book that you own the rights to read should be considered copyright infringement.
Unless the license you have for the book includes rights for commercial usage, it should be illegal to train a model that is used for business, imo. In terms of individuals training models for their own personal use, I think that becomes much tougher to answer, but that's not really cropping up in the legal fights.
This reminds me of the Google Books lawsuit back in the 2010s. Google didn't ask permission from every copyright holder but was still allowed to scan books due to fair use.
The big difference is that Google legally obtained the copies of the books it scanned. OpenAI did not. Publicly accessible != legal to download/copy.
The alleged copyright infringement in Google's case was not about illegally obtaining the original works; it was about the snippets Google provided.
In OpenAI's case, it is also about the illegal obtainment (downloading) of the original works, which they admit to in their brief but argue shouldn't be illegal if they intend only to transform them. That element was not present in Authors Guild v. Google.
This very lawsuit is referenced in OpenAI's memo as precedent:
> Perhaps the most compelling case on point is Authors Guild v. Google. There, search engine Google digitally scanned “tens of millions of books” without their authors’ permission for inclusion in a searchable digital database.
You're not the only commenter that has pointed out to me that the model doesn't contain the training corpus; that the model is analogous to a human writer who learns to write by becoming a voracious reader.
I acknowledge that, and I think the analogy is fair. The problem is that the memory capacity of the model appears to be humongous; I've never played with ChatGPT, but I understand that it can quote accurately and at some length from a wide range of literature.
By the model, I mean the model behind ChatGPT. I suspect this "memory" amounts to having unimaginably many parameters, i.e. something like overfitting. I realize that's just one LLM, and there are other models that are neither generative nor language models, and that have to be much more restrained in what they consider a "parameter".
> Studying an author's corpus in order to be able to produce works like the author is not copyright infringement.
You didn't finish the analogy.
Not just produce "works like the author" but also be able to regenerate those original works word-for-word.
Is that copyright infringement? Suppose I have the entire Dark Tower series memorized word-for-word and reproduce passages or entire chapters on demand for others. It's just the way some people learn.
Well, yes. If you reproduce passages from a book (from memory or otherwise) and present them as your own, that is copyright infringement. We have lots of well-established law about how similar things can be without infringing.
LLMs that are just spitting out full copies of what they were trained on aren't functioning correctly or haven't been trained on enough content.
The copyright violation, if there is one, is committed by the person who extracts data from the model and uses or distributes it in ways not permitted by law.
Not by the person who feeds data into the model... and not by the model itself, for merely retaining an unrecognizable, unusable, highly-compressed and -transformed representation internally.
Seems simple enough. Any other judicial outcome will either halt the process of human intellectual evolution, or transfer a staggering amount of power to people who DGAF about US copyright law. So be careful what you wish for.