"Fair use" doesn't generally include taking the author's entire work (or for that matter, his entire lifetime's corpus).
According to WP, there are four factors to consider:
1. Purpose, including whether it's commercial.
2. Nature of the work (e.g. fiction/non-fiction, published/unpublished)
3. How much was used
4. Whether the use impairs the author's ability to exploit his work
====
1. The use is definitely commercial.
2. This use consumes anything and everything.
3. This use consumes entire libraries.
4. It can diminish the market for the author's own work (observe authors complaining that GPT simulacra of their work are competing with their own publications on Amazon).
Each of these factors is to be weighed by the court; no one factor is decisive. But, for example, if you use the entire work, factor #3 weighs against you. It looks to me that all four factors weigh against OpenAI.
The models that are trained on these inputs absolutely do not contain all of the text that they're trained on. Even with non-quantized models, the final model is somewhere in the realm of 3% of the size of its training data.
Further, these models are not used to reproduce copyrighted works... they would be one of the least efficient ways to do so.
I also don't think there's a particularly strong argument that training on a book or image decreases the market for the original book or image. To take an example, zero folks that want to read Game of Thrones are asking ChatGPT to generate Game of Thrones for them instead of buying the book.
I think what's really happened here is that artists' work has utility outside of the originally intended audience, that utility largely comes from statistically analyzing the writing, and the debate is about whether or not the utility of those statistics requires additional compensation for the artists. That question seems to hinge on the intricacies of copyright law, but as I've stated before, I think the fair use argument is quite strong.
>I also don't think ... that training on a book or image decreases the market for the original book or image. ... zero folks that want to read Game of Thrones are asking ChatGPT to generate Game of Thrones ...
For now. But once you can generate work in the style of an artist without paying that artist, I would. More importantly, media production companies will, and they will use access to people's past work to put them out of work now.
We need to distribute the benefits; we're drifting towards copyright infringement being allowed for rich corporations (those with the resources to build "AI") as long as they abstract it through a transformer, or whatever.
> To take an example, zero folks that want to read Game of Thrones are asking ChatGPT to generate Game of Thrones for them instead of buying the book.
Why jump straight to one of the longest popular books in English? I'm sure plenty of people who would otherwise buy poetry anthologies are asking ChatGPT to write them a haiku. It's also entirely conceivable that ChatGPT can perfectly reproduce a poem in its training set - it can produce paragraph-sized quotes.
These models are generally trained on terabytes of data, but are usually tens of gigabytes in size (or much less if quantized). The latest true open-source model, Mistral 7B, is only 3GB (0.3% of a TB) when quantized.
I did a very similar analysis with Llama 65B: roughly 1.4T training tokens at ~4 characters per token, or about 5.6TB of text, compared against a quantized model size of ~38GB.
The 3% number was a conservative rounding of the same calculation, but retaining fp16 rather than quantizing to 4 bits.
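For concreteness, here is that back-of-the-envelope arithmetic as a sketch; the token count, characters-per-token, and bytes-per-parameter figures are rough public numbers, not exact values:

    # Rough comparison of training-corpus size vs. model size (approximate figures).
    TRAINING_TOKENS = 1.4e12   # Llama 65B: ~1.4T training tokens
    CHARS_PER_TOKEN = 4        # rough average for English text
    BYTES_PER_CHAR = 1         # ASCII-ish assumption

    training_bytes = TRAINING_TOKENS * CHARS_PER_TOKEN * BYTES_PER_CHAR  # ~5.6 TB

    PARAMS = 65e9              # 65B parameters
    fp16_bytes = PARAMS * 2    # 2 bytes per parameter -> ~130 GB
    int4_bytes = PARAMS * 0.5  # ~0.5 bytes per parameter when 4-bit quantized

    print(f"training text: {training_bytes / 1e12:.1f} TB")
    print(f"fp16 model:  {fp16_bytes / 1e9:.0f} GB = {fp16_bytes / training_bytes:.1%} of the text")
    print(f"4-bit model: {int4_bytes / 1e9:.0f} GB = {int4_bytes / training_bytes:.1%} of the text")

The fp16 ratio comes out to roughly 2.3%, which rounds up to the ~3% quoted above; the 4-bit ratio is well under 1%.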
Copyright doesn’t apply to consumption of a work, only redistribution. For example, it is not a violation of copyright for you to make many copies of a movie for personal use. Similarly, it is not a violation of copyright for an artist to learn by looking at other works, even if they then compete against the original work. OpenAI’s system is much more similar to calculating statistics from public artwork than copying and redistributing artwork.
The law is not clear; otherwise it wouldn't have had to go to the Supreme Court, which contrived the notion of "time-shifting" to allow for making personal copies.
Sure, which is why saying it's absolutely a violation isn't a correct interpretation. Neither is saying it's absolutely not a violation. Even if the Supreme Court rules on the issue, it's not absolutely anything, because that ruling only holds until a future Supreme Court potentially changes its mind.
I understand how my use of "absolutely" could be misleading. I'll attempt to clarify.
My understanding of the reality:
* Violations of those four rights give copyright holders standing to sue over duplication.
* Fair Use is a defense against that violation.
What I was trying to get at in my first post above is that it is a violation of copyright law as the law is written, and because of this, you expose yourself to lawsuits when you make copies for personal use (assuming anyone could ever find out... I'm not arguing, and never have argued, that copyright law is correct or good). The fact that you can defend yourself using a Fair Use defense is also important and a good point, but it still requires hiring a lawyer, going to court, and winning. In my book, this is quite a bit different from saying it's not a violation.
IANAL, though I follow this fairly closely, and I'm not aware of any cases that have established that copying for personal use is Fair Use; I'd be very happy to read through a decision that establishes this!
> What I was trying to get at in my first post above is that it is a violation of copyright law as the law is written
Copyright law “as the law is written” says fair use is not a violation. 17 USC § 107: “Notwithstanding the provisions of sections 106 and 106A, the fair use of a copyrighted work [...] is not an infringement of copyright.”
That's not true, except in the sense that “courts are the only ones that can determine that an act violates copyright law outside of fair use considerations” is also true.
It's true that fair use is often harder to predict than other aspects of copyright law, because the statute and precedent in combination are less clear, but that's not a categorical distinction.
> "Fair use" doesn't generally include taking the author's entire work (or for that matter, his entire lifetime's corpus).
If someone studies every one of Stephen King's books, studies King's style, studies his plots, etc. and uses the knowledge to learn how to be a writer, there is no need to invoke "fair use" unless they are actually copying King's books. Studying an author's corpus in order to be able to produce works like the author is not copyright infringement. It is just the way we learn...and the way LLMs learn as well.
If we want to say it isn't allowed because LLMs read too quickly, then we could do that. Maybe make them not read any faster than humans. But LLMs aren't doing anything different from humans other than the speed at which they do it.
If you read the PDF, you'd know that OpenAI admits they ARE actually copying his books in order to train on them, and they are arguing that this is okay because they don't want to have to obtain legal copies of the works, since that would be too expensive.
They are literally arguing they should be allowed to illegally download stuff so long as no human reads the downloaded work directly.
To further your analogy, it would be like your prospective writer objecting to being arrested for stealing the books from the bookstore because he just wanted to become a better writer by studying them, not read them for entertainment purposes.
If you are saying that they illegally obtained King's book, then the issue is that they stole the book from somewhere. I don't disagree with your assessment there. But if they checked the book out from the library and let the LLM read it, the fact that it has read it shouldn't be any different than if you had read it as a human.
If I break into the courthouse to let my child see the artwork in the public areas, the issue isn't whether my child can see, learn from, and copy the style of the artwork. The issue is that I broke into the courthouse instead of going in the front door when it was open to the public.
> if they checked the book out from the library and let the LLM read it
Did you know it's actually illegal to photocopy books you check out from the library? It's an illegal copy.
That aside, the analogies between humans and ML models fall apart when you actually get into the details, which is why AI companies don't want that.
Human brains don't have RAM and disk storage that keep direct copies of works. Computers do. Human brains are not entirely deterministic calculators. Computers are.
If you have all the state data of an AI model, you can know exactly what it will produce, and the input data is a necessary component of that state data, so it's not transformative in the same way as a human brain in the first place.
Two humans will not create the exact same output given a shared starting state in the way that two computers (AI models included) will.
The only reason people falsely believe AI models are different is that no one actually traces through that state data; there are enough pseudo-random variables and state changes involved to create the illusion of non-determinism, but it is, in point of fact, an illusion.
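To make that concrete, here is a minimal sketch (a toy stand-in, not any particular model's code) of why sampled "randomness" is reproducible: generation is driven by a pseudo-random number generator, and fixing its seed fixes the output.

    import random

    # Toy next-token distribution standing in for a model's softmax output.
    # The vocabulary and probabilities are made up for illustration.
    VOCAB = ["the", "cat", "sat", "on", "mat"]
    PROBS = [0.4, 0.2, 0.2, 0.1, 0.1]

    def generate(seed, length=8):
        """Sample a 'generation'; the only randomness is the seeded PRNG."""
        rng = random.Random(seed)
        return [rng.choices(VOCAB, weights=PROBS)[0] for _ in range(length)]

    # Identical state (seed) -> identical output, every time.
    assert generate(seed=42) == generate(seed=42)
    print(generate(seed=42))
    print(generate(seed=7))  # a different seed merely picks a different sample

Production inference stacks layer on more pseudo-randomness (temperature, top-k, sometimes non-deterministic GPU kernels), but given identical state and seeds the computation is reproducible in principle.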
The LLM is the output of the training system; the LLM doesn't do the reading, the trainer does. The LLM doesn't "contain" a copy of the book.
But some LLMs seem to be able to quote on demand from a wide range of literature; these models evidently do have some kind of memory, even if it's been distilled and blended and distilled again.
No, I am actually reading OpenAI's PDF linked here, where they say they make a copy of the online corpora to train on, right at the top of page 2:
> By analyzing large corpora (which necessarily involves first making copies of the data to be analyzed
Whether they retain that later is immaterial to whether copyright infringement occurred. Deleting a movie you illegally downloaded after you watch (train on) it doesn't make it legal.
I agree with you. What I'm trying to differentiate is between the way you obtain a book and whether training an LLM on a book that you own the rights to read should be considered copyright infringement.
Unless the license you have for the book includes rights for commercial usage, it should be illegal to train a model that is used for business, imo. In terms of individuals training models for their own personal use, I think that becomes much tougher to answer, but that's not really cropping up in the legal fights.
This reminds me of the Google Books lawsuit back in the 2010s. Google didn't ask permission from every copyright holder but was still allowed to scan books due to fair use.
The big difference is that Google legally obtained the copies of the books it scanned. OpenAI did not. Publicly accessible != legal to download/copy.
The alleged copyright infringement in Google's case was not about illegally obtaining the original works; it was about the snippets Google provided.
In OpenAI's case, it is also about the illegal obtainment (downloading) of the original works, which they admit to in their brief but argue shouldn't be illegal if they intend only to transform them. That element was not present in Authors Guild v. Google.
This very lawsuit is referenced in OpenAI's memo as precedent:
> Perhaps the most compelling case on point is Authors Guild v. Google. There, search engine Google digitally scanned “tens of millions of books” without their authors’ permission for inclusion in a searchable digital database.
You're not the only commenter that has pointed out to me that the model doesn't contain the training corpus; that the model is analogous to a human writer who learns to write by becoming a voracious reader.
I acknowledge that, and I think the analogy is fair. The problem is that the memory capacity of the model appears to be humongous; I've never played with ChatGPT, but I understand that it can quote accurately and at some length from a wide range of literature.
By the model, I mean the model behind ChatGPT. I suspect this "memory" amounts to having unimaginably many parameters, i.e. something like overfitting. I realize that's just one LLM, and there are other models that are neither generative nor language models, and that have to be much more restrained in what they consider a "parameter".
> Studying an author's corpus in order to be able to produce works like the author is not copyright infringement.
You didn't finish the analogy.
Not just produce "works like the author" but also be able to regenerate those original works word-for-word.
Is that copyright infringement? Suppose I have the entire Dark Tower series memorized word-for-word and reproduce passages or entire chapters on demand for others. It's just the way some people learn.
Well, yes. If you reproduce passages from a book (from memory or otherwise) and present them as your own, that is copyright infringement. We have lots of well-established law about how similar things can be without infringing.
LLMs that are just spitting out full copies of what they were trained on aren't functioning correctly or haven't been trained on enough content.
The copyright violation, if there is one, is committed by the person who extracts data from the model and uses or distributes it in ways not permitted by law.
Not by the person who feeds data into the model... and not by the model itself, for merely retaining an unrecognizable, unusable, highly-compressed and -transformed representation internally.
Seems simple enough. Any other judicial outcome will either halt the process of human intellectual evolution, or transfer a staggering amount of power to people who DGAF about US copyright law. So be careful what you wish for.