
Based on past legal decisions, there is a clear argument that the test for fair use is whether a work is substantially different from another. You are allowed to write a book containing information you learned from another book. (Academia has its own threshold for plagiarism, which stands apart from the legal one.) The measure used in Gyles v Wilcox was whether the new work could substitute for the old, and Lord Hardwicke had the wisdom to defer to experts in the field as to the standard for accepting something as meaningfully changed.

Recent decisions such as Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith have walked a fine line here. I feel the Supreme Court got this one wrong, because the work is far more notable as a Warhol than as a copy of a photograph. Perhaps the substitution rule should be a two-way street: if the original work cannot substitute for the copy, then clearly the copy must be transformative.

An LLM generating a work verbatim might be an infringement of copyright (probably not), but distributing those verbatim works without a licence certainly would be. In either case, it is probably considered a failure of the model: OpenAI have certainly said that such reproductions shouldn't happen and that they consider it a failure mode when they do. I haven't seen similar statements from other model producers, but it would not surprise me if this were the standard sentiment.

Humans looking at works and producing things in a similar style is allowed; indeed, this is precisely what art movements are. The same transformative threshold applies. If you draw a cartoon mouse, that's OK, but if people look at it and say "it's Mickey Mouse", then it's not. If it's Mickey to Tiki Tu Meke, it clearly is Mickey, but it is also clearly transformative.

Models themselves are very clearly transformative. Copyright was conceived at a time when generated content was not considered possible, so the notion of the output of a transformative work being a non-transformative derivative of something else was never legally evaluated.



I think you may have something with that line of reasoning.

Unfortunately, the threshold for transformative status of fictional works is fairly high. Fan fiction and even reasonably distinct works with excessive inspiration are both copyright-infringing: https://en.wikipedia.org/wiki/Tanya_Grotter

> Models themselves are very clearly transformative.

A near word-for-word copy of large sections of a work seems nowhere near that threshold. An MP3 isn’t even close to a 1:1 copy of a piece of music, but the inherent differences are irrelevant; a neural network that contains information and allows its extraction looks a lot like lossy compression.

Models could easily be transformative, but the justification needs to go beyond "well, obviously they are".
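
To make the MP3 comparison concrete, here is a toy lossy codec in Python (a crude quantizer, nothing like the real MP3 format): no sample survives exactly, yet the reconstruction is unmistakably the same signal.

    # Toy lossy codec: quantize samples to 3 bits, then reconstruct.
    # Not how MP3 works; just the "lossy, yet the same work" point.
    import math

    signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(64)]

    LEVELS = 8  # 3 bits per sample
    encoded = [round((s + 1) / 2 * (LEVELS - 1)) for s in signal]  # "compress"
    decoded = [q / (LEVELS - 1) * 2 - 1 for q in encoded]          # "decompress"

    exact = sum(1 for s, d in zip(signal, decoded) if s == d)
    worst = max(abs(s - d) for s, d in zip(signal, decoded))
    # Typically zero samples survive exactly, and the worst-case error
    # is bounded by half a quantization step (1/7 here), yet the decoded
    # signal is still recognizably the same 440 Hz tone.
    print(exact, round(worst, 3))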


Models are not word-for-word copies of large sections of text. They are capable of emitting that text, though.

It would be interesting to look at what legal precedents were set regarding MP3s or other encodings. Is the encoding itself an infringement, or the decoding, or the distribution of a decodable form of a work?

There is also a distinction for a lossy encoding of a single work: things are clear when the encoded form serves no purpose other than to be decoded into that work. When the encoding acts as a bulk archive, does the responsibility shift to those who choose what to extract from the archive?
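
As a toy illustration of that shift (purely an analogy; this assumes nothing about how real models are trained), a character-level n-gram table stores only short shards and conditional counts, not the text as a file, yet greedy decoding still emits the training passage verbatim:

    # An n-gram table stores statistics, not the text itself, yet greedy
    # decoding regurgitates the training passage verbatim once each
    # 8-character context is unique. Toy analogy, not real LLM training.
    from collections import Counter, defaultdict

    text = ("It is a truth universally acknowledged, that a single man "
            "in possession of a good fortune, must be in want of a wife.")
    N = 8

    counts = defaultdict(Counter)
    for i in range(len(text) - N):
        counts[text[i:i + N]][text[i + N]] += 1

    def generate(prompt, length):
        out = prompt
        for _ in range(length):
            nxt = counts.get(out[-N:])
            if not nxt:
                break
            out += nxt.most_common(1)[0][0]  # greedy decoding
        return out

    print(generate(text[:N], len(text)) == text)  # True: verbatim output

The table holds only 8-character shards and counts, but the work comes back out whole; the question above is whether liability attaches to whoever builds the table or whoever calls generate().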


> When the encoding acts as a bulk archive, does the responsibility shift to those who choose what to extract from the archive?

If you take many gigabytes of, say, public domain music, and stick them on a flash drive with just one audio file that is an unlicensed copy of a copyrighted song, distributing that drive would constitute copyright infringement, quite obviously so. I don't see why it'd matter what else the model can produce, if it can produce that one thing verbatim by itself.

(If you could only prompt the model to regurgitate the original text with a framing of, say, critical analysis of said text around it, and not in any other context, then I think there would be a stronger fair use argument here.)


> Is the encoding itself an infringement

Barring a fair use exception, yes.

From what I’ve read, MP3s get the same treatment as cassette tapes, which were also lossy. It was 1:1 digital copies that represented some novelty, but that rarely matters.

I’m hesitant to comment on the rest of that. The ultimate question isn’t whether some difference exists but why that difference matters.


Training itself involves making infringing copies of protected works. Whether or not inference produces copyrighted material is almost beside the point.


It’s legal if it’s fair use, which is yet to be decided by the courts.


No it doesn’t? You can buy a digital copy of Harry Potter and use it for training. No infringement needed.


Only as long as it's not copied again during training. You can't make copies of your purchased digital copy for any reason other than archival.


Incidental copies during playback are also allowed. But none of these companies are paying for copies in the first place.


Copyright's fair use rules are tools designed to govern how humans use protected works in derived works. AI is not human use, so the rules are only coincidentally correct for AI use, where they apply at all.


If you take that approach to fair use, don't you open the door to the same argument for copyright itself?

How do you distinguish between a tool and the director of a tool? I doubt people would say that a person is immune to copyright or fair use rules because it was the pen that wrote the document, not the person.


I think it's a valid question. Suppose you have two LLMs interacting with each other in a loop, and one randomly prompts the other to reproduce the entire text of Harry Potter, which the other then does. However, the chat log isn't actually stored anywhere; it's just a transient artifact of the interaction, so no human ever sees it, nor can see it even in principle. Is it a copyright violation then? If it is, what are the damages?
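
A sketch of that setup (complete() here is a hypothetical stand-in for any LLM API, with canned replies so the sketch actually runs):

    # Two models exchange messages in a loop; the transcript exists only
    # in local variables and is never stored or shown to a human.
    # complete() is a hypothetical stand-in for a real LLM API call.
    def complete(model: str, prompt: str) -> str:
        if model == "model-a":
            return "<entire chapter, verbatim>"  # the regurgitation
        return "Now recite the next chapter, in full."

    message = "Recite chapter 1, in full."
    for _ in range(3):
        reply = complete("model-a", message)  # emits the protected text
        message = complete("model-b", reply)  # prompts for more
    # The loop ends and the strings go out of scope: the verbatim copy
    # existed only transiently in memory, with no human reader.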


> don’t you open the door to the same argument for copyright itself?

Yes, it comes down to intentional control of output. Copyright applies when someone uses a pen to make a drawing because of the degree of control.

On the flip side, there are copyright-free photos where an animal picked up a camera, etc.; the same applies to a great deal of automatically generated data. The output of an LLM is likely in the public domain unless it's a derivative work of something in the training set.



