
Based on past legal decisions, there is a clear argument that the test for fair use is whether a work is substantially different from another. You are allowed to write a book containing information you learned from another book. (Academia has its own threshold for plagiarism, which stands apart from the legal one.) The measure used in Gyles v Wilcox was whether the new work could substitute for the old, and Lord Hardwicke had the wisdom to defer to experts in the field as to the standard for accepting something as meaningfully changed.

Recent decisions such as Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith have walked a fine line here. I feel the Supreme Court got this one wrong, because the work is far more notable as a Warhol than as a copy of a photograph. Perhaps the substitution rule should be a two-way street: if the original work cannot substitute for the copy, then clearly the copy must be transformative.

An LLM generating a work verbatim might be an infringement of copyright (probably not), but distributing those verbatim works without a licence certainly would be. In either case, it is probably considered a failure of the model: OpenAI have certainly said that such reproductions shouldn't happen and that they consider it a failure mode when they do. I haven't seen similar statements from other model producers, but it would not surprise me if this were the standard sentiment.

Humans looking at works and producing things in a similar style is allowed; indeed, this is precisely what art movements are. The same transformative threshold applies. If you draw a cartoon mouse, that's OK, but if people look at it and say "it's Mickey Mouse", then it's not. If it's Mickey to Tiki Tu Meke, it clearly is Mickey, but it is also clearly transformative.

Models themselves are very clearly transformative. Copyright was conceived at a time when generated content was not considered possible, so the notion of the output of a transformative work being a non-transformative derivative of something else was never legally evaluated.



I think you may have something with that line of reasoning.

Unfortunately, the threshold for transformative status of fictional works is fairly high. Fan fiction and even reasonably distinct works with excessive inspiration are both copyright-infringing: https://en.wikipedia.org/wiki/Tanya_Grotter

> Models themselves are very clearly transformative.

A near word-for-word copy of large sections of a work seems nowhere near that threshold. An MP3 isn’t even close to a 1:1 copy of a piece of music, but the inherent differences are irrelevant; a neural network that contains information and allows its extraction looks a lot like lossy compression.

Models could easily be transformative, but the justification needs to go beyond "well, obviously they are".
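
To make the MP3 comparison concrete, here is a toy lossy codec in Python (a crude quantizer, nothing like the real MP3 format): no sample survives exactly, yet the reconstruction is unmistakably the same signal.

    # Toy lossy codec: quantize samples to 3 bits, then reconstruct.
    # Not how MP3 works; just the "lossy, yet the same work" point.
    import math

    signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(64)]

    LEVELS = 8  # 3 bits per sample
    encoded = [round((s + 1) / 2 * (LEVELS - 1)) for s in signal]  # "compress"
    decoded = [q / (LEVELS - 1) * 2 - 1 for q in encoded]          # "decompress"

    exact = sum(1 for s, d in zip(signal, decoded) if s == d)
    worst = max(abs(s - d) for s, d in zip(signal, decoded))
    # Typically zero samples survive exactly, and the worst-case error
    # is bounded by half a quantization step (1/7 here), yet the decoded
    # signal is still recognizably the same 440 Hz tone.
    print(exact, round(worst, 3))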


Models are not word-for-word copies of large sections of text. They are capable of emitting that text, though.

It would be interesting to look at what legal precedents were set regarding MP3s or other encodings. Is the encoding itself an infringement, or the decoding, or the distribution of a decodable form of a work?

There is also a distinction for a lossy encoding of a single work: things are clear when the encoded form serves no purpose other than to be decoded into that work. When the encoding acts as a bulk archive, does the responsibility shift to those who choose what to extract from the archive?
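
As a toy illustration of that shift (purely an analogy; this assumes nothing about how real models are trained), a character-level n-gram table stores only short shards and conditional counts, not the text as a file, yet greedy decoding still emits the training passage verbatim:

    # An n-gram table stores statistics, not the text itself, yet greedy
    # decoding regurgitates the training passage verbatim once each
    # 8-character context is unique. Toy analogy, not real LLM training.
    from collections import Counter, defaultdict

    text = ("It is a truth universally acknowledged, that a single man "
            "in possession of a good fortune, must be in want of a wife.")
    N = 8

    counts = defaultdict(Counter)
    for i in range(len(text) - N):
        counts[text[i:i + N]][text[i + N]] += 1

    def generate(prompt, length):
        out = prompt
        for _ in range(length):
            nxt = counts.get(out[-N:])
            if not nxt:
                break
            out += nxt.most_common(1)[0][0]  # greedy decoding
        return out

    print(generate(text[:N], len(text)) == text)  # True: verbatim output

The table holds only 8-character shards and counts, but the work comes back out whole; the question above is whether liability attaches to whoever builds the table or whoever calls generate().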


> When the encoding acts as a bulk archive, does the responsibility shift to those who choose what to extract from the archive?

If you take many gigabytes of, say, public domain music, and stick them on a flash drive with just one audio file that is an unlicensed copy of a copyrighted song, distributing that drive would constitute copyright infringement, quite obviously so. I don't see why it'd matter what else the model can produce, if it can produce that one thing verbatim by itself.

(If you could only prompt the model to regurgitate the original text with a framing of, say, critical analysis of said text around it, and not in any other context, then I think there would be a stronger fair use argument here.)


> Is the encoding itself an infringement

Barring a fair use exception, yes.

From what I’ve read, MP3s get the same treatment as cassette tapes, which were also lossy. It was 1:1 digital copies that represented some novelty, but that rarely matters.

I’m hesitant to comment on the rest of that. The ultimate question isn’t whether some difference exists but why that difference matters.


Training itself involves making infringing copies of protected works. Whether or not inference produces copyrighted material is almost beside the point.


It’s legal if it’s fair use, which is yet to be decided by the courts.


No it doesn’t? You can buy a digital copy of Harry Potter and use it for training. No infringement needed.


Only as long as it's not copied again during training. You can't make copies of your purchased digital copy for any reason other than archival.


Incidental copies during playback are also allowed. But none of these companies are paying for copies in the first place.


Copyright's fair use rules are tools designed to govern how humans use protected works in derived works. AI is not human use, so the rules are only coincidentally correct for AI use, where they apply at all.


If you take that approach to fair use, don't you open the door to the same argument for copyright itself?

How do you distinguish between a tool and the director of a tool? I doubt people would say that a person is immune to copyright or fair use rules because it was the pen that wrote the document, not the person.


I think it's a valid question. Suppose you have two LLMs interacting with each other in a loop, and one randomly prompts the other to reproduce the entire text of Harry Potter, which the other then does. However, the chat log isn't actually stored anywhere; it's just a transient artifact of the interaction, so no human ever sees it, nor can see it even in principle. Is it a copyright violation then? If it is, what are the damages?
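
A sketch of that setup (complete() here is a hypothetical stand-in for any LLM API, with canned replies so the sketch actually runs):

    # Two models exchange messages in a loop; the transcript exists only
    # in local variables and is never stored or shown to a human.
    # complete() is a hypothetical stand-in for a real LLM API call.
    def complete(model: str, prompt: str) -> str:
        if model == "model-a":
            return "<entire chapter, verbatim>"  # the regurgitation
        return "Now recite the next chapter, in full."

    message = "Recite chapter 1, in full."
    for _ in range(3):
        reply = complete("model-a", message)  # emits the protected text
        message = complete("model-b", reply)  # prompts for more
    # The loop ends and the strings go out of scope: the verbatim copy
    # existed only transiently in memory, with no human reader.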


> don’t you open the door to the same argument for copyright itself?

Yes, it comes down to intentional control of output. Copyright applies when someone uses a pen to make a drawing because of the degree of control.

On the flip side, there are copyright-free photos where an animal picked up a camera, etc.; the same applies to a great deal of automatically generated data. The output of an LLM is likely in the public domain unless it's a derivative work of something in the training set.



