>Okay, so now we have AI. We can ignore the algorithm entirely and how it works,...

empath75 · 2025-05-12T16:55:32 1747068932

If a human were to reproduce, from memory, a copyrighted work, that would be illegal as well, and multiple people have been sued over it, even when they did it unintentionally.

I'm not talking about learning. I'm talking about the complete reproduction of a copyrighted work. It doesn't matter how it happens.

gruez · 2025-05-12T17:29:50 1747070990

>I'm not talking about learning. I'm talking about the complete reproduction of a copyrighted work. It doesn't matter how it happens.

In that case I don't think there's anything controversial here? Nobody thinks that if you ask AI to reproduce something verbatim, you should get a pass because it's AI. All the controversy in this thread seems to be around the training process and whether that breaks copyright laws.

empath75 · 2025-05-12T19:42:03 1747078923

No -- the controversy is also over whether distributing the weights and software is a copyright violation. I believe it is. The copyrighted material is present in the software in some form, even if the process for regenerating it is quite convoluted.

gruez · 2025-05-12T20:33:22 1747082002

It's not as clear-cut as you think. The courts have held that both google thumbnails and google books are fair use, even though they're far closer to verbatim copies than an AI model.

const_cast · 2025-05-12T21:27:34 1747085254

The reason those are allowed is because they don't compete with the source material. A thumbnail of a movie is never a substitute for a movie.

LLMs seek to be a for-profit replacement for a variety of paid sources. They say "hey, you can get the same thing as Service X for less money with us!"

That's a problem, regardless of how you go about it. It's probably fine if I watch a movie with my friends, who cares. But distributing it over the internet for free is a different issue.

gruez · 2025-05-12T21:50:04 1747086604

>The reason those are allowed is because they don't compete with the source material. A thumbnail of a movie is never a substitute for a movie.

>LLMs seek to be a for-profit replacement for a variety of paid sources. They say "hey, you can get the same thing as Service X for less money with us!"

What's an LLM supposed to be a substitute for? Are people using them to generate entire books or news articles, rather than buying a book or an issue of the new york times? Same goes for movies. No one is substituting marvel movies with sora video.

const_cast · 2025-05-12T21:59:32 1747087172

> Are people using them to generate entire books or news articles, rather than buying a book or an issue of the new york times?

Yes.

> No one is substituting marvel movies with sora video.

Yeah because sora kind of sucks. It's great technology, but turns out text is just a little bit easier to generate than 3D videos.

Once sora gets good, you bet your ass they will.

nickpsecurity · 2025-05-13T02:25:10 1747103110

Whereas, my report showed they were breaking copyright before the training process. Meta was sued for what I said they'd be sued for, too.

Like Napster et al, their data sets make copies of hundreds of GB of copyrighted works without authors' permission. Ex: The Pile, Common Crawl, RefinedWeb, GitHub Pages. Many copyrighted works on the Internet also have strict terms of use. Some have copyright licenses that say personal use only or non-commercial use.

So, like many prior cases, just posting what isn't yours on Hugging Face is already infringement. Copying it from HF to your training cluster is also infringement. It's already illegal until we get laws like Singapore's that allow training on copyrighted works. Even they have a weakness in the access requirement, which might require following terms of use or licenses in the sources.

Only safe routes are public domain, permissive code, and explicit licenses from copyright holders (or those with sub-license permissions).

So, what do you think about the argument that making copies of copyrighted works violates copyright law? That these data sets are themselves copyright violations?