Hacker News

>Okay, so now we have AI. We can ignore the algorithm entirely and how it works, because it's not relevant. There is a large amount of data that it operates on, the weights of the model and so on. You _can_ with the correct prompts, sometimes generate a copy of a copyrighted work, to some degree of fidelity or another.

Suppose we accept all of the above. What does that imply for human learning?



If a human were to reproduce a copyrighted work from memory, that would be illegal as well, and multiple people have been sued over it, even when they did it unintentionally.

I'm not talking about learning. I'm talking about the complete reproduction of a copyrighted work. It doesn't matter how it happens.


>I'm not talking about learning. I'm talking about the complete reproduction of a copyrighted work. It doesn't matter how it happens.

In that case I don't think there's anything controversial here? Nobody thinks that if you ask AI to reproduce something verbatim, you should get a pass because it's AI. All the controversy in this thread seems to be around the training process and whether that breaks copyright laws.


No -- the controversy is also over whether distributing the weights and software is a copyright violation. I believe it is. The copyrighted material is present in the software in some form, even if the process for regenerating it is quite convoluted.


It's not as clear-cut as you think. The courts have held that both Google thumbnails and Google Books are fair use, even though they're far closer to verbatim copies than an AI model.


The reason those are allowed is because they don't compete with the source material. A thumbnail of a movie is never a substitute for a movie.

LLMs seek to be a for-profit replacement for a variety of paid sources. They say "hey, you can get the same thing as Service X for less money with us!"

That's a problem, regardless of how you go about it. It's probably fine if I watch a movie with my friends, who cares. But distributing it over the internet for free is a different issue.


>The reason those are allowed is because they don't compete with the source material. A thumbnail of a movie is never a substitute for a movie.

>LLMs seek to be a for-profit replacement for a variety of paid sources. They say "hey, you can get the same thing as Service X for less money with us!"

What's an LLM supposed to be a substitute for? Are people using them to generate entire books or news articles, rather than buying a book or an issue of the new york times? Same goes for movies. No one is substituting marvel movies with sora video.


> Are people using them to generate entire books or news articles, rather than buying a book or an issue of the new york times?

Yes.

> No one is substituting marvel movies with sora video.

Yeah because sora kind of sucks. It's great technology, but turns out text is just a little bit easier to generate than 3D videos.

Once sora gets good, you bet your ass they will.


Whereas, my report showed they were breaking copyright before the training process. Meta was sued for what I said they'd be sued for, too.

Like Napster et al, their data sets make copies of hundreds of GB of copyrighted works without authors' permission. Ex: The Pile, Common Crawl, RefinedWeb, GitHub Pages. Many copyrighted works on the Internet also have strict terms of use. Some have copyright licenses that say personal use only or non-commercial use.

So, like many prior cases, just posting what isn't yours on Hugging Face is already infringement. Copying it from HF to your training cluster is also infringement. It's already illegal until we get laws like Singapore's that allow training on copyrighted works. Even they have a weakness in the access requirement, which might require following the terms of use or licenses of the sources.

The only safe routes are public domain, permissive code, and explicit licenses from copyright holders (or those with sub-license permissions).

So, what do you think about the argument that making copies of copyrighted works violates copyright law? That these data sets are themselves copyright violations?



