
You really don't see the difference between Google indexing the content of third parties and directly hosting/distributing the content itself?


Hosting model weights is not hosting / distributing the content.


Of course it is.

It's just a form of compression.

If I train an autoencoder on an image, and distribute the weights, that would obviously be the same as distributing the content. Just because the content is commingled with lots of other content doesn't make it disappear.
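To make that concrete, here's a minimal sketch (assuming PyTorch, with random pixels standing in for a real image): overfit a tiny autoencoder on a single image, and the trained weights alone will reproduce that image on demand.

    # Minimal sketch: overfit a tiny autoencoder to one "image" so that the
    # trained weights can reproduce it (random pixels here stand in for a
    # real copyrighted image).
    import torch
    import torch.nn as nn

    image = torch.rand(3 * 32 * 32)               # stand-in for a 32x32 RGB image
    model = nn.Sequential(
        nn.Linear(3 * 32 * 32, 64), nn.ReLU(),    # encoder: squeeze to 64 numbers
        nn.Linear(64, 3 * 32 * 32), nn.Sigmoid()  # decoder: rebuild the image
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(2000):                         # memorize the single image
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(image), image)
        loss.backward()
        opt.step()

    reconstruction = model(image)                 # a close copy, stored in the weights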

Besides, where did the sections of text from the input works that show up in the output text come from? Divine inspiration? God whispering to the machine?


Indeed! It is a form of massive lossy compression.

> Llama 3 70B was trained on 15 trillion tokens

That's roughly a 200x "compression" ratio, compared to 3-7x for traditional lossless text compression like bzip2 and friends.
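Back-of-the-envelope, if you take the ratio to be simply training tokens divided by parameter count (byte-level accounting would shift it a bit, since tokens and weights occupy different numbers of bytes):

    # Rough check of the "~200x" figure under that simple assumption
    training_tokens = 15e12              # 15 trillion tokens, per the quote above
    parameters = 70e9                    # Llama 3 70B: ~70 billion weights
    print(training_tokens / parameters)  # ~214, i.e. roughly 200x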

LLMs don't just compress, they generalize. If they could only recite Harry Potter perfectly but couldn't write code or explain math, they wouldn't be very useful.


But LLMs can't write code or explain math; they only plagiarize existing code and existing explanations of math.


[flagged]


> For one thing, they are probabilistic, so you wouldn't get the same content back every time like you would with a compression algorithm.

There is nothing inherently probabilistic in a neural network. The neural net always outputs the exact same values for the same input. We typically use those values in a larger program as probabilities over tokens, but that is not required to get data out. You could just as easily deterministically take the output with the highest value, and add some extra rule for when multiple outputs have the exact same value (e.g. pick the one from the output neuron with the lowest index).
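A minimal sketch of that decoding rule, assuming a hypothetical `logits` array from a single forward pass:

    # Deterministic "greedy" pick: highest value wins, ties go to the lowest index.
    import numpy as np

    def pick_token(logits: np.ndarray) -> int:
        # np.argmax returns the lowest index among ties, matching the rule above,
        # so the same logits always yield the same token.
        return int(np.argmax(logits))

    logits = np.array([0.1, 0.7, 0.7, 0.2])
    print(pick_token(logits))  # -> 1 (not 2): same input, same output, every time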


I have, but I never tried to make any money off of it either


> For one thing, they are probabilistic, so you wouldn't get the same content back every time like you would with a compression algorithm.

If I make a compression algorithm that randomly changes some pixels, can I use it to distribute pirated movies?


> Have you ever repeated a line from your favorite movie or TV show? Memorized a poem? Guess the rights holders better sue you for stealing their content by encoding it in your wetware neural network.

I see this absolute non-argument regurgitated ad infinitum in every single discussion on this topic, and at this point I can't help but wonder: doesn't it say more about the person who says it than anything else?

Do you really consider your own human speech no different than that of a computer algorithm doing a bunch of matrix operations and outputting numbers that then get turned into text? Do you truly believe ChatGPT deserves the same rights to freedom of speech as you do?


Who said anything about freedom of speech? Nobody is claiming the LLM has free speech rights, which don't even apply to infringing copyright anyway. Freedom of speech doesn't give me the right to make copies of copyrighted works.

The question is whether the model weights constitute a copy of the work. I contend that they do not, or if they do, then so do the analogous weights (reinforced neural pathways) in your brain, which is clearly absurd and is intended to demonstrate the absurdity of considering a probabilistic weighting that produces similar text to be a copy.


> Freedom of speech doesn't give me the right to make copies of copyrighted works.

No, but it gives you the right to quote a line from a movie or TV show without being charged with copyright infringement. You argued that an LLM deserves that same right, even if you didn't realize it.

> than so do the analogous weights (reinforced neural pathways) in your brain

Did your brain consume millions of copyrighted books in order to develop into what it is today? Would your brain be unable to exist in its current form if it had not consumed those millions of books?


Millions? No, but my brain certainly consumed thousands of books, movies, TV shows, pieces of music, artworks, and other copyrighted material. Where is the cutoff? Can I only consume 999,999 copyrighted works before I'm no longer allowed to remember something without infringing copyright? My brain definitely would not exist in its current form without consuming that material. It would exist in some form, but it would without a doubt be different from what it is, having consumed the material.

An LLM is not a person and does not deserve any rights. People have rights, including the right to use tools like LLMs without having to grease the palm of every grubby rights holder (or their great-great-grandchild) just because it turns out their work was so trite and predictable it could be reproduced by simply guessing the next most likely token.


I can remember and I can quote, but if I quote too much I violate the copyright.

This is literally why I don't like to work on proprietary code: when I need to create a similar solution for someone else, I have to go out of my way to make sure I do it differently. People have been sued over this.


> just because it turns out their work was so trite and predictable it could be reproduced by simply guessing the next most likely token.

Well, if you have no idea how LLMs work, you could've just said so.


Making personal copies is generally permitted. If I were to distribute the neural pathways in my brain enabling others to reproduce copyrighted works verbatim, the owners of the copyrighted works would have a case against me.


Repeating half of the book verbatim is not nearly the same as repeating a line.


If you prompt the LLM to output a book verbatim, then you violated the copyright, not the LLM. Just like if you take a book to a copier and make a copy of it, you are violating the copyright, not Xerox.


What if the printer had a button that printed a copy of the book on demand?


Difference is if it's used commercially or not. Me singing my favourite song at karaoke is fine, but me recording that and releasing it on Spotify is not


[flagged]


No, the second point does not concede the argument. You were talking about the model output infringing the copyright; the second point is about the model input infringing it, e.g. if they made unauthorized copies in the process of gathering data to train the model, such as by pirating the content. That is unrelated to whether the model output is infringing.

You don't seem to be in a very good position to judge what is and is not obtuse.


I would be inclined to agree except apparently 42% of the first Harry Potter book is encoded in the model weights...


Where are they putting any blame on Google here?


Where did I say they were?


When you juxtaposed Google indexing with third parties hosting the content...?


The way I see it, an LLM took search results and output that info directly. Besides, I think that if an LLM is able to reproduce 42%, assuming it is not contiguous, I would say that is fair use.



