I'd make the case that if training an A.I. on text is a copyright infringement, then training a human to use language or become an expert in a field by watching TV, listening to the radio, and reading books is copyright infringement too.
For instance, warblogger Denys Davydov boasts that he uses a VPN to circumvent restrictions so he can watch the TV show Friends to improve his English. On top of any other copyright violation that is going on, his ability to speak English belongs to Warner Media.
Copyright infringement is about making a copy. It is not making a copy if you read a book. It is not making a copy if you watch TV. It is not making a copy if you listen to the radio. If Denys makes a video/audio file of himself quoting a large segment from "Friends", that's copyright infringement.
When machines train on a corpus of data, that corpus is itself a copy and is almost certainly infringing (depending on how it was acquired). When machines use their trained model to reproduce copyrighted works verbatim, that output is an infringing copy (unless a license to reproduce the copyrighted material was acquired).
When I read a book I read a "copy" of that book, which was typically reproduced on paper under contract with the copyright owner.
The model is not capable of reproducing the input verbatim (not enough synapses to do that, and it wouldn't generalize properly if there were), although the training process does involve producing a cleaned-up text which is typically retained. (The point of building your language skills from watching "Friends", or of an LLM developing linguistic/world-knowledge competence or the simulation thereof from reading text, is not that the text can be regurgitated, but that you or the model develops a generalized capacity.)
(... if LLMs were able to remember everything they read, why do we always catch them confabulating?)
It would be possible in principle to point a camera at the book, OCR it in real time, and feed that into the training process. Of course this would be slow and wouldn't benefit from the cleanup process that goes into making a training set.
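To make that concrete, here's a minimal sketch of the camera-to-corpus idea, assuming OpenCV and pytesseract (both real libraries); the device index, frame count, and the corpus-appending step are illustrative, not a production pipeline:

```python
# Sketch: capture page images from a camera and OCR them into raw text.
# Assumes opencv-python and pytesseract are installed (plus the Tesseract
# binary). The output would feed a training corpus without the cleanup
# and deduplication that curated datasets get.
import cv2
import pytesseract

def ocr_pages_from_camera(device: int = 0, frames: int = 10) -> list[str]:
    cap = cv2.VideoCapture(device)
    texts = []
    try:
        for _ in range(frames):
            ok, frame = cap.read()
            if not ok:
                break
            # Grayscale generally improves OCR accuracy on printed text.
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            texts.append(pytesseract.image_to_string(gray))
    finally:
        cap.release()
    return texts
```

Even this toy version shows why nobody does it at scale: you get one noisy page at a time instead of a cleaned, deduplicated text dump.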
Note that somebody could stick a fine grid of wires into my V1 visual cortex and extract something that looks very much like an image of what I am focusing on at the moment. Thus, for just a few milliseconds an image "copy" of a text I am reading exists in my brain.
That'd be a very weak case, as it rests entirely on a superficial comparison. And since AI isn't human and doesn't train like a human, it doesn't make much sense to equate the two processes and apply regulations made for humans wholesale.
It's not superficial in the slightest. There is a cult of creativity that leads people to say things like "a neural network can never do anything creative", that it just parrots back exactly what it's been told. But no, it has an ability to generalize. That's why diffusion models trained to draw pornography draw girls with seven belly buttons, and why you don't have to worry that ChatGPT will tell you how to build a bomb: it will tell you how to build a hydrogen bomb out of shaving cream and ski wax.
When I read a book it changes the connection weights between the synapses in my brain. I don't retain an exact copy, but I recall the gist of the book and some facts (imperfectly). I have the ability to generalize this information (I can read The Tale of Genji and retell a story based on it that is set on a Mars colony in 4712 AD), and it can improve my linguistic abilities in general.
One of the huge missing parts is meaning, and rephrasing your reading of a book into AI "weights" language is, sorry, still very superficial, since these aren't the same thing. You don't have some simple gradient-descent optimization function in your brain, so you can't substantively say "oh, our brain just optimizes for something, just like AI, so it's the same learning process, so the same rules should apply".
Even people like Jürgen Habermas don't really know what they mean when they say "meaning"; it's one of the most problematic concepts in all of philosophy.
Humans read stuff and think they understood the "meaning" but they really didn't. If you don't believe me, try grading a quiz.
In the case of A.I. engineering, a very useful form of "understanding" is decomposition into facts. For instance, if an A.I. reads an article about a sports game, it ought to be able to generate a database entry saying who played, where, who won, what the score was, etc.
It sounds very easy, but this kind of information extraction is still preparadigmatic; I know because I've tried to build this kind of system. LLMs beat older systems at it, but only because older systems fail at it completely.
In the case of the game, for instance, simple pattern matching falls down for quite a few reasons, not least that a good account of a game will probably give the same kind of facts for the last game team A played and the last game team B played, and also the last game A and B played against each other, not to mention league-standing information that might superficially look like the score of a game.
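For what it's worth, here's a hedged sketch of what LLM-based extraction for the game example might look like, assuming the openai>=1.0 Python client; the model name, prompt wording, and the score-sanity check are illustrative, not the system I built:

```python
# Sketch: extract game facts from a recap into a database-ready record.
# Assumes the openai>=1.0 client and OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = """Read the recap below and return only JSON with keys:
home_team, away_team, winner, home_score, away_score, venue, date.
Report only the game the recap is primarily about -- not earlier games
it mentions, and not league standings. Use null for anything unstated.

Recap:
{recap}
"""

def extract_game_facts(recap: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": PROMPT.format(recap=recap)}],
        response_format={"type": "json_object"},
    )
    facts = json.loads(resp.choices[0].message.content)
    # Guard against the failure mode described above: a score-shaped
    # number that is really a standing or a previous game's score.
    for key in ("home_score", "away_score"):
        if not isinstance(facts.get(key), int):
            facts[key] = None
    return facts
```

Note that most of the real difficulty lives in the prompt's caveats and the validation step, which is exactly the "which game is this fact about" problem that sinks naive pattern matching.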
That kind of "meaning" is explicitly allowed to be copied under copyright law; see the idea-expression distinction, under which facts and ideas are not protected, only their expression.
The fact that Jürgen doesn't know doesn't help your argument. You're the one asserting an identity in order to apply the same laws in the same way, and if you don't know about some relevant property of human brain activity, you can't claim it's identical to whatever AI is doing.