The right to copy public information to read it does not grant the right to copy public information to feed it into a for-profit system to make an LLM that cannot function without the collective material that you took.
That's the debatable bit, isn't it? I will keep repeating that I really don't see a difference between this and someone reading a bunch of books/articles/blog posts/tech notes/etc etc and becoming a proficient writer themselves, even though they paid nothing for any of these and never asked for permission. So what's the difference? The fact that AI can do it faster?
If people used the correct term for it, "lossy compression", then it would be clearer that yeah, definitely there's a line where systems like these are violating copyright and the only questions are:
1. Where is the line at which lossy compression violates copyright?
2. Where are systems like ChatGPT relative to that line?
I don't know that it's unreasonable to answer (1) by saying that even an extremely lossy compression can violate copyright. I mean, if I take your high-res 100MB photo and downsample it to something much smaller, losing even 99% of it, distributing the result could still violate your copyright.
Again, how is that different than me reading a book then giving you the abridged version of it, perhaps by explaining it orally? Isn't that the same? I also performed a "lossy compression" in my brain to do this.
> is that different than me reading a book then giving you the abridged version of it, perhaps by explaining it orally?
That seems like a bad example; I think you are probably free to even read the book out loud in its entirety to me.
Are you able to record yourself doing that and sell it as an audiobook?
What if you do that, but change one word on each page to a synonym of that word?
10% of words to synonyms?
10% of paragraphs rephrased?
Each chapter just summarized?
The first point, which seems easier to agree on, isn't really about where the specific line is. It's just a recognition that there is some point past which we can all agree that such a system is copying. The interesting question is then where the boundaries of the grey area are (i.e. which points we agree are copying, which points we agree are not, and the stretch in between where we disagree or can't decide).
In one case, you are doing it, and society is fine with that because a human being has inherent limitations. In the other case, a machine is doing it, and it has a different set of limitations, which gives it vastly different abilities. That is the fundamental difference.
This also played out in the Street View debate: someone standing in public areas taking pictures of their surroundings? No problem! An automated machine being driven around by a megacorp on every single street? Big problem.
There's an unstated assumption that some authors of blog posts have: if I make my post sufficiently complex, other humans will be compelled to link to my post and not rip it off by just paraphrasing it or duplicating it when somebody has a question my post can answer.
Now, with AIs, this assumption no longer holds, and people are miffed that their work won't lead to engagement with their material, and the followers, stars, acknowledgement, validation, etc. that come with that?
Either that or a fundamental misunderstanding of natural vs. legal rights.
- A human will be an organic visitor that can be advertised to. A bot is useless.
- A human can one day be hired for their skills. An AI will always be in control of some other corporate entity.
- Volume and speed are a factor. It's the buffet metaphor: "all you can eat" only works as long as it's a reasonable amount for a human to eat in a meal. Meanwhile, a bot will in fact "eat it all" and everyone loses.
- Lastly, commercial value applies to humans and bots. Even as a human I cannot simply rehost an article on my own site, especially if I pretend I wrote it. I might get away with it if it's just some simple blog, but if I'm pointing to a Patreon and running ads, I'll be in just as much trouble as a bot.
> I really don't see a difference between this and someone reading a bunch of books/articles/blog posts/tech notes/etc etc and becoming a proficient writer themselves
Tangential, but I should note that you in fact cannot just apply/implement everything you read. That's the entire reason for the copyright system. Always read the license or try to find a patent before doing anything commercially.
To me it's more like photocopying the contents of a thousand public libraries and then charging people for access to your private library. AI is different because you're creating a permanent, hard copy of the copyrighted works in your model vs. someone reading a bunch of material and struggling to recall the material.