> But if openai had started out by seeking permission to train on any and every piece of content out there...
But why would anyone seek permission to use public data? Unless you've got Terms and Conditions on reading your website or you gatekeep it to registered users, it's public information, isn't it? Isn't public information what makes the web great? I just don't understand why people are upset about public data being used by AI (or literally anything else. Like open source, you can't choose who can use the information you're providing).
In the case being discussed here, it's obviously different, they used the voice of a particular person without their consent for profit. That's a totally separate discussion.
>why would anyone seek permission to use public data?
first of all it's not all public data. software licenses should already establish that just because something is on the internet doesn't mean it's fair game.
Even if you want to bring up an archive of the pre-lawsuit TOS, I'd be surprised if it wasn't mostly the same TOS for decades. OpenAI didn't care.
>Isn't public information what makes the web great?
no. Twitter is "public information" (not really, but I'll go with your informal definition here). If that's what "public information" becomes then maybe we should curate for quality instead of quantity.
Spam is also public information and I don't need to explain how that only makes the internet worse. and honestly, that's what AI will become if left unchecked.
> Like open source, you can't choose who can use the information you're providing
That's literally what software licenses are for. You can't stop people from ignoring your license, but breaking that license opens you wide open for lawsuits.
The right to copy public information to read it does not grant the right to copy public information to feed it into a for-profit system to make an LLM that cannot function without the collective material that you took.
That's the debatable bit, isn't it. I will keep repeating that I really don't see a difference between this and someone reading a bunch of books/articles/blog posts/tech notes/etc etc and becoming a proficient writer themselves, even though they paid exactly 0 money to any of these or even asked for permission. So what's the difference? The fact that AI can do it faster?
If people used the correct term for it, "lossy compression", then it would be clearer that yeah, definitely there's a line where systems like these are violating copyright and the only questions are:
1. where is the line at which lossy compression violates copyright?
2. where are systems like chatgpt relative to that line?
I don't know that it's unreasonable to answer (1) by saying that even an extremely lossy compression can violate copyright. I mean, if I take your high-res 100MB photo and downsample it to something much smaller, losing even 99% of it, distributing that could still violate your copyright.
Again, how is that different than me reading a book then giving you the abridged version of it, perhaps by explaining it orally? Isn't that the same? I also performed a "lossy compression" in my brain to do this.
> is that different than me reading a book then giving you the abridged version of it, perhaps by explaining it orally?
That seems like a bad example, I think you are probably free to even read the book out loud in its entirety to me.
Are you able to record yourself doing that and sell it as an audiobook?
What if you do that, but change one word on each page to a synonym of that word?
10% of words to synonyms?
10% of paragraphs rephrased?
Each chapter just summarized?
The first point, which seems easier to agree on, isn't really about the specific line. It's just a recognition that there is some point past which we can all agree such a system is copying. The interesting question is then where the boundaries of the grey area are (i.e. which points on that line we agree are and aren't copying, with a grey area between them where we disagree or can't decide).
In one case, you are doing it and society is fine with that because a human being has inherent limitations. In the other case, a machine is doing it, and it has a different set of limitations, which gives it vastly different abilities. That is the fundamental difference.
This also played out in the streetview debate - someone standing in public areas taking pictures of surroundings? No problem! An automated machine being driven around by a megacorp on every single street? Big problem.
There's an unstated assumption that some authors of blog posts have: if I make my post sufficiently complex, other humans will be compelled to link to my post and not rip it off by just paraphrasing it or duplicating it when somebody has a question my post can answer.
Now with AIs this assumption no longer holds and people are miffed that their work won't lead to engagement with their material, and the followers, stars, acknowledgement, validation, etc. that come with that?
Either that or a fundamental misunderstanding of natural vs. legal rights.
- A human will be an organic visitor that can be advertised to. A bot is useless.
- A human can one day be hired for their skills. An AI will always be in control of some other corporate entity.
- Volume and speed are factors. It's the buffet metaphor: "all you can eat" only works as long as it's a reasonable amount for a human to eat in a meal. Meanwhile, a bot will in fact "eat it all" and everyone loses.
- Lastly, commercial value applies to humans and bots. Even as a human I cannot simply rehost an article on my own site, especially if I pretend I read it. I might get away with it if it's just some simple blog, but if I'm pointing to patreons and running ads, I'll be in just as much trouble as a bot.
> I really don't see a difference between this and someone reading a bunch of books/articles/blog posts/tech notes/etc etc and becoming a proficient writer themselves
tangential, but I should note that you in fact cannot just apply/implement everything you read. That's the entire reason for the copyright system. Always read the license or try to find a patent before doing anything commercially.
To me it's more like photocopying the contents of a thousand public libraries and then charging people for access to your private library. AI is different because you're creating a permanent, hard copy of the copyrighted works in your model vs. someone reading a bunch of material and struggling to recall the material.