
"Fair use" only applies to instances of copying / redistributing. The hint is in the name: copy-right.

There's a notion, which seems to have taken off among creators who are paranoid about AI eating their livelihoods (which it might eat a chunk of), that copyright prevents people from doing anything with works they [legally] acquired other than personally reading, listening to, or watching them.

That's not how copyright, as it has existed in the past, works. You can do all the algorithmic processing of your ebook collection that you want. You might be able to display small portions of a book to others, depending on the situation.

Quoting one or two paragraphs out of an entire book seems like reasonably safe fair use, but that won't stop a copyright-maximalist creator (or their publisher) from suing you, and won't stop some copyright-maximalist judge from ruling against you, so it's probably best to minimize the amount of content from a book that you redisplay directly. But you can do all the analysis and statistics generation you want, and display those results to others.

It remains to be seen what judges will do with AI generation of works based on ingesting gigantic amounts of copyrighted work. The entire framework of copyright is going to be broken, and until Congress steps in and changes it, judges are going to go every which way. There's no bright line for 4-factor analysis; it's always been a gut-level "is this a reasonable use that doesn't impact commercial sales too much". There's no possible rational way to draw a line. AI models can generate a painting of a new subject only loosely in the style of a contemporary painter, which would not be copyright infringement, or it can generate a near-clone of an existing work with the right prompting, and depending on how clever the prompter is, a lot of intermediate stages of likeness. Who decides how close to an existing work is too close?



Say I make a tool where you can enter the title of a book, and get the full text of the book without paying for it. I assume we all agree that would be illegal, right?

Now say that instead of distributing that tool as an executable, I distributed it as a library. It would contain all the same books as the illegal executable above, but some developers would need to write an actual executable that would use the library. Obviously that executable would be illegal (same as the first one above), but would the library be illegal? I would think that we all reasonably agree that my library would be illegal, because I would essentially be selling the content of books for which I don't own the copyright.

Now what happens when I distribute a trained model that has the potential to generate illegal work? Is it suddenly legal for me to sell it, because some developers write legal executables with it? How is that different from my library example above?


Google Books already does exactly this. It has a library of the full text of millions of books. Users can search for a passage of text and Google will display the paragraph where the passage is found.

https://books.google.com

example:

https://i.ibb.co/DCxJpHN/IMG-3143.jpg


Google does not provide the full book, does it? Exactly like they could provide a few seconds of a song, but not the song in its entirety.


No, they don't provide the full book, just a few sentences before and after your search prompt (same as Prosecraft).

In both cases, however, if you had the patience, you could search the last few words of the text returned from your prior query and slowly work your way through the entire book.


A few sentences? For most books I've seen, it's a few pages. Google will block you from retrieving more pages from the same book eventually. Using a VPN and a different account may get around one limit, but I experimented with multiple VPNs and browsers once, and although I was able to get a majority of a book's pages, after that Google stopped showing me full previews of any of the remaining pages, no matter where the request came from.


It shows you a few pages if you are previewing the book (i.e. "look inside"). But if you are using search, it will show you where your search query shows up in the book, no matter what page the query is found on. This means you could theoretically search a book sentence by sentence, and it will eventually have shown you the entire book. I'm not claiming this is an efficient or practical way to game the system and read books, only that Google Books does contain the full copy of the book text and can reveal the contents of any passage. This is basically how Prosecraft works (at least what I glean from the article): it doesn't let you read a whole book, even though it may contain a representation of the full text.
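The "work your way through the book" trick described above can be sketched mechanically. Below is a toy simulation, assuming a hypothetical `snippet_search` service that returns a matched passage plus a few trailing words of context; all names and the sample text are invented, and a real service would of course rate-limit this.

```python
# Toy simulation of reconstructing a full text from overlapping snippet
# searches. `snippet_search` stands in for a hypothetical book-search
# service that returns the match plus a little trailing context.

BOOK = ("Call me Ishmael Some years ago never mind how long precisely "
        "having little or no money in my purse I thought I would sail "
        "about a little").split()

def snippet_search(query_words, context=4):
    """Return the matched words plus up to `context` following words, or None."""
    n = len(query_words)
    for i in range(len(BOOK) - n + 1):
        if BOOK[i:i + n] == query_words:
            return BOOK[i:i + n + context]
    return None

def extract_full_text(seed, tail=3):
    """Recover the whole book by repeatedly searching the snippet's tail."""
    recovered = snippet_search(seed)
    while True:
        nxt = snippet_search(recovered[-tail:])
        if nxt is None or len(nxt) <= tail:  # no new words came back
            break
        recovered += nxt[tail:]              # append only the new words
    return " ".join(recovered)
```

Starting from a short seed phrase, the loop keeps querying the last few returned words until no new context comes back, at which point the entire text has been revealed, which is the impracticality-but-possibility the comment describes.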


Sure. I really did not mean that specifically for Prosecraft. But the article questions why authors are attacking Prosecraft "because it does no harm". My answer is that authors don't (and can't, really) make that distinction on a case-by-case basis.

At this point, what they see is that LLMs are able to generate material similar to theirs precisely because their copyrighted material was used in the training (that part is important!), and they see that they won't get paid for it. Of course they are scared, and they should be. And of course they will now start attacking everything that looks like it is using their copyrighted material as training data.

I really don't get why the engineering world does not get this: LLMs have the potential to ruin people's jobs, it is not clear at all that this is legit (IMO LLMs could not do it without the copyrighted material they used for training, therefore they are derivatives of the original work), and those people are rightfully scared.


I think most people who have thought about it understand the impact AI models seem destined to have on writing (and digital 2d art, soon music, and later other things). In addition to writers and voice artists panicking, see the Hollywood strikes, for instance, and what's currently happening in the corporate world to digital artists.

Copyright is not the correct tool to address it. In the U.S., the basis for copyright law is a clause in the constitution which allows Congress to pass laws to secure for limited time creators' exclusive rights to their creations, for the purpose of "promot[ing] the progress of science and useful arts."

If AI is going to displace nearly all of that content, at least content that's mostly entertainment and less educational (novels, music, images, eventually movies), doesn't that imply that AI is better at generating useful entertainment than humans are? Aren't hypothetical laws restricting training of AI on past works to protect the market for future human-created works a direct violation of the constitution, if AI is better at generating "useful" entertainment (entertainment that people will buy) than humans are?

To whatever extent AI can eventually displace nonfiction content in the sciences and liberal arts, which is more indisputably useful, the same argument will apply; however, it applies to fiction content first, because current AI models have more trouble with facts (trouble in the form of hallucinations, for instance). The threat in non-fiction areas of creativity is more distant than the threat in fictional areas of creativity which are not as bound to reality.


> I think most people who have thought about it understand the impact AI models seem destined to have on writing

Go back to the beginning of social media, and tell me that "most people who had thought about it had understood the impact social media would have on society". It is really not a given. And that is my criticism: we see from history that it is not straightforward to understand the impact of new technology, but we engineers keep making the same mistakes over and over again.

> Copyright is not the correct tool to address it

Maybe not, that's right. I don't think anyone disagrees. The issue - at least from the point of view of artists - is more that some people (including authors and artists) want the problem addressed, and others (including engineers) just want to make money with their new toy and don't care much about addressing the problem.

> doesn't that imply that AI is better at generating useful entertainment than humans are?

I don't think so, no. It is maybe economically more successful, but I think it is clear that what is good for the economy is not necessarily good for society.

> however, it applies to fiction content first

Well... that is ignoring all the black hat use-cases, going from phishing to political mass manipulation, I would say :-)


Why didn't you just say that, instead of posing a hypothetical about software that may itself contain full book text which can be used to display (in this case fair-use) passages to end users? lol

I think the disconnect between your point of view and mine is that I see "training an LLM on copyrighted text" the same as a person reading copyrighted text, which is perfectly legal. And I see violating copyright as a person or LLM reproducing copyrighted work (illegal). But using other works as inspiration for something novel shouldn't be considered illegal, whether a person or LLM produced the work. I would even be fine with literature being treated more like music, where reproducing the essence of a piece of work (i.e. doesn't have to be a word for word reproduction) is considered a violation. But if the LLM creates something completely new, how is that a derivative work / infringement?


> Why didn't you just say that, instead of

Because I was answering a post that was talking about drawing the line for fair use. I just shared my view of how I see it. To me, OpenAI should be responsible for not giving copyrighted material to users if they are not allowed to do so. This means that they should be sued every single time someone manages to extract what is considered copyrighted material from their software. Because the authors never gave them that right.

Your Google Books example is different: the most obvious difference is that Google Books does not pretend that the content is theirs: they clearly say "here is a passage of this book".

> I see "training an LLM on copyrighted text" the same as a person reading copyrighted text

Yes, I think that is the main discussion point around LLMs. My point is that machines are not humans, and therefore they should not be blindly treated like humans. We should think about the consequence of the machines doing what they do, and decide whether that is legal or not in our society. Otherwise we would give machines the right to vote ("humans can vote, I don't see why machines couldn't").


> This means that OpenAI should be sued every single time someone manages to extract what is considered as copyrighted material from their software.

I agree! If GPT4 is outputting copyrighted material beyond what is considered fair-use (i.e. substantively more than what is provided by say, google books), I agree that is copyright infringement.

Indeed it is about the output, and making stuff available that people would otherwise have to pay for (or more precisely, enough of the copyrighted work that a person would have reason not to pay for the original work, causing a material loss to the original author) - that is a fineable violation imo.

Something else to think about... I work in biotech and have published articles in scientific journals on cellular and molecular level disease sequelae (such articles are also protected by copyright). Models trained on scientific literature are now being used for novel drug discovery and disease treatment pathways. These models are already outputting suggestions that seem very promising. Shall we also not provide these models access to the full corpus of scientific literature? It would significantly handicap these models to not have access to copyrighted scientific works. On one hand, some proportion of researchers will retain their jobs that would have otherwise been outsourced to LLMs (perhaps even myself). On the other hand, some amount of future patients will suffer or die from a disease that would have otherwise been cured.


That actually brings another point: if you train LLMs on scientific papers, at least in some domains it will make it easier to write a lot of papers. I am not an academic, but it is already my impression that there are a lot of low-quality papers out there. What if now many more get generated by LLMs? Won't that be a problem?


The low quality problem with primary research publications is not the writing but poor experimental design, misrepresenting experimental results, shoddy statistical analysis, and putting null results into file cabinets. Summarizing research findings isn't the crux of the problem, so if anything if an LLM can help the author perform a clearer and more concise writeup I'd see it as a net benefit.


One can use Photoshop to create heinous things which would be highly illegal to sell, or even show. Should we ban Photoshop? I think we agree that, no, that would be silly.

Let's assume I train a model on public domain texts only, zero copyrighted material has gone into it. Yet, technically and also practically, the model would have the ability to create illegal, copyright infringing content, given the right input. Should distribution of such a model be illegal? I think not.

Also, let's keep in mind that these models are not archives that contain the original data verbatim. They are effectively lossy compression algorithms that capture the essence. Some models are a few tens of GB in size, yet have been trained on TBs worth of data, so clearly a significant data loss has occurred; that makes it a bad analogy to your "keep a bunch of books as in a library" example.


> One can use Photoshop to create heinous things which would be highly illegal to sell, or even show. Should we ban Photoshop? I think we agree, that no, that would be silly.

This is a different matter, just like saying "we don't make cars illegal even though cars can be used illegally". IMO it is off topic in the context of my post above (or you need to elaborate).

> Let's assume I train a model on public domain texts only, zero copyrighted material has gone into it.

Do you genuinely not see the difference, from the point of view of the copyright holders? If OpenAI could build a version of ChatGPT without using any copyrighted material at all, I agree with you: it should not be made illegal. But the illegal use of it should be. I guess we agree here, but to me that is quite different from training ChatGPT with copyrighted material. The first difference being... well... good luck training ChatGPT without copyrighted material (probably now it's too late anyway, because weights of models trained from copyrighted material are just all over the Internet).

> Also, let's keep in mind that these models are not archives, that contain the original data verbatim. They are effectively lossy compression algorithms, that capture the essence.

Say I creat excutable dat do som kind of loss cmpression tht captred essnce.

First, do you agree that the sentence above is "some kind of lossy compression that captures the essence"? If yes, would you consider it legal for me to use that algorithm on famous books and sell them under my name, or would you think that I abused the copyright of the original material?
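For what it's worth, a disemvoweled sentence like the one above can be produced mechanically. Here is a toy sketch of such a "lossy compressor" (an invented illustration of the analogy, not how any real model stores text):

```python
# Toy "lossy compressor": strip interior vowels, producing text like the
# garbled sentence above. The original cannot be perfectly recovered,
# yet the "essence" remains readable.
def disemvowel(text):
    """Drop vowels after the first letter of each word."""
    out = []
    for word in text.split():
        head, rest = word[0], word[1:]
        out.append(head + "".join(c for c in rest if c.lower() not in "aeiou"))
    return " ".join(out)
```

The output is strictly shorter than the input and the mapping is not invertible, which is exactly what makes it lossy.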


A car has the potential to be used for illegal activities. In fact it happens every second, but that does not mean we go out of our way to make cars illegal.

That said, it is still an open question if the content generated by LLM is illegal. It only generates content that is inspired by some authors work, it is not recreating their work, only their style. Humans copy each other all the time in writing, music, art, you name it.


Either you completely missed my point, or you genuinely believe that my first executable is legal.

Which one is it?


I do not buy the premise. Your example states that it can provide the full text of any book. An LLM cannot do that. It can produce something in the same style and setting. When an actual human author mimics other writers' styles, it is not illegal, so why exactly should it be illegal for an author to use an LLM to do it?


> When an actual human author mimics other writers' styles, it is not illegal, so why exactly should it be illegal for an author to use an LLM to do it?

There is a fundamental difference of scale. Say I write a blog post about some technical thing I know. You read it, learn from it (and other sources), and then you write your own blog post with your understanding. You may link to my post (if you believe your post is heavily inspired by it), or not. But reading my post and understanding the material to the point where you could write about it took time. If you are so fast that you can produce more posts than I do, and your posts are of the same quality, then suddenly you become a better blogger than me.

An LLM can automate that. With an LLM, you as a blogger don't have to learn from my post, you can just have it generate the post (you don't even need to read it). In that case, you have put absolutely zero effort learning about the topic, you just had some algorithm copy-paste-and-modify my post. Don't you think that if you publish a blog post that you haven't even read yourself, probably it comes from material written by somebody else at some point?

In any case it is very different: in one case a human does it, in the other a machine does. I don't understand why engineers think we can just transfer concepts between the two. You as a human have rights (maybe you can vote), but your LLM does not. Why couldn't an LLM vote if humans can? Precisely because it is different, and it creates new problems.

The fact that LLMs can automatically (and at scale) launder copyrighted material is a new problem. Therefore we need to think about that, and to think about whether or not this should be legal.

The day people stop reading my blog and instead use ChatGPT (which is trained using, among the ton of other copyrighted material, my blog), and as a result nobody reads my blog directly anymore, I can assure you that I will stop writing. If 99% of books are written by ChatGPT instead of actual authors, but based on books from actual authors, in such a way that readers can't really tell the difference and therefore go with ChatGPT (because it's cheaper), what do you think writers will do? If they can't live from writing books (because ChatGPT does not buy their books, and people now pay ChatGPT instead of buying the books), then they will just stop writing.

It is a risk that needs to be considered. And it's certainly not an engineering question.


I do not disagree with any of what you wrote. That is also an entirely different line of reasoning than your first argument.

That said, LLMs today cannot do this in a meaningful way. If an author cannot write a better book than ChatGPT, then that author would not be able to live off their writing anyway. And the authors that use ChatGPT to write a book, but still put the effort into fine-tuning it, will not be able to do this at scale. You also need someone to outline the plot and the twists and turns if it is to be a full-length book.

Let's assume that in 10 years, LLMs are at the point where you cannot distinguish between a well-written book by an author and one generated entirely by an AI. Suppose we have two authors, one who is long dead and whose works are public domain, and a young one who is just starting. An LLM trained on the author whose work is in the public domain can generate books that are just like the original works. But what if the young author writes in a similar way: is it now legal or illegal to generate the same content? It's impossible to know whether the young author's works have been used for training.

My take on it is that LLMs are pretty stupid. They cannot come up with new and novel things. So if a writer writes something that is different (i.e. new and novel), how do we protect that? We cannot prevent it from being used for training, so the next logical step is to protect it the same way as we protect technology with patents. But that comes with its own class of problems: say two people write the same way independently, then only one can have the right to do it. That is not the solution either.

I do not have the answer, but I am certain that trying to ban LLMs, or dictating what and how, is not the answer. Perhaps the authors who can write in a new and novel way, and know how to use AI, will proliferate because they embrace it.


You didn't respond to this

> Let's assume that in 10 years, LLMs are at the point where you cannot distinguish between a well-written book by an author and one generated entirely by an AI. Suppose we have two authors, one who is long dead and whose works are public domain, and a young one who is just starting. An LLM trained on the author whose work is in the public domain can generate books that are just like the original works. But what if the young author writes in a similar way: is it now legal or illegal to generate the same content? It's impossible to know whether the young author's works have been used for training.

which is central to my reasoning. You could go the same way as software patents, but that is not preferable in any way.

---

>> I am certain that trying to ban LLMs, or dictating what and how is not the answer.

> I wouldn't ban LLMs because of copyright issues, though I would let authors choose whether their IP can be used for training or not.

> Why not? Just say that using for training is considered derivative work, and that's it. Now copyright owners just have to update their license to allow for training if they want to, and that's solved. Of course, Big Tech makes less money from that scenario.

Big Tech can train on everything that is "legal", and malicious actors can fine-tune with a specific author's works and then generate books. You will not be able to detect that, and the malicious actor can claim to have written it themselves. Then we're back to the starting point.


> We cannot prevent it from being used for training, so the next logical step is to protect it the same way as we protect technology with patents.

Why not? Just say that using for training is considered derivative work, and that's it. Now copyright owners just have to update their license to allow for training if they want to, and that's solved. Of course, Big Tech makes less money from that scenario.

> I am certain that trying to ban LLMs, or dictating what and how is not the answer.

I wouldn't ban LLMs because of copyright issues, though I would let authors choose whether their IP can be used for training or not.

However, copyright is only one issue with LLMs. All the black hat use-cases are a whole other category of issues. And I am of the opinion that technology is not neutral: IMO, it is perfectly fine for a society to ban a technology if it believes that it is globally doing more harm than good.


> Say I make a tool where you can enter the title of a book, and get the full text of the book without paying for it.

Let me introduce you to the Library of Babel[1].

But you need to know the hex! you complain. But that's basically how all of the "AI outputs copyrighted works!!!" gimmicks work. They're impractical unless you know exactly what you want it to reproduce. You can't just casually pick up a copy of Harry Potter like you would in a real library.

So is the Library of Babel illegal? What's the difference?

[1]: https://libraryofbabel.info/browse.cgi
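For context on why "you need to know the hex": the Library of Babel site works because the page at a given hex address is computed by an invertible function, so the address of any text is effectively an encoding of the text itself. A toy sketch of that idea (a drastic simplification using plain hex encoding; the real site uses its own invertible scheme):

```python
# Toy Library of Babel: every possible page "exists" at some hex address,
# because the address is simply an invertible encoding of the page itself.
def page_at(hex_address):
    """Return the page of text 'stored' at this address."""
    return bytes.fromhex(hex_address).decode("utf-8")

def address_of(text):
    """Find the address where a given page can be looked up."""
    return text.encode("utf-8").hex()
```

Browsing such a library finds only gibberish; retrieving a specific work requires already possessing it, which is the point the comment makes about the "AI outputs copyrighted works" gimmicks.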


Curiously that is a pretty close description of a normal neighborhood library.


Pretty close, and yet so different.

Don't you agree that I am allowed to buy a physical book and then resell it, but that I am not allowed to buy a physical book, make copies of it, and sell those copies?


The doctrine of first sale would never be allowed today if it hadn't already been enshrined in law over 100 years ago. Stakeholders would insist that creators wouldn't create if people could fund community libraries and thereby shrink the book market. They'd demand the same thing for physical books that they've implemented for almost all ebooks: a pay-per-lend system. The effect would be a substantially higher cost for each loan-out. Congress would go along, just as they went along with a new copyright term of min(120, life+70) years.

In the U.S., in 1909 when the doctrine of first sale was codified, the copyright term was 28 years + 28 years upon renewal, and copyright had to be registered. That would never be accepted by copyright interests today, either.


Great example! I think your library wouldn't be illegal itself, but distributing it would be illegal.


There's also the much more complex part of other jurisdictions. Japan already has largely said everything is fair game in AI.



