
A few sentences? For most books I've seen, it's a few pages. Google will eventually block you from retrieving more pages from the same book. Using a VPN and a different account may get around one limit, but when I once experimented with multiple VPNs and browsers, I was able to retrieve a majority of a book's pages; after that, Google stopped showing me full previews of any of the remaining pages, no matter where the request came from.


It shows you a few pages if you are previewing the book (i.e. "look inside"). But if you are using search, it will show you where your query appears in the book, no matter what page it is on. This means you could theoretically search a book sentence by sentence, and it would eventually have shown you the entire book. I'm not claiming this is an efficient or practical way to game the system and read books, only that Google Books does contain the full copy of the book text and can reveal the contents of any passage. This is basically how Prosecraft works (at least from what I glean from the article) - it doesn't let you read a whole book, even though it may contain a representation of the full text.
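
To make that chaining idea concrete, here is a minimal sketch. snippet_search() is a hypothetical stand-in for whatever snippet-view search a book site exposes - it is NOT a real Google Books API call. The point is only that overlapping snippets can, in principle, be stitched back into continuous text:

    # Hypothetical sketch: reconstruct text by chaining snippet searches.
    # snippet_search() is a placeholder, NOT a real Google Books API.

    def snippet_search(book_id: str, query: str) -> str:
        """Stand-in: return the passage surrounding `query` in the book."""
        raise NotImplementedError("placeholder for an actual snippet search")

    def reconstruct(book_id: str, seed: str, max_steps: int = 10000) -> str:
        """Use the tail of each returned passage as the next query,
        stitching the overlapping snippets into continuous text."""
        text = seed
        for _ in range(max_steps):
            tail = " ".join(text.split()[-8:])  # last few words -> next query
            passage = snippet_search(book_id, tail)
            idx = passage.find(tail)
            if idx == -1:  # query not found verbatim: stop
                break
            new_part = passage[idx + len(tail):]
            if not new_part.strip():  # no new text revealed: done
                break
            text += new_part
        return text

In practice the real limits (rate limiting, snippet truncation, pages withheld from preview) are exactly what the parent comment describes, so this is an illustration of the mechanism, not a working tool.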


Sure. I really didn't mean that specifically about Prosecraft. But the article questions why authors are attacking Prosecraft "because it does no harm". My answer is that authors don't (and can't, really) draw that distinction on a per-case basis.

At this point, what they see is that LLMs trained on their copyrighted material are able to generate similar material precisely because that copyrighted material was used in the training (that is important!), and they see that they won't get paid for it. Of course they are scared, and they should be. And of course they will now start attacking everything that looks like it is using their copyrighted material as training data.

I really don't understand why the engineering world doesn't get this: LLMs have the potential to destroy people's jobs, it is not at all clear that this is legitimate (IMO, LLMs could not do what they do without the copyrighted material they were trained on, and are therefore derivatives of the original work), and those people are rightfully scared.


I think most people who have thought about it understand the impact AI models seem destined to have on writing (and digital 2D art, soon music, and later other things). Beyond writers and voice artists panicking, see the Hollywood strikes, for instance, and what's currently happening to digital artists in the corporate world.

Copyright is not the correct tool to address it. In the U.S., the basis for copyright law is a clause in the constitution which allows Congress to pass laws to secure for limited time creators' exclusive rights to their creations, for the purpose of "promot[ing] the progress of science and useful arts."

If AI is going to displace nearly all of that content, at least content that's mostly entertainment and less educational (novels, music, images, eventually movies), doesn't that imply that AI is better at generating useful entertainment than humans are? Aren't hypothetical laws restricting training of AI on past works to protect the market for future human-created works a direct violation of the constitution, if AI is better at generating "useful" entertainment (entertainment that people will buy) than humans are?

To whatever extent AI can eventually displace nonfiction content in the sciences and liberal arts, which is more indisputably useful, the same argument will apply; however, it applies to fiction first, because current AI models have more trouble with facts (hallucinations, for instance). The threat to non-fiction creativity is therefore more distant than the threat to fiction, which is not as bound to reality.


> I think most people who have thought about it understand the impact AI models seem destined to have on writing

Go back to the beginning of social media, and tell me that "most people who had thought about it had understood the impact social media would have on society". It is really not a given. And that is my criticism: we see from history that it is not straightforward to understand the impact of new technology, but we engineers keep making the same mistakes over and over again.

> Copyright is not the correct tool to address it

Maybe not, that's right. I don't think anyone disagrees. The issue - at least from the point of view of artists - is more that some people (including authors and artists) want the problem addressed, and others (including engineers) just want to make money with their new toy and don't care much about addressing the problem.

> doesn't that imply that AI is better at generating useful entertainment than humans are?

I don't think so, no. It is maybe economically more successful, but I think it is clear that what is good for the economy is not necessarily good for society.

> however, it applies to fiction content first

Well... that is ignoring all the black hat use-cases, going from phishing to political mass manipulation, I would say :-)


Why didn't you just say that, instead of posing a hypothetical about software that may itself contain full book text which can be used to display (in this case fair-use) passages to end users? lol

I think the disconnect between your point of view and mine is that I see "training an LLM on copyrighted text" as the same as a person reading copyrighted text, which is perfectly legal. And I see violating copyright as a person or LLM reproducing copyrighted work (illegal). But using other works as inspiration for something novel shouldn't be considered illegal, whether a person or an LLM produced the work. I would even be fine with literature being treated more like music, where reproducing the essence of a piece of work (i.e. it doesn't have to be a word-for-word reproduction) is considered a violation. But if the LLM creates something completely new, how is that a derivative work / infringement?


> Why didn't you just say that, instead of

Because I was answering a post that was talking about drawing the line for fair use. I just shared my view of how I see it. To me, OpenAI should be responsible for not giving copyrighted material to users if they are not allowed to distribute it. This means that they should be sued every single time someone manages to extract what is considered copyrighted material from their software. Because the authors never gave them that right.

Your Google Books example is different: the most obvious difference is that Google Books does not pretend it is their content: they clearly say "here is a passage of this book".

> I see "training an LLM on copyrighted text" the same as a person reading copyrighted text

Yes, I think that is the main discussion point around LLMs. My point is that machines are not humans, and therefore they should not be blindly treated like humans. We should think about the consequences of machines doing what they do, and decide whether that is legal or not in our society. Otherwise we would give machines the right to vote ("humans can vote, I don't see why machines couldn't").


> This means that OpenAI should be sued every single time someone manages to extract what is considered copyrighted material from their software.

I agree! If GPT-4 outputs copyrighted material beyond what is considered fair use (i.e. substantially more than what is provided by, say, Google Books), that is copyright infringement.

Indeed, it is about the output: making material available that people would otherwise have to pay for (or more precisely, enough of the copyrighted work that a person would have reason not to pay for the original, causing a material loss to the original author) - that is a fineable violation, IMO.

Something else to think about... I work in biotech and have published articles in scientific journals on cellular- and molecular-level disease sequelae (such articles are also protected by copyright). Models trained on scientific literature are now being used to discover novel drugs and disease-treatment pathways, and they are already outputting suggestions that seem very promising. Shall we also deny these models access to the full corpus of scientific literature? It would significantly handicap them to not have access to copyrighted scientific works. On one hand, some proportion of researchers would retain jobs that would otherwise have been outsourced to LLMs (perhaps even myself). On the other hand, some number of future patients will suffer or die from a disease that would otherwise have been cured.


That actually brings up another point: if you train LLMs on scientific papers, at least in some domains it will become much easier to churn out papers. I am not an academic, but my impression is that there are already a lot of low-quality papers out there. What if many more now get generated by LLMs? Won't that be a problem?


The low-quality problem with primary research publications is not the writing but poor experimental design, misrepresented experimental results, shoddy statistical analysis, and null results left in file cabinets. Summarizing research findings isn't the crux of the problem, so if anything, if an LLM can help an author produce a clearer and more concise write-up, I'd see it as a net benefit.



