
Why is it fair use (hashed out in court already) for Google to copy every book they can get hold of, store the full pages, and use them to create their n-gram data and presumably to train their AI, but not for this company?

If they had bought each book themselves would it be fair use? So this is only about the piracy?



> If they had bought each book themselves would it be fair use? So this is only about the piracy?

The earlier ruling covered exactly that question:

- Anthropic downloaded many books (from LibGen and elsewhere). This piracy is what the current case is about, and is unrelated to the training.

- Separately, Anthropic bought and scanned a million used books. They trained the AI on this data. This was ruled as fair use, and is not involved in the current case.


That's very interesting, because it totally makes sense legally, but the practical effect is ludicrously stupid. The law is effectively forcing companies to spend millions re-scanning the same books over and over for no reason. It'd be like if we had a law which stated "Before you can train an AI, you must light 1 million dollars on fire. After that you can do whatever you want." It serves no purpose but to waste societal resources on nothing.


> The law is effectively forcing companies to spend millions re-scanning the same books over and over for no reason

Would anyone agree if you replaced companies with people in that argument?

Why shouldn't a company follow the same rules as everyone else just because the scale at which they're doing it is so large?

I'd argue a company doing something like this should be forced to buy the books NEW and benefit the authors, and if they're found guilty of copyright infringement they should be punished at a scale a few orders of magnitude larger than an individual would be.

> Before you can train an AI, you must light 1 million dollars on fire

If I want to train an AI, I probably need to spend a larger part of my budget as an individual to do so than an org, should I be given the resources for free or severely discounted because I want to make money out of it?

I suppose one _could_ argue in favour of such a practice if it was going to benefit society as a whole, but is it?


I'm not saying companies should follow different rules than people, I'm saying the rules as written make no sense. This particular example just happens to make that fact more readily apparent due to the sheer scale of the needless waste involved.


I'm anti-DRM myself, but someone else could argue that the rules are partly doing their job: preventing companies from just gobbling up digital copies. It just happens that they have the resources to take advantage of a loophole by scanning the books in themselves.

The best solution I can come up with would be a digital library where one org, say the Internet Archive, has scanned everything once. Then they charge a licence fee to these orgs to ingest a copy, and part of the payment goes to the author. No big wastage, the information gets archived, and the orgs pay their share.


I’ve been thinking recently that an overhaul to the copyright system could solve this. Return to a very low default (10 years? 20?). Allow extensions, but a requirement for extension is submitting the work to a government-managed digital data set that is licensed out to people to use as training data for these sorts of systems (or anything else a massive digitized cataloged library could be useful for). Licensing is some nominal amount of money and the revenue from that is distributed to copyright holders who have submitted their works in proportion to the recency and volume of content (with some cap to avoid flooding the system with content just to get more payouts).

I’m sure there’s lots of unintended problems with this, but it does feel like a common base set of training data like this is exactly the sort of thing the government can and should do.


The law isn't forcing people to do this, economics are. Nothing about the law forces people to use physical books, just that they actually pay for the books instead of stealing them. The company thinks they can get away with this cheaper than negotiating for a digital copy of the book, so that's what they are doing.


It needs to hook into the existing legal book supply chain so that authors could potentially get compensated (I doubt they do for used book resale tho..).


> The law is effectively forcing companies to spend millions re-scanning the same books over and over for no reason.

Oh but the reason is that they're now making $3 billion/year, partially because of those books. I see an argument for the inefficiency behind having to rescan books that are already scanned, but not the cost. If there was a way to buy pre-scanned books from Google Books or whatever then I somewhat see where you're coming from.

I argue that there were positive effects of Anthropic having to buy and scan physical books:

* The choices people made choosing which physical books to buy and scan helped make Claude what it is. Personally I sense a difference between Claude and OpenAI and Gemini, and part of it comes down to the choices they made in training material. Sorry to go on and on, but how many choices here were made because it was a rainy day and the trains were down, so an intern went to bookstore A instead of bookstore B?

* While buying the books used didn't help the authors it helped the struggling bookstores selling their books. Literal dollars into the hands of local workers. When I fast forward to today and see how LLM companies are literally stealing the energy from the communities their data centers are based in, and polluting them with shitty power plants I can at least think of that as one positive outcome, even if it only happened once.

As far as the 7 million+ books Anthropic didn't pay for, their series B in 2022 brought in $580 million. They could have afforded those books.


Why would they need to "re-scan the same books over and over"? It's as simple as they can use the books to train their AI if they bought them.


Because company A needs to scan the books, then company B wants to train their AI so they need to scan the same books, then company C wants to train their AI so they need to scan the same books... etc.

It would be one thing if they were buying "used" digital copies of the books, but the fact that this is only legal with scanned physical copies makes it extremely wasteful.


I don't have any sympathy for big orgs, they can follow the same rules as the rest of us, and should be slapped even harder for this than an individual accused of the same thing, however, I'm curious, why can't they buy digital copies in the first place?

Is there some nuance to the law that allows them to scan/copy them if they're physical but not if they're digital?


Not every book is readily available as a digital copy. Things like textbooks, older technical books or just books that weren't too popular can be easier to source as physical books and scan destructively.

A lot of digital copies are also DRM'd to shit - to obtain raw text usable for AI training, you'd have to break DRM. Which isn't that hard, on a technical level - but DMCA exists.

DMCA is a shit law that should have been dismantled two decades ago - but as long as it's around, bypassing DRM on things you own can be illegal. Scanning sidesteps that.


I'm anti-DRM personally, but I suppose in this case we could argue it's serving its purpose, it's just that workarounds have been found in the form of scanning physical books.

If no physical copies existed and there were only DRMd digital copies of everything, the companies scanning books for AI training would be forced to work out some deal with the DRM-overlords to have it removed for their use. That (I think) would be a net benefit as hopefully the authors would get paid too.


You say you're anti-DRM but that sounds like a very pro-DRM stance to me.

Within the bounds of personal use, copyright holders should have no say over what people do with media after it is sold. That goes equally when the entity that buys the media is a company rather than a person. The entire reason DRM is a problem is that it subverts that principle using technical means.


That wasn't a stance, it was a hypothetical.

I'm totally in agreement with you, once we buy something, it should be ours to do with as we wish, company or person. DRM is the sketchy technical solution that doesn't really solve a technical purpose, it's easily broken, but serves a legal one; the act of breaking it is the legal issue.

I make my stance by avoiding buying DRMd content where possible; DRM free games and digital books, but it's not always possible to avoid, if I buy a BD, I can't rip it to my NAS without subverting the DRM.

Linux is also the only OS running in my home (on computers with screens and keyboards) so I mostly can't even legitimately play those DRMd things if I buy them, whether it's a BD, or Netflix in my web browser, or whatever else if I wanted to.

I'm very, very much anti-DRM.

EDIT: Typo


> Within the bounds of personal use, copyright holders should have no say over what people do with media after it is sold. That goes equally when the entity that buys the media is a company rather than a person.

How can a company be covered under personal use?


The problem is the transition from analog to digital. It is entirely legal for an entity to buy physical books and then loan them out, aka a library. That entity is free to charge money, or might even be part of the local government. But copyright is something that was invented. Why should it even exist in the first place? Of course we can't argue with the fact that the world we are in has copyright, but in countries where there is less copyright protection, it doesn't seem like the sky has fallen either. We want to promote science and the useful arts and incentivize creation. Copyright is supposed to be a temporary monopoly granted by the government before works fall into the public domain. Originally 14 years, with another 14 years if the author was alive. We should absolutely do what we can to encourage science and the arts, but Disney's managed to take it way further than it was originally specified for.


Given that "training AI on books you own" was ruled fair use, the "purpose" DRM is serving here is preventing fair use.

Which is the kind of thing you would expect it to do.


It would probably be legal with digital copies; it's just that book publishers have been very zealous in preventing the existence of a market for used digital books.

Copyright has been very silly in the digital realm from the beginning and is unlikely to get less unhinged from reality absent a major overhaul that makes it completely unrecognizable.


Digital media, in particular its resale, is the one good use case for blockchains that no one seems to be interested in (and don't provide me a link of some obscure project working on it; what can I buy with it?). Probably because it's useful for consumers but not for making money.


Or they could just pay the authors of the books directly for a license... Isn't that kinda like how a lot of software companies are compensated?


The cost might reduce the number of entities who can afford to do it, though, which would reduce the amount of abuse.


> Before you can train an AI, you must light 1 million dollars on fire.

I mean, demanding you pay money to the source of data in your quest to create a monopoly you are pretty much guaranteed to abuse later on while becoming filthy rich is not exactly unfair.


I mean I don't think the law precludes them legally purchasing ebooks.


"no reason"? Try telling that to the people who wrote the books.


They're not getting anything out of the small bump in resale prices well after they wrote and sold those books.


The process of repeatedly re-scanning used books benefits the authors? How?


The way I understand the case, yes. If Anthropic had just 1 copy of each book that would be permitted. Apparently they at some point bought copies of most books then shredded them. I'm not entirely sure how that makes any difference, but it was done after the piracy.


At some point this starts to feel like witchcraft spells.


Machine learning is 20% programming, 30% math and 50% demon summoning.


That may be so, but this is really more about the law.

So, y'know, probably more like 80% demon summoning.


> Apparently they at some point bought copies of most books then shredded them.

An awfully convenient explanation of what went down. Gives some good "dog ate my homework" vibes.

The double standard vs. Google etc. is of course despicable.


They would target Google too,

but Google's money is so big that they could just buy the entire thing plus the publisher.

Google sells books as a publisher on the Play Store, so it can get away with it.


With n-grams, it isn't just repeating the exact content of the book back to you the way the book itself does, which is what AI does; it's a statistical overview of the words from many books. Knowing what year use of the word "slouch" peaked in print isn't any kind of substitute for reading a book that uses the word.


“which is what AI does” what? The term “statistical overview of the words from many books” is a great way to describe an LLM. It’s not like the weights encode every book verbatim.


>repeating the exact content of the book back to you the way the book itself does, which is what AI does

Are you sure about that?

Maybe on a good day you'll get a paragraph, but getting a few pages equivalent to a "book preview"? No shot.


There's a 'first sale doctrine' in the US. Once you purchase a book, you can make a copy of it for yourself. Same with an album or game.

"Copy" right. Get it? Right to copy.


True. But you have to treat the thing you bought and all copies as an indivisible unit. If you sell or give away the thing or any of its copies, you have to include (or destroy) all other copies of that thing.


Yep. Google shouldn't have won that suit, just like Marvin Gaye should not have won his suit against Bruno Mars.

The courts don't always make rational decisions. They're dumb and corrupt.


Yes, only about the piracy.


N-grams don’t threaten anyone’s business. To have a lawsuit you need damages.



