Pirating and paying the fine is probably a hell of a lot cheaper than individually buying all these books. I'm not saying this is justified, but what would you have done in their situation?
Saying "they have the money" is not an argument. It's about the amount of effort needed to individually buy, scan, and process millions of pages. If that's already been done for you, why re-do it all?
The problem with this thinking is that hundreds of thousands of teachers who spent years writing great, useful books and sharing knowledge and wisdom probably won't sue a billion dollar company for stealing their work. What they'll likely do is stop writing altogether.
I'm against Anthropic stealing teachers' work and discouraging them from ever writing again. Some teachers are already saying this (though probably not in California).
> The problem with this thinking is that hundreds of thousands of teachers who spent years writing great, useful books and sharing knowledge and wisdom probably won't sue a billion dollar company for stealing their work. What they'll likely do is stop writing altogether.
I think this is a fantasy. My father cowrote a Springer book about physics. For the effort, he got like $400 and 6 author copies.
Now, you might say he got a bad deal (or the book was bad), but I don't think hundreds of thousands of authors do significantly better. The reality is, people overwhelmingly write because they want to, not because of money.
No, you don't see where I'm coming from. And my father was a university professor. I am certainly not opposed to authors being fairly remunerated for their work; that's why I brought up that example.
My point is, the controversy is not an AI corporation vs 10^5 ordinary teachers. It's a battle of two corporations, or business models, if you will. But regardless of the result, most book authors will continue to get screwed; maybe only the means will change. It will not stop them from writing, either. So I don't see any mass writers' protests coming, sorry.
I also don't think Anthropic's AI is going to be any less intelligent if, instead of reading a modern fiction book, it only reads a Wikipedia summary of it. Stories and myths are a human way of understanding the world; machines probably don't need them. And for non-fiction books, there really aren't that many irreplaceable high-profile authors out there. If it can't read, say, Feynman's Lectures on Physics, it can learn the same from hundreds of other physics textbooks. Maybe they are slightly worse organized, but why should a superintelligence care?
Training a generative model on a book is the mechanical equivalent of having a human read the book and learn from it. Is it stealing if a person reads the book and learns from it?
Well, right, but that's different from "can reproduce the original work". I "can" start typing out song lyrics but it doesn't mean that I stole the songs I've listened to.
That will be sad, although there will still be plenty of great people who will write books anyway.
When it comes to a lot of these teachers, I'll say: copyright works hand in hand with college and school course-book mandates. I've seen plenty of teachers making crazy money off students' backs due to these mandates.
A lot of the content taught in undergrad and school hasn't changed in decades or even centuries. I think we have all the books we'll ever need in certain subjects already, but copyright keeps enriching people who write new versions of these.
Yeah, people will still want to write. They might need new ways to monetize it. That being said, even if people still want to write, they may not consider it a viable path without those new monetization models.
They won't be needed anymore, once singularity is reached. This might be their thought process. This also exemplifies that the loathed caste system found in India is indeed in place in western societies.
There is no equality, and seemingly there are worker bees who can be exploited, and there are privileged ones, and of course there are the queens.
> They won't be needed anymore, once singularity is reached.
And it just so happens that that belief says they can burn whatever they want down because something in the future might happen that absolves them of those crimes.
Note: My definition of singularity isn't the one they use in San Francisco. It's the moment founders who stole the life's work of thousands of teachers finally go to prison, and their datacentres get seized.
When the rich and powerful face zero consequences for breaking laws and ignoring the social contracts that keep our society functioning, you wind up with extreme overcorrections. See Luigi.
How extreme is that, really? Not to justify murder: that is clearly bad. But "killing one man" is evidently something we, as a society, consider an "acceptable side-effect" when a corporation does it -- hell, you can kill thousands and get away scot-free if you're big enough.
Luigi was peanuts in comparison.
“THERE were two “Reigns of Terror,” if we would but remember it and consider it; the one wrought murder in hot passion, the other in heartless cold blood; the one lasted mere months, the other had lasted a thousand years; the one inflicted death upon ten thousand persons, the other upon a hundred millions; but our shudders are all for the “horrors” of the minor Terror, the momentary Terror, so to speak; whereas, what is the horror of swift death by the axe, compared with lifelong death from hunger, cold, insult, cruelty, and heart-break? What is swift death by lightning compared with death by slow fire at the stake? A city cemetery could contain the coffins filled by that brief Terror which we have all been so diligently taught to shiver at and mourn over; but all France could hardly contain the coffins filled by that older and real Terror—that unspeakably bitter and awful Terror which none of us has been taught to see in its vastness or pity as it deserves.”
> 150K per work is the maximum fine for willful infringement
No, it's not.
It's the maximum statutory damages for willful infringement, which this has not been adjudicated to be. It is not a fine; it's an alternative basis of recovery to actual damages plus the infringer's profits attributable to the infringement.
Of course, there's also a very wide range of statutory damages; the minimum (if it is not "innocent" infringement) is $750/work.
> 105B+ is more than Anthropic is worth on paper.
The actual amount of 7 million works times $150,000/work is $1.05 trillion, not $105 billion.
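A quick sanity check of that arithmetic, using the figures from the thread (7 million works at the $150,000 willful-infringement statutory maximum):

```python
works = 7_000_000
max_statutory_per_work = 150_000  # statutory maximum per work for willful infringement

total = works * max_statutory_per_work
print(f"${total:,}")  # $1,050,000,000,000 -- i.e. $1.05 trillion, not $105 billion
```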
> It's the maximum statutory damages for willful infringement, which this has not been adjudicated to be. It is not a fine; it's an alternative basis of recovery to actual damages plus the infringer's profits attributable to the infringement.
Yeah, you’re probably right, I’m not a lawyer. The point is that it doesn’t matter what number the law says they should pay, Anthropic can afford real lawyers and will therefore only pay a pittance, if anything.
I’m old enough to remember what the feds did to Aaron Swartz, and I don’t see what Anthropic did that was so different, ethically speaking.
Even if they don't qualify for willful infringement damages (let's say they have a good-faith belief their infringement was covered by fair use), the standard statutory damages for copyright infringement are $750-$30,000 per work.
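For scale, the same back-of-the-envelope math with that standard (non-willful) statutory range, applied to the 7-million-works figure mentioned elsewhere in the thread:

```python
works = 7_000_000
low, high = 750, 30_000  # standard statutory damages range per work

print(f"${works * low:,} to ${works * high:,}")
# $5,250,000,000 to $210,000,000,000 -- i.e. $5.25B to $210B
```

Even the statutory minimum across that many works would dwarf most judgments; of course, the actual number of works at issue in any trial may be far smaller.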
Isn't "pirating" a felony with jail time, though? That's what I remember from the FBI warning I had to see at the beginning of every DVD I bought (but not "pirated" ones).
A court just ruled on Anthropic and said an LLM response wasn't a form of counterfeiting (ie, essentially selling pirate books on the black market). Although tbf that is the most radical interpretation still being put forward by the lawyers of publishers like NYTimes, despite the obvious flaws.
What someone at Anthropic did was download libgen once; then Anthropic figured "wait a minute, isn't that illegal?", so instead they went and bought 7 million books for real and cut them up to scan them.
Turns out this doesn't quite mitigate downloading them first. (Though frankly, I'm very much against people having to buy 7 million books when someone has already scanned them)
Just downloading them is of course cheaper, but it is worth pointing out that, as the article states, they did also buy legitimate copies of millions of books. (This includes all the books involved in the lawsuit.) Based on the judgement itself, Anthropic appears to train only on the books legitimately acquired. Used books are quite cheap, after all, and can be bought in bulk.
Buying a book is not license to re-sell that content for your own profit. I can't buy a copy of your book, make a million Xeroxes of it and sell those. The license you get when you buy a book is for a single use, not a license to do what ever you want with the contents of that book.
Yes, of course! In this case, the judge identified three separate instances of copying: (1) downloading books without authorisation to add to their internal library, (2) scanning legitimately purchased books to add to their internal library, and (3) taking data from their internal library for the purposes of training LLMs. The purchasing part is only relevant for (2) — there the judge ruled that this is fair use. This makes a lot of sense to me, since no additional copies were created (they destroyed the physical books after scanning), so this is just a single use, as you say. The judge also ruled that (3) is fair use, but for a different reason. (They declined to decide whether (1) is fair use at this point, deferring to a later trial.)
If you wanted to be legit with zero chance of going to court, you would contact the publisher and ask to pay for a license to access their catalog for training, and negotiate from that point.
This is what every company using media does (think Spotify or Netflix, but also journals, ad agencies, ...). I don't know why people on HN are giving AI companies a pass for this kind of behavior.
> I don't know why people on HN are giving AI companies a pass for this kind of behavior.
As mentioned in The Fucking Article, there's a legal difference between training an AI that largely doesn't repeat things verbatim (a la Anthropic) and redistributing media wholesale (a la Spotify, Netflix, journals, ad agencies).
The paradigm is that teachers will teach life skills like public speaking and entrepreneurship. Book smarts that can be more effectively taught by AI will be, once schools catch up.
I agree. The world is changing fast, but we need to make the transition less painful for everyone. The way things are going now only benefits big tech.
This is not about paying for a single copy. It would still be wrong even if they have bought every single one of those books.
It is a form of plagiarism. The model will use someone else's idea without proper attribution.
Legally speaking, we don't know that yet. Early signs are pointing at judges allowing this kind of crap because it's almost impossible for most authors to point out what part of the generated slop was originally theirs.
But should the purchase be like a personal license? Or like a commercial license that costs way more?
Because, for example, if you buy a movie on disc, that's a personal license: you can watch it yourself at home. But you can't play it at a large public venue that sells tickets to watch it. You need a different, more expensive license to make money off the content in a larger capacity like that.
The judge appears to disagree with you on this. They found that training and selling an LLM are fair use, based on the fact that it is exceedingly transformative, and that the copyright holders are not entitled to any profits thereof due to copyright. (They also did get paid — Anthropic acquired millions of books legally, including all of the authors in this complaint. This would not retroactively absolve them of legal fault for past infringements, of course.)
The trial is scheduled for December 2025. That's when a jury will decide how much Anthropic owes for copying and storing over seven million pirated books.
Yes, that would be an interesting trial. But it is only about six books, and all claims regarding Claude have been dismissed already. So only the internal copies remain, and there the theory for them being infringing is somewhat convoluted: you have to argue that they are not just for purposes of training (which was ruled fair use), and award damages even though these other purposes never materialised (since by now, they have legal copies of those books). I can see it, but I would not count on there being a trial.
The fallacy in the 'fair use' logic is that a person acquires a book and learns from it, but a machine incorporates the text. Copyright does not allow one to create a derivative work without permission. Only when the result of the transformation resembles the original work could it be said to be subject to copyright. Don't regard either of those legal issues as set in concrete yet.
Both a human and a machine learn from it. You can design an LLM that doesn’t spit back the entire text after annealing. It just learns the essence like a human.
Morally maybe, but AFAIK machines "learning" and creating creative works on their own is not recognized legally, at least certainly not the same way as for people.
And the crazy thing is that might be cheaper when you consider the alternative is to have your lawyers negotiate with the lawyers for the publishing companies for the right to use the works as training data. Not only is it many many billable hours just to draw up the contract, but you can be sure that many companies would either not play ball or set extremely high rates. Finally, if the publishing companies did bring a suit against Anthropic they might be asked to prove each case of infringement, basically to show that a specific work was used in training, which might be difficult since you can't reverse a model to get the inputs. When you're a billion dollar company it's much easier to get the courts to take your side. This isn't like the music companies suing teenagers who had a Kazaa account.