Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Fear of AI just killed a useful tool (techdirt.com)
298 points by donohoe on Aug 9, 2023 | hide | past | favorite | 297 comments


I found this article frustratingly vague on how prosecraft.io actually worked. As far as I can tell, the author scraped the web for books, including in-copyright books. Then he analyzed it with techniques based on "classical" natural language processing techniques, rather than transformers or deep learning. He appears to have retained the books he scraped for future analysis. The site itself seems to use only snippets.

However, the apology [0] says that the creator did not "intend" to participate in AI that can "create zero-effort impersonations of artists." I'm not sure if the wording is unintentionally vague, or if there is some way his project could be used in that way.

For what it's worth, the Computational Story Lab's hendometer [1] seems to have largely out-of-copyright books from Project Gutenberg, plus the Harry Potter series.

[0]: https://blog.shaxpir.com/taking-down-prosecraft-io-37e189797...

[1]: https://hedonometer.org/books/v3/863/

Edit: Apparently he was working on an LLM project. https://twitter.com/stealcase/status/1688721685585809408. It's unclear whether he was planning to use the books he scraped (although as @stealcase points out, GPT-Neox itself was trained on books that were pirated).


If he says he didn't do something, the pitchfork mob will simply tell each other that he is lying. They will do this in the most confused and twisted way possible, driven by lack of understanding of what was happening combined with a need to drive outrage and thereby advertise their work.

If he says he didn't intend to do that thing, this is still compatible with a later update that he didn't do that thing, but immediately dampens the outrage machine. The reader who knows nothing about either side relaxes -- "No need for me to get worked up, because we won". Conveniently, saying he didn't intend to do the thing is also compatible with a later reveal that he was doing the thing (perhaps for later release, since he wasn't clearly doing the thing here).

Therefore, regardless of whether he was doing what he was accused of doing, this is the lowest energy response, and probably the default unless this was the hill he wanted to defend.


The best response, for us all collectively, is to always ignore everyone's opinion online. There is zero value in anything on reddit, twitter, facebook, the media these days.

Just ignore it. All of it. Outrage or not.

I see downvotes, but I mean it. You know who you listen to? Your friends. Your neighbours. Your local community. You listen to PEOPLE, not sockpuppets. You listen to legitimate human beings, not AI generated blather, or curated news stories, or groups working together to generate hate, outrage, to stoke anger, upset.

You listen to actual, real PEOPLE.

You want to go to reddit? Twitter? Anything? Fine. But treat it as 100% fiction, pure entertainment, and never let it affect YOU.


> Your neighbours. Your local community. You listen to PEOPLE, not sockpuppets. You listen to legitimate human beings, not AI generated blather, or curated news stories, or groups working together to generate hate, outrage, to stoke anger, upset.

You either have a significantly better social circle than I do or are glossing over a bunch of nuance.

Some of my family back east have been getting their brains rotted by fox news and the likes for literal decades now.

Some of those family members were also shaped by insular groups (their 'community') with fucked up thinking

As for the friends who I choose to associate with today, I indeed like and prefer their opinions (probably why we've self selected to be friends), but I assure you they have nonzero exposure to online communication.

I agree Twitter is shit and always has been due to the lack of context, but I often prefer reddit to real life conversations because I can actually take my time to think about my thoughts and even dig up sources. It also exposes me to more diverse viewpoints than I'd get solely in my physical circle of friends.

Now, I don't really use reddit after the most recent shenanigans, but here I am back on hackernews like I was slashdot before reddit. I don't think I'd ever want to abstain completely from online communications or forums.


I said nothing of abstainment. In fact, I am posting!

Ignoring online comments, especially criticism, does not mean disregardment. And note, context is important. Note what I am replying to.

Simply put, on a medium where one person can sockpuppet appear as 1000, where one person can rally 1000 useful idiots with one disingenuous post, one cannot care what is said.

Ignore it.

We already have 30 year old adults, trying to discuss political nuance online, not realising that they may be piled on by a dozen 8 year olds. People presume the person behind the text is real, the person is their approximate age, or at least an adult, that the person is debating in good faith.

None of this is necessarily true, and in any large group of responses, the above chicanery is happening.

No one should care what a bunch of "people" on Twitter say.


I'm genuinely curious, why do you post here if you have this mindset?


This is not twitter, with its tiny little snippets of text, which are useless for meaningful communication, and its culture which incites groupthink. This is not Reddit, with its hostile, hate filled voting system, with its peer pressure laden culture. This is not Facebook, literally designed to drive hate, and anger, and upset, to increase engagement.

This is Hackernews. It's not perfect, but it's far more palatable. And it's certainly not like any of the above.

Lastly, my advice still applies. When I detect hate here, I ignore it. When I detect peer pressure, I don't care.


Your two comments here could each fit in a Tweet and Reddit has basically the same voting mechanism as HN. There is also plenty of groupthink here too. That was basically the reason behind my comment to see what distinction you are making. People here seem to often deride social media while pretending HN is not social media.

HN is effectively just Reddit with only a few subreddits and moderation that is stronger than average, but within the range you will find on certain subreddits.


And an ink quill and a ball point pen, both write ink.


Well, there is hate and then there is criticism, and lots of people confuse the two, to the point where they just label any critics as "haters". Hate is worth totally disregarding, but criticism, as long as it's genuine and done in good faith, it's probably valuable.


One of the problems is, if you're .. I don't know, say in your neighbourhood. Or maybe at a small local gathering. You do something which deserves criticism, or mild, but polite rebuke.

You hear it a few times.

Online, you may hear 100k of the same, and most of the politeness and empathy is gone. It's all about scale. About reach. And about the more of that there is, the less empathy there is.


I still remember the days when I saw Reddit as simple entertainment. in the last 10 years its reach has only grown. I identify with you sentiment, but have also become entrenched deeper inside the mindset of the online. The top comment tells the truth. The votes count. A judgement based on groupthink instead of individual discussion.

Many here don't agree, but I wonder if their attention to the beautiful divide between people has been obscured by groupthink.

4Chan /b/ is such a toxic community that I never spent long there without remembering the words that reflect its value at the top. >The stories and information posted here are artistic works of fiction and falsehood. Only a fool would take anything posted here as fact.


> However, the apology [0] says that the creator did not "intend" to participate in AI that can "create zero-effort impersonations of artists." I'm not sure if the wording is unintentionally vague, or if there is some way his project could be used in that way.

This seems FUDdy. "Intend" isn't in the apology at all, and the wording that is there says clearly that generative AI came after prosecraft, so there's no way the tool could be used for it.

> It's unclear whether he was planning to use the books he scraped

This also seems unwarranted. The tweet about fine-tuning an LLM came 5-6 years after the guy made prosecraft; why suggest they might involve the same dataset?


I apologize for the quotes around intend. I wrote it without, then I forgot it was a paraphrase and added them back again. Unfortunately, I cannot edit my comment to fix that.

I do think “intend” is a reasonable paraphrase of “never wanted to.”

(Edited to add) I don’t think prosecraft was a finished project and he was definitely still working on his other tool for writers that incorporates some of the same tools.

> The tweet about fine-tuning an LLM came 5-6 years after the guy made prosecraft; why suggest they might involve the same dataset?

The reason being that he had mentioned he was planning to use the scraped books for future analysis.


So you can edit your OP and this comment but can’t edit “intend”?


HN comments cannot be edited once two hours have past.


Even Facebook's Llama was trained on books3, a dump of pirated books.


It's so mind blowing to me that it made it past corporate legal. I don't get what defense there could be besides "lmao try and stop me, nerds"


"...but, but, that's illegal." Jenkins huffed out in an almost whisper.

"Jenkins, take out your wallet, and place it on the table. If you speak, I'll fire you on the spot. Do it." the CEO spat.

Jenkins trembled with rage but reached his right hand deftly into the inner folds of his suit, and produced his billfold. He waved it a bit and set it in front of him.

"Good, now push the package forward, and keep silent." the CEO growled at Jenkins.

Jenkins pushed his wallet out further from himself, and with a last shove pushed the wallet beyond his fingertips. Jenkins wiggled his fingers and sat back, raising his eyebrows at his bosses boss.

"Harold!" CEO Barcliff barked. "Take the package to logistics."

"Right away, sir." Harold nodded, retrieved the package and left the room.

"What the hell Barcliff!" Jenkins exploded.

"I suggest you call the card companies, and the DMV, for replacements." Barcliff stated in a deadpan low growl.

"Screw you Barcliff, call him back. I want my wallet. That's my life in there man!" Jenkins now yell-whined.

"Connely, Mirasu, Fender, Scotts, did any of you see this wallet Jenkins is whining about?" Barcliff almost sneered.

"No sir, I observed Jenkins deliver a package to Harold to be taken to logistics." Scotts spoke out before the others could speak. Nods all around though.

"You see Jenkins, illegal is only what you can prove. You can't prove shit. Just like they can't prove shit. Now get out of here before I fire your ass. Don't forget to say hi to the wife for me." Barcliff purred.


Fair use is basically the whole defense.


Fair use only applies to what you publish, so your new work B can be seen as not to infringe on the copyright of the original work A.

What about the part where a Meta employee used Meta computers to download copies of the books?


I am a bit confused about what's so outrageous about this tool. It seems that both the book authors, and some of the people in the discussion here, conflate rudimentary statistics about a book (number of words of certain kind) with the latest wave of generative AI. They are very different in both what value they provide, and what risk they pose to book authors.

The tool that book authors got outraged about only provides basic metrics, not dissimilar from other metrics such as "page count", and can't be used to produce new content which could deprive the book authors from revenue.


If you read through the angry Twitter thread it's clear that almost everyone thinks that either a) the site is a pirate site that lets you download books or b) that the site lets you generate works in the style of an author. Neither of which is true of course.

There are a handful (like < 3 people) who seem to understand what the site actually does who were still angry because the creator seems to have pirated the books. I actually don't know about the legality of something like that. Surely providing pirated books is illegal, but IDK if acquiring pirated books actually is.

I think it's clear though that most of the outrage would still be there even if the author had purchased each and every book.


> I actually don't know about the legality of something like that.

Techdirt's analysis of the legality seems correct to me. TL;DR is that it seems legal.


[flagged]


Having now looked up what the word priapism means, I'm now trying to imagine what moral priapism would be, and I can't think of a definition that makes sense. What is it? Is there a simpler word you can use?


GPT4 sez: "This sentence uses figurative language, so it doesn't refer to literal meanings. "Fear" is a stressful emotion induced by perceived danger or threat. "Acute" suggests a sudden onset or intensity. "Moral priapism," a more abstract concept, suggests an abnormally extended or excessive fixated morality, much like the medical condition priapism refers to an unwanted, enduring erection.

Overall, the sentence means that fear can lead to an exaggerated or obsessive moral response. This implies that when someone is afraid, they might stick rigidly to their moral code or make moral judgments more extremely or rigidly than they would in a state of calmness."

seems fair

Edit: My gloss of the summary: "Fear creates a hardon for scapegoats among the pitchfork wielders."


Permanently aroused to look for any injustice.


If you want to do this kind of thing, let authors opt-in (or publishers).

Yes, it will take effort and probably go slow, but if the tool is really useful and amazing, it should be doable.

I suspect the authors are put-off by a couple things:

- the text of the works scanned seems like it may be from pirated sources. That poisons the project, no matter what it does with the scans, for many authors.

- the use of these scans in a commercial product

The article itself is clueless… it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.


> If you want to do this kind of thing, let authors opt-in (or publishers).

If it's fair use, why should you have to do that? The same copyright law protecting author's ownership rights over their art also provide "fair use" to other people. Someone may disagree with current fair use laws (and I suspect many outraged here do not), but that's a broader issue not related to this particular tool. It just 100% seems like misdirected AI outrage.

> the text of the works scanned seems like it may be from pirated sources.

Do you have a source for this? I didn't see that mentioned in the article.


> Do you have a source for this? I didn't see that mentioned in the article.

The person who runs prosecraft says "I looked to the internet for more text that I could analyze, and I used web crawlers to find more books." [0]

I'm just inferring, but if they had, say, purchased each of these books, or borrowed them from the library, or only sourced from sites that ensure the copyright is satisfied, then they might have mentioned it.

(FWIW, the blog post says the other source for the 25K works was their personal library, so I'm assuming the bulk of the 25K come from the internet, though I know some people have prodigious personal libraries.)

[0] https://blog.shaxpir.com/taking-down-prosecraft-io-37e189797...


> If it's fair use, why should you have to do that?

You may not be legally required to do that, but it can be an excellent move that benefits you nonetheless.

Much like how Weird Al isn't legally required to get permission to make a parody of a popular song, but he does so anyway.

But in this case, I don't think you even need to invoke Fair Use. I think what he did simply isn't a copyright violation in the first place.

In reality, the legality of this was never the issue anyway. The issue was that doing this made the authors angry, and the dev didn't want that.


> also provide "fair use" to other people

"How much of someone else's work can I use without getting permission?

Under the fair use doctrine of the U.S. copyright statute, it is permissible to use limited portions of a work including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports."

https://www.copyright.gov/help/faq/faq-fairuse.html

Limited portions, not the entire work.


Quote from the OP:

> The Gizmodo article has a ridiculously wrong “fair use” analysis, saying “Fair Use does not, by any stretch of the imagination, allow you to use an author’s entire copyrighted work without permission as a part of a data training program that feeds into your own ‘AI algorithm.’” Except… it almost certainly does? Again, we’ve gone through this with the Google Book scanning case, and the courts said that you can absolutely do that because it’s transformative.

Relevant article: https://www.techdirt.com/2013/11/14/google-gets-total-victor...


That's ludicrous. It's counting words in a book. You can't copyright facts and that is all the tool is doing. Pages that are reproduced are only excerpts which falls squarely under fair use.

It's no different than you checking out the book from the library and counting all the words.


Copyright pertains to reproduction of the work. The statistics this tool provided are not reproductions at all. It did also provide quotes, which were not extensive and certainly not the entire work.


Limited potions can be reproduced In the derived work you are distributing. Summaries and statistics of the work are almost certainly fair use.


> The article itself is clueless… it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.

Going off of some of the tweets about this that initially whipped up the outrage about this…it’s not like they were making a nuanced case about their concerns, they were basically just stomping their feet and shouting.


That's twitter generally.

If your engagement only reaches the level of twitter, you aren't really engaging at all.


So as long as that's all the engagement there is, we're free to ignore it and carry on, correct?


I would think so. If someone is shouting & stomping their feet in the public town square about my project, but I never go anywhere near the town square anyway, I don’t think I’m going to shutdown my project. It’s just too bad the person who created this tool happened to walk through the town square.


I think you're fishing for a way to dismiss the concerns of the authors without understanding or addressing them, which is pointless.


> If you want to do this kind of thing, let authors opt-in (or publishers).

"This kind of thing" is factual information about the book, such as page or word count, ly-adverb count, etc. Small snippets, something permissible under copyright law today, that were heavily editorialized and commented on were displayed.

To suggest that counting words and pages is something that should not be allowed is silly.

> The article itself is clueless…

Says the person making stuff up to force a narrative.

The person doing this had the rights to do this, and was very clearly within his rights to do this under copyright law. Counting words is not a crime.


> The article itself is clueless… it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.

The authors quotes speak for themselves. They very clearly and ignorantly claimed that this was an "AI training project" when it was nothing of the sort.


> it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.

https://twitter.com/scumbelievable/status/168915466478730444...

So the two authors who are gloating about "killed that stupid fuckin AI thing" - I'm supposed to be engaged with their concerns ? Please.


Statistical analysis is only useful if you have enough data to analyse, so there is in fact a threshold of number of books to cross before the tool can even really exist. If you read his post, the initial goal was to get stats about typical word count, typical amount of passive speech, etc. Requiring opt-in for these broad statistics, through outrage only since this project is CLEARLY legal in the United States, means that tools like this will never exist. Which seems net bad to me.

If you are saying it should be opt-in only for the pages analyzing specific books, like the instigator of this outrage screen-shotted, well that seems to fall squarely into the critical analysis bucket, so that is also quite ridiculous.

I understand some folks being unhappy that a portion of the works were pirated, but it seems like most of the outraged would be outraged even if he personally purchased each and every ebook.

Also, if you read through the Twitter thread a lot of the authors (not 100%, but a LOT) are doing a really great job portraying themselves as "stoopid AI-fearful luddites". Many of them think the site is somehow like ChatGPT and they don't bother to dig any deeper, or really at all.


Yeah, the article represents the voice of the authors in two tweets, from authors not apparently notable enough to have a wikipedia page. One I couldn't even find on Goodreads. It's obvious there's more to this than just the tweets presented. The article is unhelpful in this regard.


Jeff VanderMeer is not notable enough?


Personally, I have no idea who he is except some loud prick on twitter.


While I would agree in theory that a project like this would be best with opt-in, in reality that would just not work. Publishers would never opt-in to it, if they even respond to your requests at all.


Then don't do it?

Or, if you do it, do it privately and don't share it on the internet?

I'm not sure why this is a difficult idea; if asking for something and getting permission to do it is so difficult that 'would just not work. Publishers would never opt-in to it'

...then, it seems really obvious that even if you want to do it, technically can do it and you could maybe make a legal argument to doing it doesn't violate any laws...

...why would you do it? Why would you post about doing it?

Come on, that's literally being a selfish dick; spitting in people's faces and waving a 'too bad, you can't sue me' flag.

There are so many things, so many mannnny things that you could work on, why would you choose to pick something that you knew would upset people and you knew you wouldn't get permission to do if you asked?


Why ask permission to do something that doesn’t require permission? I see no more reason why an author should be upset about someone counting the words in their book & assigning sentiment than a builder should get upset about someone counting the # of bricks in a building and assigning subtle color shade differences to them. Neither the author nor the builder has lost anything by it.


Should I need the publisher's permission to write a review of a book? Personally, I find that idea abhorrent. This sounds like an interesting project, unambiguously protected under fair use doctrine, both as analysis and as transformative, and the authors got their knickers in a twist because they are scared of that which they do not understand.


Because copyright in fact is not that strict (Google Books does far more) and you don’t need to respect someone’s boundaries when they don’t have a legal right to those boundaries. Why should we sympathize with people who want far stricter control over the cultural commons?


Authors are not demigods, they don’t have a right to control the use of their works, only the reproduction.

When you publish a book you “consent” to the fact that people are going to take it apart, talk about it, review it, quote from it, and yes run statistics on it. If an author doesn’t want that to happen then they shouldn’t publish a book. Just keep it private, only distribute it to people you trust after they sign an NDA.

As far as anyone knows, no piracy has occurred. In the US you are allowed to scan books, index them, and post excerpts - it’s called Google Books and there was a big case that affirmed that it is legal. Downloading a book from a pirate website for the purpose of indexing by a computer program is not piracy, you have simply outsourced the scanning stage to someone else. It is only an issue if you download from some p2p protocol (such as a torrent) that also uploads and shares the book.


Because the authors were AI-fearful luddites. From "Book" to "Program that judges books" lies well beyond any argument that the use of the derivative work could supersede the original. It's such clear cut transformative use that the authors come across as grossly misinformed about copyright law as a whole.

Perhaps there is an argument for generative AI possibly superseding the original, in that people might start asking an AI to generate them stories "in the style of x" instead of buying the author's books, but this wasn't that. It was just some fun data analysis of books.


Summary: prosecraft.io counted word occurrences and presented statistics about them. I don't think you even need fair use for this, because this is something you obviously are allowed to do, without any permissions. This is not generative AI, this is old school statistics.

And then it sometimes presented a page worth of quoted text from a book. Which should fall under fair use.

https://blog.shaxpir.com/taking-down-prosecraft-io-37e189797...


> I don't think you even need fair use for this

You shouldn't, at least for posting basic statistics. They're facts, not copyrightable.


> counted word occurrences and presented statistics about them. I don't think you even need fair use for this, because this is something you obviously are allowed to do, without any permissions

You're pretty much describing exactly what an LLM "learns" about text. I agree that it should obviously fall under fair use, but as the author of this article found out, there are quite a few who (very vocally) disagree.


I think there is a big difference in terms of data recovery though. You can't take a compression algorithm, for example, and claim that its "just some statistical analysis" when it can reproduce the original perfectly. Heck, even if it can reproduce it approximately, that's a lot different than what we see in this particular example, where the data could not be used to reproduce a text at all.


generation of related text vs analysis of human understandable facts is very different in the mind of most people.

I think that using an LLM to get insights on the text should be ok, it's the generation part that scares them. probably rightly so.


Hrm. It seems like the authors are caught up in things like "vividness" score and the "sentiment analysis" of the text; I guess because it's loosely related to AI?

But it seems like a bulk of the stats collected are things that I would find really useful. I've probably asked myself, "how many words are in this book" on 10+ separate occasions, both as a reader and as a writer.

It also seems like there were also counts of things like adjectives, verbs, adverbs, passive verbs, etc -- stats that I might want to know about a novel.

The bulk of the service seems rather "boring" and non-AI. Unfortunate that the whole thing was taken down because of a few features. Hopefully it'll come back.


For this particular example, the tool doesn't seem like it's a big deal. It just analyzes works for data. I'm not sure how this would be any different from a literary critic doing the same thing manually.

In general, though, I think artists would be less hostile to technological innovations if the people imploring them to "figure out how to embrace the technology rather than fear it" weren't actively trying to destroy their livelihoods, almost always without the slightest interest in helping them figure out the new economic situation. The attitude is, "It's the reality now, deal with it," all while enjoying the job security and high salaries of tech jobs. You can see the same attitude displayed when it comes to piracy: "too bad, deal with it, I have a good job, I don't care if you don't anymore."

This stuff would be received far better by the creative community if AI companies were to say, establish an artist sponsorship program, push for UBI, or otherwise show that they care even a tiny bit about the people they're making redundant.


I agree with you. There’s a pattern that I see a lot, of having:

1. large powerful players doing something not entirely helpful;

2. victims of that protesting that change vehemently; all that in vain because the players are powerful and have sheltered themselves from criticism, usually via lobbying;

3. regulatory capture or protests go after a smaller player, which is widely advertised to accuse 2. of going too far — even when the problem in 1. is still entirely there, and now ignored.

It’s definitely the case with globalization (large conglomerate benefit, people protest, and a small artisan who started selling abroad is featured being victimized by tariffs), fossil fuel (large oil extractor, climate advocate, farmer seeing fertilizer prize go up), immigration, American cultural hegemony, car dominance over cities, etc.

That pattern allows larger players still doing harm to wash their morals. I feel like we need better antibodies to say: No, this does not absolve them.


> if AI companies were to say, establish an artist sponsorship program, push for UBI

Sam Altman, for all his faults, is actually a massive proponent of UBI. I mean, that was one of the claimed objectives of Worldcoin (though he advocates for UBI in general: https://thewalrus.ca/will-universal-basic-income-save-us-fro... )


I am defining "push for UBI" as "actually do something to pressure the government" and not just state that a for profit business you've established is trying to accomplish that goal.


I will admit that I am mildly confused by this outrage, but it is X/twitter so the standards are different.

All that said, I remember doing basic text analysis in college and then sentiment analysis in my MBA class.. is the concern out there, because of how source material was acquired?


Fully agree.

Not an artist myself, but this basic assumption in tech that you can just take somebody's shit without informing them, without permission, without compensation, without basic due diligence, and then go do whatever the hell you want with it needs to stop.

For the artists' sake but also for tech's sake. This model can't work, it's a complete dead-end that will wipe out livelihoods and culture.

But I can ensure you artists can/will be equally hypocritical themselves. Surely they've pirated themselves, removed paywalls from articles, blocked ads, borrowed the neighbor's Netflix account.


I think it applies to many technologies other than generative AI. How many devs actually think about ethics nowadays? I think it's all lost in the big companies they work for, behind the excuse that "it is not their job to figure out how their work is being used".

Interestingly, I think most devs would think twice before being paid for designing a missile. But somehow they don't really seem to think about the impact of work that is not obviously a weapon. Social network, Stable Diffusion, ChatGPT, SpaceX... everything disruptive has the potential to be very bad (I see a lot more harmful use-cases for ChatGPT than legit ones, but maybe that's just me). But somehow engineers seem to believe that it is not their problem.


Absolutely, and I think the recent Oppenheimer movie was an excellent take on this exact subject. At some point, you don't get to throw up your hands and say, "technology is just neutral” and absolve yourself of any responsibility for what you’ve put into the world.


My summary of the case: Someone did statistical analysis of a bunch of texts and created a tool that evaluates your text according to the developed model. Writers accused him of plagiarizing/using the content of their works.


As an aside, this would be completely legal in Japan, as classification and statistical analysis are protected as fair-use.

I wonder if similar language exists in other copyright systems, but I would imagine it is likely the opposite...


This was unambiguously fair use under American copyright law, too.


Are generative models also covered by that clause?

Because one thing about the generative models is that you could in theory get the model to recite copyrighted work, word by word.


> one thing about the generative models is that you could in theory get the model to recite copyrighted work, word by word.

Only if it's so badly trained as to be considered broken. It's not fair or smart to judge an entire method (LLMs) by some broken examples. I wouldn't claim "all cars leak oil" just because I saw a broke-down beater one day.


Something that we need to learn is that these brief outbreaks on social media burn themselves out pretty quickly. Everyone shouts for a bit and then moves onto to the next bit of manufactured outrage.


I agree, people get angry at something for a day and then the timeline tells them to be angry at the next thing.

Always feel bad for people who cave to the mob, usually if the mob is yelling at you you’re on the right track


I’m with the artists on this one. Our obsession with converting everything into input for an algorithm that spits out an ill-defined number (what the hell is “vividness”?) needs to stop.

We already tried this with human communication and gave birth to the dystopian nightmare that is social media, why keep repeating our mistakes?


I kind of don't understand the issue - IANAL, so I'm not going to delve on the legality of things, but I think making automated book recommendations better is absolutely fits the bill for 'transformative use' - as in book recommendations are in no way a substitute product for books themselves.

And personally, I think book recommendations are an absolutely underserved market, if I liked a book, having the ability to find more like it would be an absolute godsend for connecting authors with people who would be interested in their works, resulting in much more potential sales for them.

I can't count how many times have I discovered an absolutely great book on Amazon with like 50 reviews accidentally, as well as other, objectively less recommendable books that have nevertheless made an impression on me.

Discovering these books is sort of a hobby of mine, and is the exact kind of activity an LLM would be a great help with.

Going further, if there was an LLM that could be asked for book recommendations for your particular tastes, it could also identify markets for books not yet written, and would give a hint to authors on what sort of books to write to find an audience.


> And personally, I think book recommendations are an absolutely underserved market

I haven't read about the industry in years but isn't it the case that the job of "book recommendations" is essentially the publishers job? They unironically try to sell you more than a book. An algorithm would threaten their worth.

(There are, of course, other useful functions like publishing and the irreplaceable editors, but neither require the capital strength of marketing.)


Stop with the "this is good, I want more" Skinner-box model of happiness. Try some serendipity instead of being led by a generic algorithm.

I discovered absolutely great books by moving slowly along the shelves of a library or a bookshop.

And you need to read bad books to understand the great ones.


The whole point of this site is for people to express their opinions, and torginus took the time to write a thoughtful comment anout LLMs and how they might help both authors and readers.

As for your point about serendipity, torginus never said that he didn't wander book stores and libraries looking for books he wouldn't have been previously exposed to.

Based on the post, I'm sure they understood the basics of reading a variety of books, both good and bad -- there is no need to get judgemental.


Are you seriously telling someone else how they should enjoy something?


That's silly. Humans review books all the time, using very similar words. Where's the outrage over that?

This is manufactured, stretched, overhyped objections. I believe it's all as the OP suggests, because the word AI is in there. Not because anything illegal or immoral is going on. In fact it's a terribly useful tool, and once the mob cools off it'll likely return.


You are exactly modeling the chauvinistic Silicon Valley attitude that is causing the outrage in the general population to begin with.

“Our algorithms are pretty much the same as human art criticism, so put down the pitchforks you unenlightened scum” is up there with telling them to eat (a Stable Diffusion generated picture of) cake.


> You are exactly modeling the chauvinistic Silicon Valley attitude that is causing the outrage in the general population to begin with.

Just like the writers he talked to and got positive feedback? Everybody not agreeing with you represents "chauvinistic SV attitude"?


(Edited)

No, he didn’t say anything about them. People side against their interests all the time, finding a few writers that like this is trivial. Are those people the majority opinion on this or are we just trying to prove how wonderful this technology is?


I'm assuming you read the article.

Let's recap:

> I launched the prosecraft website in the summer of 2017, and I started showing it off to authors at writers conferences. The response was universally positive, and I incorporated the prosecraft analytic tools into the Shaxpir desktop application [...]

And he goes on mentioning that some authors even reached out to him to get their books added.

Unless you are accusing him of lying or unreasonably overstating the response he got ("universally positive"), for which I really don't see any indication, then a statement like "finding a few writers that like this is trivial" is not a good faith engagement with this topic/conversation.


There’s no way to qualify the sample size of writers based on his claims so within the bubble of his experience I’m sure it’s correct but not useful to base an argument on that writers at large are onboard with this and as for good faith engagement your response to parent…

“Everybody not agreeing with you represents "chauvinistic SV attitude"?”

…wasn’t very good faith either as it’s unclear whether the writers share the same belief as some tech people that AI and humans doing stuff are the same and use that idea to further a pro AI agenda as opposed to them just finding a useful tool to incorporate into their workflow regardless of the underlying technology or politics. Your response assumed the former and paints parent poster as wrong based on your assumption. Some writers liking the tool, just like some artists liking stable diffusion, doesn’t invalidate the original criticism or imply their ideology.

Indeed my experience jives with what he said. Many AI people I’ve seen comment are very much “adapt or die” when it comes to AI technology, suggesting that writers/artists must (even if begrudgingly) use these tools to stay competitive and see many datasets as fair game even when their authors are against its inclusion in said datasets, such as the author of this article.


There's no outrage in the general population. Just of a minority that is just as small as silicon valley.


Counting the ratio of nouns to verbs in a novel is an algorithm and I think it's like one of the most basic examples of what the thing in the article does, if I understood it correctly.

But I guess there would also be people up in arms about this.


Do you even know the meaning of chauvinism? Because this is literally the opposite of chauvinism. They aren't stating their view is superior, they just want it to exist.


People starved while it was suggested they eat cake. Not sure how that relates - are the rights around art crit not the same as AI crit?


Of course they never were suggested to eat cake in reality. 1) The actual French quote was to brioche, a type of bread and mistranslated as cake because brioche wasn't common in English-speaking countries 2) Was never an actual suggestion -- the French philosopher Rousseau was making a sarcastic remark suggesting that if the people didn't have bread, they should eat brioche (a fancier kind of bread) instead. But for some reason in pop culture this was falsely transformed into an actual suggestion by Marie Antoinette


Glad to see someone else mention the falsehood of the original quote.

It's actually really fitting to see that (mis-)quote used in the context of this outrage since from reading through the original vitriolic Twitter thread it's clear that many of the most outraged are incorrect about what the product does.


IT doesn't really matter to the idiom - what is understood by the reference is what it means.


or talking about human "just another matrix", so how dare them don't want to offer their artwork for new models.

this even didn't contain how developers decided to let people lose job. people is angry because they worried about losing job.


Or AI-generated Soylent.


> That's silly. Humans review books all the time, using very similar words. Where's the outrage over that?

Easy: humans are not machines. "X does it all the time, so I should be able to do it" is never a valid conclusion. It depends on the situation.

> In fact it's a terribly useful tool, and once the mob cools off it'll likely return.

Maybe this tool in particular does not "abuse" the books. Maybe this tool in particular is terribly useful. But you can't blame authors and artists for taking a stance against those new algorithms that provably have the potential to automatically "steal" from their work. You can believe that asking ChatGPT to "write a novel in the style of X" is not abusing the copyright, that's fine. And the authors can answer that they fear it has the potential to break their source of revenue to a point where they won't want to publish anything anymore. And they are entitled to it. And maybe someday we come up with licenses that prevent the use as training data (how in the world could one conclude today that "it is most definitely fair use", given that this is a very new way of using IP material?).


That was the accusation, and it was misplaced here. So we agree, this is a smear campaign in this case, not a sensible reaction to a reasonable application of machine algorithms.

The idea that counting adverbs is steal their work to the point they won't want to publish anymore is clearly FUD. As my remark made clear.


> The idea that counting adverbs is steal their work to the point they won't want to publish anymore is clearly FUD.

I did not mean that, I am genuinely not sure if you rephrased my point to make it sound wrong or if you missed it.

My point was that, IMO, it does not matter to the other whether counting adverbs is stealing their work or not. Probably if you counted them manually they would be fine (and most likely they were fine before generative AI).

What matters to them is that generative AI is trained from their copyrighted material, and they fear it (I would, too).

The day people stop reading my blog because they can just ask ChatGPT and will get something generated (partly) from my material without any kind of attribution, I can promise you I will stop my blog.


This project was not generative AI. Comments are saying this project, which is not at all similar to generative ai, seemed to be okay. But you keep replying to say essentially “but if it was generative ai then authors have a legitimate reason to be angry”.

There is no need to shoehorn that debate into this particular situation, and I see no merit in defending authors that had a knee jerk reaction to this project on the grounds that they have reasonable fears about other types of projects.


I think it is not completely off topic. Here is how I see it:

Engineers tend to globally think that LLMs are not really a problem for copyright holders. At least those who develop LLMs pretty clearly don't give a damn. And on top of that, it is in their interest to not be constrained by copyrights.

If this is my feeling (that engineers globally don't care about copyright holders), then it seems reasonable to me that non-engineers could feel the same. That sounds fair, doesn't it?

So those people start speaking up when they see a situation where they feel like "it is happening". And because they don't really know the technology, it is hard for them to know if this particular case is a problem or not. And they can't really trust engineers to tell them, because engineers built LLMs in the first place, and really it does not seem like they care about copyright holders.

Finally, engineers see this reaction from authors, and instead of trying to understand where they come from, they dismiss their opinion. Which probably will reinforce the feeling that engineers don't remotely understand the concerns of those people, and keep building their AI-powered laundering machines. Again, engineers working on those technologies in big companies have absolutely no interest in even considering that it is a problem. Because they get a big salary to help their big company get more profitable, even if it kills many jobs and is a net loss for society (because they benefit from that).


To rephrase in my own understanding of what you wrote:

1) Some engineers (or more broadly, software developers) do not respect copyright

2) Therefore you reasonably are skeptical of projects related to material under copyright.

3) It is not always obvious if a project is respectful of copyright.

Now, applying these #1,#2,#3 you believe they justify the outrage for this particular project.

I disagree, because outrage combined with a lack of understanding (#3) is pretty much my definition of a knee-jerk reaction and vastly counterproductive to the interests of copyright holders because it will make the dismissiveness you predict a self-fulfilling prophecy.


> you believe they justify the outrage for this particular project.

No, I believe it explains it.

> it will make the dismissiveness you predict a self-fulfilling prophecy.

That's the thing: both parties need to listen to each other. The problem here is not this particular project, but the fact that we are not addressing the bigger concern which is LLMs.

IMHO, it is completely useless to try to solve this particular case, because it will happen over and over again. We need to address the LLM issue.


If you want to be a pitchfork mob against generative AI at least understand whether AI is generative or not? Seems like a reasonably low bar. This was non-generative AI, it didn't produce content it output metrics and labelled some existing content.


What makes you think that I don't understand whether AI is generative or not? What I said was that for artists who are complaining about their copyright being abused, it does not matter. 10 years ago they were not complaining, because AIs looking like ChatGPT (to users who see it as a black box) did not exist (or were not remotely as powerful).

And I understand that. It is not their job to learn how the black box works. What they see is that "machine learning models" (which they probably call "AI" now), which are complete black boxes to them (and that's justified: engineers who train them also don't know exactly what they do, but rather test their model on some dataset and judge it from there). And those black boxes are being trained from their copyrighted work and have the potential to generate a ton of money which they will never see.

You can go and say "you guys should learn how the technology works instead of complaining", but let's be honest: probably you are not an expert in AI yourself, and anyway why would the artists have to care? It is a totally legit question that they have: "Why can engineers take my copyrighted work, run it through an algorithm that does stuff no algorithm has done in history at a scale never seen before, make money out of it, and not even consider that maybe they are abusing my IP?".

Before dismissing the artists, you should try to understand their point of view.


I would disagree. Just because you don't quite understand something, doesn't mean your concerns are not worth consideration - consider the recent zoom TOS issue. I doubt that many of us have a deep understanding of how that data's being used, or the internal guidelines that zoom follows for its data use, and most people aren't lawyers specializing in IP law to know exactly how the law would treat zoom if they were to accidentally (or "accidentally") leak IP. We just see that they are putting in a clause in their TOS to allow themselves to do so, remember our own heuristics of how LLM have behaved in the past, and understandably start raising questions. For all we know, zoom's AI might be something constrained to a framework which doesn't allow for such data leaks to occur, or it's generative capabilities might be constrained in some other way. They're just demanding legal permission to do so, but that still rubs a lot of us the wrong way. Our concerns are still justified, even if Zoom never actually touches AI. Artists lack as concrete heuristics as the technical crowd. But they still have concerns that need addressing, and those concerns about the effects of AI still should be considered and respected. If the details of the situation don't match their concerns, care should be taken to explain how they don't match to the people in question, in a way that isn't looking down on them (admittedly, trying to be the calm voice is often a waste of time on the internet) That said, if you were to make an informational video which succintly summarizes the technical details that are relevant to artists, it might become sufficiently popular to influence debate.


(to clarify, this is a response to skjoldr's comment)


> It is not their job to learn how the black box works

If you have not learned the basics of how something works, you have no right for your opinion on it to be considered valid. Period.

Invalid opinions do harm to democracy and endanger our way of life.


> you have no right for your opinion on it to be considered valid. Period.

That is so wrong it is actually dangerous. Do I need to understand how a nuclear bomb works for my opinion on it to be considered valid? Obviously not. I only need to understand the consequences of it. It does not matter at all how it works, if I am against the fact that it will kill a whole lot of people.

> Invalid opinions do harm to democracy and endanger our way of life.

And engineers have done much, much more to endanger most living animals (including humans) than authors and artists: technology is the reason for the mass extinction we are currently living, and the problems that are coming with climate change. Maybe it's important to start thinking about the consequences of what you do, not only the technicalities of how you do it. And maybe it's high time you start listening to people who are able to think about the consequences of what you do (maybe they understand that better than you do, ever thought of that?), even if they don't know how to do it.


You can of course have any opinion you want. But this is not just about the authors having an opinion. It's about them starting a harassment campaign based on just faulty facts and making no attempt at verifying them.

If we work from the nuclear bomb analogy, you certainly don't need to be a nuclear physicist to protest nuclear bombs. You just need to have some a reasonably correct high level understanding of the impact of a nuclear bomb. But that's not what is happening here. This is more like storming the Belgian embassy to stop Belgium from using their nuclear arsenal to trigger a chain reaction in the atmosphere: totally detached from reality in every aspect.

As far as I can tell from your messages on this, you think that the harassment was entirely justified. Is that correct?


> totally detached from reality in every aspect.

I don't think it is totally detached from reality. I believe that engineers are generally pretty bad at realizing the impact technology will have on society. There are many concerns with generative AI in general: it can potentially "break the Internet" (by finishing breaking search engines which already struggle with SEO), or maybe democracy, who knows? Copyright is one such problem.

> you think that the harassment was entirely justified. Is that correct?

I honestly don't know how far it went. What I saw in the article is a few authors who wrote online that they wanted their book removed from that software. Not sure if it is closer to harassment or to lobbying.

What I see, however, is many comments of engineers who don't see the problem with copyright and who don't seem to understand why non-engineers may be against this technology, or why one would even think about forbidding a technology ("but technology is neutral"). My point is just that those engineers should maybe take a step back and try to reflect on that "technology is neutral" belief.


You cannot know the consequences of something if you do not know how it works. Case in point: nuclear reactors. If you do not know how they work, what are their potential dangers, and how they are mitigated by smart design, you do not have a moral right to protest against them. Simple as. Understanding the risks and consequences equals understanding the system in question. Always. This also applies to nuclear weapons, if you do not understand MAD and how they keep other powers in check, and you never had a true threat brief that would explain what exactly nukes are a deterrent against, you just aren't entitled to an opinion on them. Especially one as simple as "it can kill people so I don't want them". This is an invalid opinion, sorry.


If you literally have no idea what a nuclear bomb does, i.e. don't know that it explodes, releases massive amounts of heat, or can kill many tens of thousands of people at once, then no your opinion should NOT be considered valid.

Understanding the consequences of something is a PART of how it works. Since you understand that it can kill a whole lot of people then I'd say you have passed the incredibly low bar.

In this case most of the authors do not understand the consequences of the tool, they think it will generate convincing sound text that sounds like them or that it is serving pirated copies of their books (sourcing that from the original Twitter thread that I unfortunately read a lot of).

This doesn't seem like the thread to debate whether technology is a good thing, but I can't help but call this assertion ridiculous. Technology is responsible for almost every single good thing in the world today.


> In this case most of the authors do not understand the consequences of the tool

Because you do? That's my point: engineers believe that because they have some understanding of how machine learning works (and in my experience, usually it is very limited...), they can conclude that they understand the consequences of it. Simple example: the Facebook "like" function, that was supposed to be positive ("oh nice, I got likes"), and actually increases addiction and is mostly negative ("oh no, why did I not get likes?"). Clearly those who implemented the first likes had not realized what consequences they would have.

> Technology is responsible for almost every single good thing in the world today.

If you have a very limited view of the world, I guess it could be. I like trees, flowers, bees, birds, mountains, snow. Can you tell me which ones come from technology? Let me help you: most of them are threatened to die in this century because of technology. For most living species, every single improvement technology is bad news. To the point where it is now globally becoming bad news for humans, because it's quite likely that we will get into global instability, wars, and famines in the next few decades because of technology. Think about it when we start having billions climate refugees, and think about how you were dismissing opinions contradicting your beliefs based on the fact that you understand some implementation detail.

But let's even ignore the fact that the next few decades will most likely get pretty bad for us. It is true that right now, we live longer, we have more food (and obesity problems), and we can cure many diseases that we could not in the past. Does that mean we are happier? Happier than whom? Vikings? Ancient romans? Ancient greeks? That question seems closer to history and philosophy... why does your opinion count then? Are you historian/philosopher?


Because fair use allows transformation and the output of their algorithm looks nothing like the input of the copyrighted work? For generative models its more complicated because generative models can actually reproduce large sections of a copyrighted work so transformation is a bit less clear.


> Because fair use allows transformation and the output of their algorithm looks nothing like the input of the copyrighted work?

I feel like you miss the point of a law. You seem to read the law, and say "well, the law says X, new technology Y is compatible with it, so that's legal, everyone is happy". But that is wrong. The law reflects the society we want. Do we want a society that completely kills creative work because Big Tech found a loophole to launder their IP? I guess we all agree that we don't. It is not clear if LLM is that loophole, I agree. But you seriously have to take a step back and think about that. What if it does? Then we may have to redefine the meaning of "fair use".

Maybe this particular software was not a danger for those authors. But they don't know that. And given that most engineers talking about LLMs don't seem to remotely understand how one could be worried about it, I understand that they start speaking up wherever they can't. Because clearly it does not seem like those who build those systems give a damn about copyright holders.


I think the actual issue is nuanced and complicated. I think it's fairly clear the tool in question that was non-generative AI is the kind of thing we want to allow under fair use. Whether we want to allow generative AI is more complex, I'd lean towards requiring a license because of non-deterministic duplication. Fair use is an important part of copyright law and we should be very cautious about eroding it. For example, I like Green Day's transformation of the scream icon and think it was substantially different enough that it should be allowed. The courts agreed under current transformation laws but if we weaken protection against transformation we likely reverse the ruling of cases like that as well.


They… quantify the number of adverbs and voices? I'm sorry, but have you ever read either a book or a review of one?


Yes, and they have 1000 times the spoilers and quotations and judgemental attitude of ... a summary of adverbs and voices.

So yes, I understand what a review is, thanks for the put-down, that certainly added something to the conversation.

I think we are in agreement - doing statistical analysis on written works is entirely a lesser thing than simple review, and is harmless.


"It's a mistake." OK, you could be right.

"Needs to stop." OK, you could be right on that one too. I don't think you are, but that's not the point.

Neither of those adds up to "it's currently illegal". (Whether it's actually illegal probably depends on the details of how he did what he did.)

Further, neither of those things adds up to "the howling mob should attack him until he stops". (Even if the "attacks" are purely online.) I am against "attack him with outrage dialed all the way up to 11 without actually understanding what his tool is and does". I am also against giving in to the outrage - it just shows the mob that baseless outrage attacks work.

You think it needs to stop? Fine. Persuade him that it needs to stop, and therefore that he should stop. Convince him - not with a mob screaming in outrage, but with reason.


As someone who has published two novels: The outrage over this site was stupid, ignorant and a demonstration of a witch-hunt that will help nobody.


If you’re an author of books intended for children, your texts are likely already being quantified to produce a reading level difficulty score:

https://metametricsinc.com/parents-and-students/lexile-for-p...


Honestly, this is the really offensive part of the article. Who cares about whether or not it's legal, the idea that it's, in any way, shape, or form, useful is bafflingly laughable.

Not everything can be meaningfully quantified. Not everything needs to be.


Certainly something interesting is bound to come out of quantifying things? "Hm, this three act structure thing seems to work, I wonder why." "Children doesn't seem to understand texts which include these words, I wonder why."

Patterns rarely show themselves before we investigate.


> Certainly something interesting is bound to come out of quantifying things?

In science they call this trap P-hacking. Even data "scientists" know to be wary of overfitting. We're really good at finding patterns, but few of them actually mean anything.


>> Certainly something interesting is bound to come out of quantifying things?

> In science they call this trap P-hacking. Even data "scientists" know to be wary of overfitting. We're really good at finding patterns, but few of them actually mean anything.

Quantifying things is not always p-hacking. When people do experiments on novel materials or structures they quantify the data, make readings and record them, and then look for patterns. For example measuring the electronic properties of a new novel nano structure or molecule.

When I think of p-hacking[1] I think of using the same static data and doing various data analysis over and over again until something potentially interesting is found and ignoring the risks of false positives as you do so.

[1] https://en.wikipedia.org/wiki/Data_dredging


> Not everything can be meaningfully quantified. Not everything needs to be.

Ok, so who decides what's OK to analyze or not? Is there some obvious moral line I fail to see, that everyone would immediately agree on?

It seems the project was about analyzing books, not about producing new books. How is that hurting the authors?


What will hurt artists is, when in 10 years, all publishers are demanding that the vividness score (TM) be at least a 95% “because that’s what drives sales”.

Which is what will happen if the authors don’t proactively stop it from happening. Look at how the music industry has evolved over time.


How his this different from all the vampire novels that hit the shelf after the success of Twilight? Publishers alway preferred the money makers, just the measure changed.

Nowadays writers can at least publish their books without the need of publishers and I think some like the help of the bad Silicon valley stuff that made writing, publishing and interacting with the readers easier.

I'm on your site if it's about automatic content creation and style copying but text analysis is not the real danger. Especially when the usefulness of such statistics isn't even given.


> publish their books without the need of publishers

Except those are very likely to be metoo vampire novels. And lately LLM generated.

I'd move that on the contrary, the role of the publisher as a curator will only become more important in the future.


But publishers will have to deal with a lot more content thanks to LLMs.


Or it could help me find terser books I like, people will still have preferences and if the author tries to pander to only the largest market segment I'd argue that's on them.


I think it’s much more likely you would get the book equivalent of crap SEO sites spammed out to satisfy numerical measures of quality.


How is this different to the current process, other than feedback is slower (if forthcoming at all) and less specific?


> How is this different to the current process, other than feedback is slower (if forthcoming at all) and less specific?

Let me rephrase your question: "how is it different to the current process, other than <the fact that it is different>?" :-). I would say that the answer lies in the question.


Sounds as though your view of the AI is purely positive, in that case. That's fair enough. The answer for other people may well not lie in the question (e.g. for all the people who don't like this development), but it did for you!


Sorry I did not understand that :-).

My point was that it is different: when humans read a book, they don't train a machine learning model. They can't read as many books as a machine, at the same speed, and they can't remember nearly as much as what a machine can.

Humans and computers are fundamentally different, and it matters. You can't conclude that because it works for one, it will fork for the other.


> Sorry I did not understand that :-)

You seemed to be saying that the differences I listed (quicker and more specific feedback) were the only differences. Those are both positive.

I was saying that some people may think there are negative differences as well.


Right. Yeah I did not express myself clearly, sorry :). You were saying "how is it different other than X and Y?", and I wanted to say that X and Y are already enough for me to consider them different.

I am actually on the side that LLMs are a big problem for copyright, and I don't want my code and blog posts to be used in their training dataset without my consent. To me, at this scale, it's not fair use. IMO it's a bit like if Facebook said that it is fair use to leverage metadata about their users, because "someone who sees you in a public space talking to a friend knows that you are talking with that person, and it is the same for Facebook on social media". My problem is not that Facebook knows that I sent a message to a friend now, but rather that they know who writes to whom and when, at scale.

Similarly my problem is not that somebody could read my blog post, learn from it, and write another blog post. My problem is that LLMs automatically train on all written material they want on the Internet, at scale, and without acknowledging that all that material has a lot of value (and is copyrighted).

I think fair use should somehow consider the scale.


the difference is that an machine analysis is necessarily limited and can't account for all the factors that make a text interesting. so it is possible that this analysis rejects texts that would not be rejected by a human.

it is objective but potentially biased. and it could even be discriminating if the input for this tool isn't diverse enough. but these are the issues that can go wrong with any use of technology, and we have seen many examples of that happening. however i don't think that is problematic if writers use it to analyse their own texts in comparison. it is however a serious issue if publishers use it to decide what to accept


Again, I don't particularly care about whether this is allowed to exist, I'm just here to laugh at the mindset that lead to it being created. But sure, I can see this being used in harmful ways.

> It seems the project was about analyzing books, not about producing new books. How is that hurting the authors?

"Vivid books are really in this year, we're gonna have to ask that you aim for a Vividness(tm) of 85 or above."

"US books have 15% more adjectives, clearly this is proof of our superior detail-oriented work ethic!"

"What does the rise in Emotion(tm) have to say about the decline of society?"


So if I understand you correctly, you're saying that we should not create "metrics" for anything because said metrics could be misused by clueless people?


The analysis is cool. The problematic thing is what would have happened next, if this tool turned out to be any good.

Publishers rejecting manuscripts because "this years trend shows customers are looking for vividness in the 70+ percentile, your book is only at 55". Everything becoming the same style. If you thought Hemingway, Joyce or Nabokov had it bad with rejections, there'd be zero chance for actual innovative writing to break through the walls of The Algorithm.


Joyce should have had more rejections, but that’s just my personal opinion


> Not everything can be meaningfully quantified.

Sure, but written words _can_ be meaningfully quantified. We have been doing that for thousands of years. Starting with numerology and other mystical/religious beliefs, poem metrics, stylometry, crypto analysis, stroke counting, to name a few.

> Not everything needs to be.

Why not?


> Honestly, this is the really offensive part of the article.

I would argue that "Offensive" is either hyperbolic or you've used the wrong word.

> the idea that it's, in any way, shape, or form, useful is bafflingly laughable.

I don't know if it's useful because I never tried it. I might harbour my doubts but I'd like to find out. This is how I approach new things.


If you don't find it useful, don't use it. But why get outraged about something that others find useful? It's clearly a tool that other writers were positive and excited about. Why not let them have it? If you don't find those quantifications meaningful, so be it. You don't need to use it. Why force your opinion on others?


Simple. Just allow an opt-out for Authors or Publishers. Then only interested parties will comprise of and make use of the service, like you want.


As the article stated, there is nothing either legally or morally wrong with what the site did, and many authors found it useful. Let us know when you come up with an actual counterargument based on reason instead of an appeal to emotion fallacy.


I probably agree, but how does this have any relevance to copyright? If the tool is bad but otherwise legal then it should just fail on its lack of merit.


There is a difference between a statistical analysis of a text to categorize by certain words or word groups and training an AI model to generate texts on the data used for training.

The later creates massive competition to human writers, the former is just an information for potential readers.


Both the former and latter are information for writers. Neither create massive competition for writers (not that there is any law against creating competition), just FUD and better tools for writers.


Things like GPT already create competition for authors even using their names.

https://news.ycombinator.com/item?id=37042561

Pure text statistics won't do the same.


Wrt. your link, the same thing could have happen to this author if these spam books contained complete gibberish (and someone listed it on Amazon/goodreads using the authors name). This isn't legitimate competition (i.e. books written by LLMs that rival the quality and style of the actual author). This is a failure of the selling platform to QC the books they are selling.


LLMs make the scams better just like they will make spam better.

If it's gibberish you know you got scammed, LLM texts look convincing so you don't know for sure.


I agree LLMs can make better spam. But good spam isn't real competition. It's not like anyone is debating whether they should buy the latest book from their favorite author or the latest book from their favorite author's clone LLM (which is known to have written some solid books). Again, this is an issue that needs to be solved by vendors (it seems like all they need is a system where authors get a copyright to their name, and can curate the list of titles published under their name).


That's partly true for established authors,but how many readers will get fooled and think those books are real. See these 11 Harry Potter books from China

https://11points.com/11-amazing-fake-harry-potter-books-writ...

It's worse for new authors, they disappear between all the AI authors.

Publishers and readers will have to search a bigger haystack to find the needle.


I agree that it's probably not that useful, but to actually take offense? The outrage seems to misunderstand the law and the technology. If you think the numbers offer no meaning, then just ignore them. People produce bad tools every day and the world still turns.


I'm disappointed he went for "vividness" and not novelty. Judging text based on how uncommon the n-gram is/how much it differs from an LLM could be interesting for sure.

The better an LLM can complete your joke the worse it is, for instance. Important to have a good Letterman-MacDonald quotient.


> I’m with the artists on this one. Our obsession with converting everything into input for an algorithm that spits out an ill-defined number (what the hell is “vividness”?) needs to stop.

Usefulness is immaterial here.

Is he allowed to do this? Yes.

What's wrong with presenting a page count and word count, for example?


> I’m with the artists on this one. Our obsession with converting everything into input for an algorithm that spits out an ill-defined number (what the hell is “vividness”?) needs to stop.

Anyone who is with the artists should pass a law. Moral outrage is not law.


And promptly smack face first into the First Amendment. There is a reason they are going with moral outrage. Because they know they don't have the right.


The guy who wrote code is also an artist, and he is allowed to publish his book reports.


Well... We still can not agree if...

Technology has to be protected from dumb people, or is it worth protecting dumb people from technology....


If anything, it’s the smart people that need to be protected from technology, because Silicon Valley is obsessed with pulling them down into a Harrison Bergeron nightmare where they’re absorbed into the same modeled probability distribution as the rest of the population to better sell them ads (outliers are bad for profits.)


What do you even mean by this comment? Have you considered the possibility that people are smart in ways that you are not considering, rather than just labeling it “dumb”?


Have you considered ... in ways that you are not considering...?

I am pretty confident they haven't. Sounds like you've set yourself up for a reverse "true scotsman" here ;)


Nice catch, thanks for pointing it out.


For this very particular project I agree the reaction seems exaggerated, even though it does walk the thin line of copyright infringement. But as it happens, it rides the wave of all other AI project which started small then headed we know where. Because once your book is in the database of company X you can bet safe money they will take it and continue their "analysis" as much further as they like because hey you did not complain - and I believe you must defend yourself in order to keep the copyright.


The response to this undermines my ability to take the “backlash” against AI as anything other than innumerate, mob idiocy. It’s hard to prevent myself from being negatively polarized against the backlash when people in the backlash defend outrage against obviously innocuous things like this.


We repeat the mistakes because in the short term, someone finds it profitable, hence a prisoner's dilemma type situation.

If an AI tool was killed, I consider it a victory. That's because even if there are some small useful applications of AI, AI on the whole will certainly put most creatives out of business.

Instead, I propose the following: anyone who is interested in preventing AI from taking over their craft should join me in a coalition of ban AI from their own business. By placing a notice that your work is "100% AI FREE", you are doing something akin to the fair-trade/sustainably sourced sticker on chocolate or other food products: you are letting consumers know that your work was made by a human, so that they can support you.

If enough people get in on this, and pledge to support only those creators who don't use AI, then we can make AI an unprofitable venture and hopefully kill it forever!

I already put a 100% AI FREE badge on my YouTube channel, which means that I will never use AI for writing scripts, editing videos, producing images, etc. Moreover, I also pledge to support other creators who pledge never to use AI, by buying their products over others!


Without trying to sound flippant - what do you define AI as? Things like autofocus in your video cameras or automatic gain control or noise cancellation in your audio pipeline could also be considered AI. Do you remove those too? What about the AI recommendation algorithm built into YouTube - how do you reconcile being AI free while still using that platform?


Yes, you are right, and I advocate the following: a detailed look at each of these technologies.

However, for practical purposes, a direct definition that encompasses every situation is not necessary, but can evolve. For now, I think we do not need a precise definition and we can start with the following: AI such as ChatGPT, LLMs, image generation tools like DALL-E and ohters, should be restricted.

As for YouTube's algorithm, I agree it is also dangerous. For now, I have restricted the use of direct content generation algorithms, in other words, all content can reasonably said to be human generated in terms of writing, composition, etc.

In other words: AI that makes any creative decision in making content should be banned. Other algorithms should be carefully debated.


Banning automation technology because it could put workers out of business.. isn't that the textbook definition of a luddite? Also, are you saying no creative people are using these technologies? It's not all "enter 1 prompt, get image, call it a day", they are tools that can be and often are part of complex chain. Creatives that don't want to use these tools are probably going to be superseded by creatives who do.

What's your take on generative fill in Photoshop?


I am a luddite. What's wrong with that? I don't believe that all technology is bad, but that AI has reached a stage sufficient so that the order of magnitude of the changes it can affect are too damaging for humanity. I do believe that AI has become advanced enough to pose such a risk to us.

Some creative people are using these technologies, and while it is quite human guided NOW, at some point, the guidance that humans put into it will lessen. That's not to say that AI will ever produce a work like Dostoevsky --- maybe it won't, but it WILL be enough to eliminate most creative jobs, and reduce them to being at most being supervised by people who don't have much of a passion for creative works. And that's a shame, because it will remove the passion of creativity from society.

Generative fill: I don't use it, and that's part of my personal ban. It goes too far. I only use traditionl editing techniques in my photography that works with basically what is there.

Yes, you can say that photography has always been about manipulation, but basically, I have a personal line that I believe I can define sufficiently well, that is far behind the line of AI.


What about the youtube speech to text AI that creates automatic CC transcripts for the hearing impaired? What about the AI that translates transcripts and comments into other languages? (translation certainly makes use of creativity since not every word maps 1:1)


One can always say that AI has some positive uses like CC transcripts. And of course, I can't prevent the platform from making that. I only mean to say that personally, I will not use it in the basic process of video creation. If YouTube ever forces any sort of editing on my videos through AI, I will quit the platform.

But returning to the topic: even though AI has some benefits, I believe that AI in the long run will have negatives that FAR outweigh the positives, so I believe it still should be restricted.

As for translation, well, the AI transcription/translation sucks. I do attempt to put manual captions in my videos as much as I can though.


So far we have identified like 10 ways you and your audience are currently benefitting from AI, but you haven't mentioned any concrete way AI is harming you.

Also, what do you mean by "forces any sort of editing on my videos through AI". Do you mean like, changing the actual content of your videos?


Relying on Authors Guild, Inc. v. Google, Inc in order determine fair use for AI models, there are a few key aspects to consider. The outcome should not supersede, supplant or become a replacement for the original works, nor should they sell portions of them. It should also preferable enhance the sale of the original work to the benefit of the copyright holder.

In this specific case regarding prosecraft, all those criteria might be fulfilled, and it might be so that under those specific conditions the use of any copyrighted work for the creation of AI models is fair (or at least under US law).


its sad he took the site down, it looks like a neat project. it seems to be fair use, so it really is just an issue of consent and keeping people happy. the issue is some people will always be fearful/miserable. should the rest of us be held back in exploring culture because they refuse to play?

anyway i'm sure there will be ten other similar sites by the end of the week...


“That book’s vividness score (TM) is 75% opposed to that other book’s vividness score (TM) which is only 50%! That’s, like, a 50% higher culture score (TM)!”


What if I prefer books with a 50% vividness, so bought the book that was to vivid for me and write a bad review or are online reviews also considered bad Silicon valley tech?

BTW do you write per hand or type writer or do you use word processors with formatting, thesaurus, spell and grammar checking?


With CLOUD and AI happily in the descriptions, with no private desktop version, and your email to sign up, please.


Doesn't seem like the fear of AI more just authors being petulant, didn't we have the same thing with some hack Star Wars book author attacking the internet archive for daring to host a copy of his book before the AI fad.


What I find so odd about all this stuff is the target is very rarely OpenAI/ChatGPT. I understand it can be a useful tool, but if your concern is that AI has scanned your books without your consent and can generate new content in your writing style, then OpenAI is who you should be complaining about.

Somehow the project with tens of billions of dollars in funding from Microsoft gets a free pass, but a two person passion project that makes no money gets viciously attacked and killed. The same thing happened with generative art. The open source tools and smaller projects got served with lawsuits, but somehow DALL-E was not included in those.

Open AI is who people should be targeting since they are the ones who have all the money and the politicians in their pockets to basically stomp out any competition. My real fear is not that people find creative uses for AI on a small scale, but that Microsoft/Open AI builds a centralized system that works on their terms where you are forced to play by their rules and they decide what is fact and fiction.

I haven’t read Zach Rosenberg before, but I put this prompt into ChatGPT, and sure enough, it generated what I presume to be writing in his style:

> Could you write two paragraphs in the style of Zach Rosenberg arguing in favor of shutting down a tool that uses AI to analyze the text of his books?

Did the author consent to Open AI scanning the text of his books to generate new text emulating his writing style? Where is the outrage over that?


> Somehow the project with tens of billions of dollars in funding from Microsoft gets a free pass, but a two person passion project that makes no money gets viciously attacked and killed. The same thing happened with generative art. The open source tools and smaller projects got served with lawsuits, but somehow DALL-E was not included in those.

Isn't it obvious? Bullies always go after easy targets. And nothing is more popular or loathsome than self-righteous causes for bullies.


This is an incredibly biased article, hinging entirely on the assumption that AI training is fair use.


"Fair use" only applies to instances of copying / redistributing. The hint is in the name: copy-right.

There's a notion, which seems to have taken off among creators who are paranoid about AI eating their livelihoods (which it might eat a chunk of) that copyright prevents people from doing anything with works they [legally] acquired other than personally read, listen, or watch it.

That's not how copyright, as it has existed in the past, works. You can do all the algorithmic processing of your ebook collection that you want. You might be able to display small portions of a book to others, depending on the situation.

Quoting one or two paragraphs out of an entire book seems like reasonably safe fair use, but that won't stop a copyright-maximalist creator (or their publisher) from suing you, and won't stop some copyright-maximalist judge from ruling against you, so it's probably best to minimize the amount of content from a book that you redisplay directly. But you can do all the analysis and statistics generation you want, and display those results to others.

It remains to be seen what judges will do with AI generation of works based on ingesting gigantic amounts of copyrighted work. The entire framework of copyright is going to be broken, and until Congress steps in and changes it, judges are going to go every which way. There's no bright line for 4-factor analysis; it's always been a gut-level "is this a reasonable use that doesn't impact commercial sales too much". There's no possible rational way to draw a line. AI models can generate a painting of a new subject only loosely in the style of a contemporary painter, which would not be copyright infringement, or it can generate a near-clone of an existing work with the right prompting, and depending on how clever the prompter is, a lot of intermediate stages of likeness. Who decides how close to an existing work is too close?


Say I make a tool where you can enter the title of a book, and get the full text of the book without paying for it. I assume we all agree that would be illegal, right?

Now say that instead of distributing that tool as an executable, I distributed it as a library. It would contain all the same books as the illegal executable above, but some developers would need to write an actual executable that would use the library. Obviously that executable would be illegal (same as the first one above), but would the library be illegal? I would think that we all reasonably agree that my library would be illegal, because I would essentially be selling the content of books for which I don't own the copyright.

Now what happens when I distribute a trained model that has the potential to generate illegal work? Is it suddenly legal for me to sell it, because some developers write legal executables with it? How is that different from my library example above?


Google Books already does exactly this. It has a library of the full text of millions of books. Users can search for a passage of text and google will display the paragraph where the passage is found.

https://books.google.com

example:

https://i.ibb.co/DCxJpHN/IMG-3143.jpg


Google does not provide the full book, does it? Exactly like they could provide a few seconds of a song, but not the song in its entirety.


No, they don't provide the full book, just a few sentences before and after your search prompt (same as Prosecraft).

In both cases, however, if you had the patience, you could search the last few words of the text returned from your prior query and slowly work your way through the entire book.


A few sentences? For most books I've seen, it's a few pages. Google will block you from retrieving more pages from the same book eventually. Using a VPN and a different account may get around one limit, but I experimented with multiple VPNs and browsers once, and although I was able to get a majority of a book's pages, after that google stopped showing me full previews of any of the remaining pages no matter where the request came from.


It shows you a few pages if you are previewing the book (i.e. "look inside"). But if you are using search, it will show you where your search query shows up in the book, no matter what page the search query is found. This means you could theoretically search a book sentence by sentence, and it will eventually have shown you the entire book. I'm not claiming this is an efficient or practical way to game the system and read books, only that google books does contain the full copy of the book text and can reveal the contents of any passage. This is basically how Prosecraft works (at least what i glean from the article) - it doesn't let you read a whole book, even though it may contain a representation of the full text.


Sure. I really did not mean that specifically for Prosecraft. But the article questions why authors are attacking Prosecraft "because it does no harm". My answer is that authors don't (and can't, really) make the difference in a per-case basis.

At this point what they see is that LLMs trained on their copyrighted material are able to generate similar material thanks to their copyrighted material that was used in the training (that is important!), and they see that they won't get paid for that. Of course they are scared, and they should be. And of course they will now start attacking everything that looks like it is using their copyrighted material as training data.

I really don't get why the engineering world does not get this: LLMs have the potential to ruin people's jobs, it is not clear at all that this is legit (IMO LLMs could not do it without the copyrighted material they used for training, therefore they are derivatives of the original work), and those people are rightfully scared.


I think most people who have thought about it understand the impact AI models seem destined to have on writing (and digital 2d art, soon music, and later other things). In addition to writers and voice artists panicking, see the Hollywood strikes, for instance, and what's currently happening in the corporate world to digital artists.

Copyright is not the correct tool to address it. In the U.S., the basis for copyright law is a clause in the constitution which allows Congress to pass laws to secure for limited time creators' exclusive rights to their creations, for the purpose of "promot[ing] the progress of science and useful arts."

If AI is going to displace nearly all of that content, at least content that's mostly entertainment and less educational (novels, music, images, eventually movies), doesn't that imply that AI is better at generating useful entertainment than humans are? Aren't hypothetical laws restricting training of AI on past works to protect the market for future human-created works a direct violation of the constitution, if AI is better at generating "useful" entertainment (entertainment that people will buy) than humans are?

To whatever extent AI can eventually displace nonfiction content in the sciences and liberal arts, which is more indisputably useful, the same argument will apply; however, it applies to fiction content first, because current AI models have more trouble with facts (trouble in the form of hallucinations, for instance). The threat in non-fiction areas of creativity is more distant than the threat in fictional areas of creativity which are not as bound to reality.


> I think most people who have thought about it understand the impact AI models seem destined to have on writing

Go back to the beginning of social media, and tell me that "most people who had thought about it had understood the impact social media would have on society". It is really not a given. And that is my criticism: we see from history that it is not straightforward to understand the impact of new technology, but we engineers keep making the same mistakes over and over again.

> Copyright is not the correct tool to address it

Maybe not, that's right. I don't think anyone disagrees. The issue - at least from the point of view of artists - is more that some people (including authors and artists) want the problem addressed, and others (including engineers) just want to make money with their new toy and don't care much about addressing the problem.

> doesn't that imply that AI is better at generating useful entertainment than humans are?

I don't think so, no. It is maybe economically more successful, but I think it is clear that what is good for the economy is not necessarily good for society.

> however, it applies to fiction content first

Well... that is ignoring all the black hat use-cases, going from phishing to political mass manipulation, I would say :-)


Why didn't you just say that, instead of posing a hypothetical about software that may itself contain full book text which can be used to display (in this case fair-use) passages to end users? lol

I think the disconnect between your point of view and mine is that I see "training an LLM on copyrighted text" the same as a person reading copyrighted text, which is perfectly legal. And I see violating copyright as a person or LLM reproducing copyrighted work (illegal). But using other works as inspiration for something novel shouldn't be considered illegal, whether a person or LLM produced the work. I would even be fine with literature being treated more like music, where reproducing the essence of a piece of work (i.e. doesn't have to be a word for word reproduction) is considered a violation. But if the LLM creates something completely new, how is that a derivative work / infringement?


> Why didn't you just say that, instead of

Because I answered to a post that was talking about drawing the line for fair use. I just shared my view of how I see it. To me, OpenAI should be responsible for not giving copyrighted material to users if they are not allowed to do it. This means that they should be sued every single time someone manages to extract what is considered as copyrighted material from their software. Because the authors never gave them that right.

You Google Books example is different: the most obvious difference between that Google Books does not pretend that it is their content: they clearly say "here is a passage of this book".

> I see "training an LLM on copyrighted text" the same as a person reading copyrighted text

Yes, I think that is the main discussion point around LLMs. My point is that machines are not humans, and therefore they should not be blindly treated like humans. We should think about the consequence of the machines doing what they do, and decide whether that is legal or not in our society. Otherwise we would give machines the right to vote ("humans can vote, I don't see why machines couldn't").


> This means that OpenAI should be sued every single time someone manages to extract what is considered as copyrighted material from their software.

I agree! If GPT4 is outputting copyrighted material beyond what is considered fair-use (i.e. substantively more than what is provided by say, google books), I agree that is copyright infringement.

Indeed it is about the output, and making stuff available that people would otherwise have to pay for (or more precisely, enough of the copyrighted work that a person would have reason not to pay for the original work, causing a material loss to the original author) - that is a fineable violation imo.

Something else to think about... I work in biotech and have published articles in scientific journals on cellular and molecular level disease sequelae (such articles are also protected by copyright). Models trained on scientific literature are now being used for novel drug discovery and disease treatment pathways. These models are already outputting suggestions that seem very promising. Shall we also not provide these models access to the full corpus of scientific literature? It would significantly handicap these models to not have access to copyrighted scientific works. On one hand, some proportion of researchers will retain their jobs that would have otherwise been outsourced to LLMs (perhaps even myself). On the other hand, some amount of future patients will suffer or die from a disease that would have otherwise been cured.


That actually brings another point: if you train LLMs on scientific papers, at least in some domains it will make it easier to write a lot of papers. I am not an academic, but it is already my impression that there are a lot of low-quality papers out there. What if now many more get generated by LLMs? Won't that be a problem?


The low quality problem with primary research publications is not the writing but poor experimental design, misrepresenting experimental results, shoddy statistical analysis, and putting null results into file cabinets. Summarizing research findings isn't the crux of the problem, so if anything if an LLM can help the author perform a clearer and more concise writeup I'd see it as a net benefit.


One can use Photoshop to create heinous things which would be highly illegal to sell, or even show. Should we ban Photoshop? I think we agree, that no, that would be silly.

Let's assume I train a model on public domain texts only, zero copyrighted material has gone into it. Yet, technically and also practically, the model would have the ability to create illegal, copyright infringing content, given the right input. Should distribution of such a model be illegal? I think not.

Also, let's keep in mind that these models are not archives, that contain the original data verbatim. They are effectively lossy compression algorithms, that capture the essence. Some models are a few tens of GB in size, yet have been trained on TB worth of data, so clearly a significant data loss has occurred, so it's a bad analogy to your "keep a bunch of books as in a library" example.


> One can use Photoshop to create heinous things which would be highly illegal to sell, or even show. Should we ban Photoshop? I think we agree, that no, that would be silly.

This is a different matter, just like saying "we don't make cars illegal even though cars can be used illegally". IMO it is off topic in the context of my post above (or you need to elaborate).

> Let's assume I train a model on public domain texts only, zero copyrighted material has gone into it.

Do you genuinely not see the difference, from the point of view of the copyright holders? If OpenAI could build a version of ChatGPT without using any copyrighted material at all, I agree with you: it should not be made illegal. But the illegal use of it should be. I guess we agree here, but to me that is quite different from training ChatGPT with copyrighted material. The first difference being... well... good luck training ChatGPT without copyrighted material (probably now it's too late anyway, because weights of models trained from copyrighted material are just all over the Internet).

> Also, let's keep in mind that these models are not archives, that contain the original data verbatim. They are effectively lossy compression algorithms, that capture the essence.

Say I creat excutable dat do som kind of loss cmpression tht captred essnce.

First, do you agree that the sentence above is "some kind of lossy compression that captures the essence"? If yes, would you consider it legal for me to use that algorithm on famous books and sell them under my name, or would you think that I abused the copyright of the original material?


A car has the potential to be used for illegal activities. In fact happens every second, but that does not mean we go out of our ways to make them illegal.

That said, it is still an open question if the content generated by LLM is illegal. It only generates content that is inspired by some authors work, it is not recreating their work, only their style. Humans copy each other all the time in writing, music, art, you name it.


Either you completely missed my point, or you genuinely believe that my first executable is legal.

Which one is it?


I do not buy the premisses. Your example states that it can provide the full text of any book. An LLM cannot do that. They can produce something in the same style and setting. When an actual human author mimics other writers styles, it is not illegal, why exactly should it be illegal for an author to use an LLM to do it?


> When an actual human author mimics other writers styles, it is not illegal, why exactly should it be illegal for an author to use an LLM to do it?

There is a fundamental difference of scale. Say I write a blog post about some technical thing I know. You read it, learn from it (and other sources), and then you write your own blog post with your understanding. You may link to my post (if you believe it is heavily inspired from there), or not. But reading my post, understanding the material to the point where you can write about it took time. If you are so fast that you can produce more posts than I do, and your posts are the same quality, then suddenly you become a better blogger than me.

An LLM can automate that. With an LLM, you as a blogger don't have to learn from my post, you can just have it generate the post (you don't even need to read it). In that case, you have put absolutely zero effort learning about the topic, you just had some algorithm copy-paste-and-modify my post. Don't you think that if you publish a blog post that you haven't even read yourself, probably it comes from material written by somebody else at some point?

In any case it is very different. Once a human does it, the other time a machine does it. I don't understand why engineers think we can just transfer concepts between both. You as a human have rights (maybe you can vote), but your LLM cannot. Why couldn't an LLM vote if humans vote? Precisely because it is different, and it creates new problems.

The fact that LLMs can automatically (and at scale) launder copyrighted material is a new problem. Therefore we need to think about that, and to think about whether or not this should be legal.

The day people stop reading my blog and instead use ChatGPT (which is trained using, among the ton of other copyrighted material, my blog), and as a result nobody reads my blog directly anymore, I can assure you that I will stop writing. If 99% of books are written by ChatGPT instead of actual authors, but based on books from actual authors, in such a way that the readers don't really make a difference and therefore go with ChatGPT (because it's cheaper), what do you think writers will do? If they can't live from writing books (because ChatGPT does not buy their books, and people now pay ChatGPT instead of buying the books), then they will just stop writing.

It is a risk that needs to be considered. And it's certainly not an engineering question.


I do not disagree with any of what you wrote. That is also an entirely different line of reasoning than your first argument.

That said, LLMs today, cannot do this in a meaningful way. If an author cannot write a better book than ChatGPT, then that author would not be able to live of their writing anyway. And the authors that use ChatGPT to write a book, but still put the effort into fine tuning it, will not be able to this at scale. You also need someone to line out the plot and twists and turns if it is to be a full length book.

Let's assume that in 10 years, LLMs are at the point where you cannot distinguish between a well written book by an author and one generated by entirely by an AI. Suppose we have two authors, one that is long dead and whos works are public domain and a young one that is just starting. An LLM trained on the author whos work is in the public domain can generate books that is just like the original works. But what if the young author writes in a similar way, is that now legal or illegal to generate the same content? It's impossible to know if the young authors work have been used for training.

My take on it, is that LLMs are pretty stupid. They cannot come up with new and novel things. So if a writer writes something that is different (i.e. new and novel), how do we protect that? We cannot prevent it from being used for training, so the next logical step is to protect it the same way as we protect technology with patents. But that come with its own class of problems, say if two people write the same way independently, only one can have the right to do it. That is not the solution either.

I do not have the answer, but I am certain that trying to ban LLMs, or dictating what and how is not the answer. Perhaps the authors that can write in a new and novel way, and knows how to use AI will proliferate because they embrace it.


You didn't respond to this

> Let's assume that in 10 years, LLMs are at the point where you cannot distinguish between a well written book by an author and one generated by entirely by an AI. Suppose we have two authors, one that is long dead and whos works are public domain and a young one that is just starting. An LLM trained on the author whos work is in the public domain can generate books that is just like the original works. But what if the young author writes in a similar way, is that now legal or illegal to generate the same content? It's impossible to know if the young authors work have been used for training.

which is central to my reasoning. You could go the same way as software patents, but that is not preferable in any way.

---

>> I am certain that trying to ban LLMs, or dictating what and how is not the answer.

> I wouldn't ban LLMs because of copyright issues, though I would let authors choose whether their IP can be used for training or not.

> Why not? Just say that using for training is considered derivative work, and that's it. Now copyright owners just have to update their license to allow for training if they want to, and that's solved. Of course, Big Tech makes less money from that scenario.

Big Tech can train on everything that is "legal" and malicious actors can finetune with a specific authors works and then generate books. You will not be able to detect that and the malicious actor can claim to have written it themself. Then we're back to the starting point.


> We cannot prevent it from being used for training, so the next logical step is to protect it the same way as we protect technology with patents.

Why not? Just say that using for training is considered derivative work, and that's it. Now copyright owners just have to update their license to allow for training if they want to, and that's solved. Of course, Big Tech makes less money from that scenario.

> I am certain that trying to ban LLMs, or dictating what and how is not the answer.

I wouldn't ban LLMs because of copyright issues, though I would let authors choose whether their IP can be used for training or not.

However, copyright is only one issue with LLMs. All the black hat use-cases are a whole other category of issues. And I am of the opinion that technology is not neutral: IMO, it is perfectly fine for a society to ban a technology if it believes that it is globally doing more harm than good.


> Say I make a tool where you can enter the title of a book, and get the full text of the book without paying for it.

Let me introduce you to the Library of Babel[1].

But you need to know the hex! you complain. But that's basically how all of the "AI outputs copyrighted works!!!" gimmicks work. They're impractical unless you know exactly what you want it to reproduce. You can't just casually pick up a copy of Harry Potter like you would in a real library.

So is the Library of Babel illegal? What's the difference?

[1]: https://libraryofbabel.info/browse.cgi


Curiously that is a pretty close description of a normal neighborhood library.


Pretty close, and yet so different.

Don't you agree that I am allowed to buy a physical book and then resell it, but that I am not allowed to buy a physical book, make copies of it, and sell those copies?


The doctrine of first sale would never be allowed today if it hadn't already been enshrined in law over 100 years ago. Stakeholders would insist that creators wouldn't create if people could fund community libraries and thereby shrink the book market. They'd demand the same thing for physical books that they've implemented for almost all ebooks: a pay-per-lend system. The effect would be a substantially higher cost for each loan-out. Congress would go along, just as they went along with a new copyright term of min(120, life+70) years.

In the U.S., in 1909 when the doctrine of first sale was codified, the copyright term was 28 years + 28 years upon renewal, and copyright had to be registered. That would never be accepted by copyright interests today, either.


Great example! I think your library wouldn't be illegal itself, but distributing it would be illegal.


There's also the much more complex part of other jurisdictions. Japan already has largely said everything is fair game in AI.


The article cites the Google vs Authors Guild case (https://www.techdirt.com/2013/11/14/google-gets-total-victor...) which was a total victory for Google. This seems fairly conclusive that the textual analysis here is fair use to me.

> Similarly, Google Books is also transformative in the sense that it has transformed book text into data for purposes of substantive research, including data mining and text mining in new areas, thereby opening up new fields of research. Words in books are being used in a way they have not been used before. Google Books has created something new in the use of book text — the frequency of words and trends in their usage provide substantive information.

Furthermore, is this actually AI training? This just looks like stats based on heuristics to me, I.e., garden variety sentiment analysis.


I think Google Books is cool, but "the frequency of words and trends in their usage provid[ing] substantive information" pre-dates Google Books by a long time. For example, there's a collection of word frequencies in the complete works of John Keats from 1917 [1]. Manually tabulated, too!

[1] https://catalog.hathitrust.org/Record/001023999


Sure, but the Google Books case is a massive and well-funded court case that was a pretty resounding victory for Google and textual analysis of copyrighted works in general, so anybody arguing that this is obviously copyright violation needs to explain why the Google Books case isn’t relevant.


IANAL. But, what's your consideration such that you think this usage of the books wouldn't be "fair use"?

e.g. in the US https://en.wikipedia.org/wiki/Fair_use factors of fair use are "purpose/character of the work", "nature of the copyrighted work", "amount/substantiality of the copyrighted work", "effect on the market for the copyrighted work".

The website shows a few statistics computed from a book, and a few excerpts from the book.

I'd think a consideration of those fair use factors favours the website: e.g. you're not going to look at those statistics/excerpts instead of reading the book. The website only shows a small portion of the book. The website's intention is to be educational.


I think the site had a better claim by far than AI training does.


There is no “AI” in the project. It literally just seems to do Bayesian sentiment analysis of books. It does NOT then mine that data to score OTHER books. Please actually read the article next time before commenting


I did in fact read the article. The author isn’t concerned about the book project. The author wants to use it as a springboard for complaining about AI fear, and his argument fundamentally hinges on the claim that AI training is fair use. He’s claiming the book project is comparable to AI training.

I think the book project has a better claim to fair use than AI training does, although excerpting entire pages is maybe too far (but the word highlighting does feel like commentary).


If I read the book, take notes about the number of times $x happened and later post them to my website, are you fine with that?

if I automate that process with a script that uses regular expressions, are you fine with that?

if I add a neural network to my script and it spits out the same numbers, are you still fine with that?

hopefully you see the madness in your argument.


I’m thinking specifically of generative AI’s fair use claims. I thought the book site (and all of your extension examples) was clearly fair use.


Why it wouldn't be? Indexing is fair use. I think where it gets murky with fair use would be if AI could actually plagiarize the book, but other than that, it should be fair use.


Indexing doesn't replace the book, it's just a different way to read it. The index is useless without the book itself.


Neither does knowing which parts of the book are darker or happier.


Unfortunately it seems majority of HN clings on "AI is fair use!" the same way GIFs (of movie and TV snippets) are "perceived to be fair use" (but is simply not fair use in most cases).


"majority" and "clings on" seem entirely arbitrary and perjorative. Why not make your actual point instead of just beating up a straw man?


And the other half of HN loses their mind when AI is brought up, completely missing the point made quite clearly in the OP that _no_AI_was_involved_in_any _of_this_.

The AI hysteria on this site is ridiculous.


The ai moving goal! No 2023.ai was used. Maybe some 1993.ai. Chat GPT will be considered “just stats” in 2033.


Why do you feel that the Google vs Authors Guild ruling cited in the article isn’t relevant?


There is no AI training here. Did you read the article?


I honestly don't understand how people can refuse the idea of that some parts most certainly can be reviewed with the help of a computer. There wasn't a "how good is this book" score because a computer might not be able to tell that yet, I don't understand the issue in looking at the number of adverbs in a book with the help of a computer.


Software people are intensely susceptible to the McNamara Fallacy:

“But when the McNamara discipline is applied too literally, the first step is to measure whatever can be easily measured. The second step is to disregard that which can't easily be measured or given a quantitative value. The third step is to presume that what can't be measured easily really isn't important. The fourth step is to say that what can't be easily measured really doesn't exist. This is suicide.”

— Daniel Yankelovich, "Interpreting the New Life Styles", Sales Management (1971)

https://en.m.wikipedia.org/wiki/McNamara_fallacy

This is often justified as an attempt to be objective, but I think the actual reason is that it’s hard to deal with unquantifiable properties.


Is there a counter fallacy along the lines of "if we can't measure everything, we shouldn't measure anything"? It can still be an interesting, fun and informative exercise while keeping in mind that it may be limited.


> I don't understand the issue in looking at the number of adverbs in a book with the help of a computer.

I think that authors see an issue with their IP being fed to an algorithmic black box that will certainly make a lot of money, but authors won't see any of it, and for which engineers say "don't worry, it is not stealing your IP, just let us make money the way we want and shut up".

What if OpenAI could prove that their algorithms are not stealing IP? I think it's easy: they just can't. They don't even formally know what their algorithm can and cannot do.

If I was an author or an artist, I would most definitely want a new kind of legal way to make sure that if I don't want my work to be used as training data, then it most definitely isn't. Of course that's not really possible, because who can audit the algorithmic black box and see if my work was used in the training?


How was the tool in question, the tool the book authors were outraged about, "stealing their IP"? It's also unclear how it would make any money at all -- indeed, the tool's author even said it made no money.


I believe that the authors are not fighting against that tool in particular. They are fighting about generative AIs being trained with their material without their consent.

Which is totally legit to me. Maybe this tool is just a collateral damage of a much bigger debate, I don't know. The fact is that in the bigger debate, it does feel like engineers don't seem to care much about the artists. Why would the artists care about individual engineers and individual use-cases?


Someone should recreate the website and change the word "AI" to "underpaid worker" and everything would be fine.


"crowdsourced"


I just think it's really funny that a third or so of the article is the author struggling to figure out why this would be useful to anyone.

> scanned and analyzed a whole bunch of books and would let you call up really useful data on books. [...] Frankly, all of that sounds amazing. And amazingly useful. Even more amazing is that he built it, and it worked. It would produce useful analysis of books.

> This is all quite interesting. It’s also the kind of thing that data scientists do on all kinds of work for useful purposes. Smith built Prosecraft into Shaxpir, again, making it a more useful tool.

Author's general illiteracy aside, he's really giving the game away here. I can't even think about the ethical implications of the project, because why would I care to count the number of adverbs and passive voice in all books ever, and why would you need a state of the art LLM-powered AI to do it?


The ethical implication of indexing books?

Also where do you read the author of the website was using LLM?


he is not even using an LLM, which is kind of the point. the tool is thrown in and judged as an AI tool when it is just simple statistics that anyone who studied some math can build with a little bit of effort.


Imagine seeing the trends in books throughout certain time period, wars, etc. Or larger trends over the history of all written works. Or all kinds of other neat and useful information that can impact decisions we make today. Do you think all analysis of things you aren't personally interested in is useless?


I hate "did you even read my comment" replies, but again, you're pitching me on "trends throughout a time period", "larger trends throughout history" and "all kinds of other useful information that can impact decisions".

I don't think a person without interest in literary analysis would make this comment. On the contrary, I'm specifically calling out building non-sensical tools with little practical uses because the author of the tool does not engage with literary works beyond the most superficial layer. And because I don't believe quantitive analysis of art is benificial to culture and society, but that's a broader subject.


I generally appreciate TechDirt but this is a very weak argument, the logic is inverted. That some people who are opposed to LLMs were wrong to attack a tool that has nothing to do with LLMs does not allow us to conclude they are wrong about LLMs.


The tool looks like it was useful to a certain kind of person. If it would actually make money, I would gladly (and probably easily) replicate it because I don't really care that much if Internet randos hate me. But I don't have a good idea for how to keep the return / effort ratio high on this.

I don't need consent for a lot of this, and I probably wouldn't bother. If I made a "List of books with terrible sentences" I wouldn't ask for opt-in or even bother contacting the authors. I will just make the list and quote the sentence.

The law and public opinion is on my side, though I only need the former.


Copyright issues aside.

My personal opinion is that these tools are mostly useful to suck the "soul" out of a book. They give you templates and stuff and useful statistics to help you go to the lowest common denominator.

The problem is more visible in the movie industry, where they have had script templates for a hundred years now (actual time interval pulled out of a*), but it's starting to show up in books too.

For those happy to "consume" Netflix series and Marvel movies that are indistinguishable from each other except maybe filmed with different actors, it should be fine.

If you want originality in your entertainment it's sad news.


I wonder about a parallel for paintings. What if there was an analysis stating exactly the brushes the painter used, the number of strokes, the exact pigments, etc? Would that, in your opinion, "suck the 'soul' out of a painting"?

I could see this as a brilliant learning tool. A tool to provide deep insight into something that would be very challenging to quantify personally. I think all this would make future authors better, not worse.


The cave paintings in France have been studied this way, starting with Leroi-Gourhan's work and then accelerating with the use of computers. It's defi itely shed some important light on the artists who made them tens of thousands of years ago, and I don't think it made the paintings any less wonderful.


> Would that, in your opinion, "suck the 'soul' out of a painting"?

Out of the analyzed painting, no.

> A tool to provide deep insight into something that would be very challenging to quantify personally.

Yeah, and will prevent the would be creator from developing a personal style. No matter what medium they create with.


Does an artist not see thousands of artworks as they develop their style? Do artists not experiment with hundreds of styles as they hone in on something they want to be theirs? Is the internet, a tool that lets them see even more artworks, preventing creators from developing their personal style?

As I understand it, we've seen an explosion of different dancing styles as various apps let users share their craft with others - a quicker evolution of a greater variety of dancing. And I'm confident dancers use technology like playing videos at slower speeds to analyze others' movements.


You're right, but there's a large difference between your brain doing the analysis and a template or computer wizard telling you what to "create".

[Link picked at random] https://www.studiobinder.com/blog/three-act-structure/

Do you really think this did wonders for movie originality?


I've heard the phrase "you have to know the rules to break them" (or something similar). The next generation of creators may look at that and choose to change their story telling on purpose, or exaggerate the curves, or something else ... that's what creativity is about - trying variations.

Just because Hollywood is pumping out movies you find unpleasant doesn't mean art is dead.

You and I probably agree more than this conversation suggests. You're right that people can and likely will stick to patterns that are successful, and that will mean there may be more copy-cat cookie-cutter stuff out there. But I strongly suspect that it will allow other creators to rocket past old mistakes or common tropes towards something novel.


The bar for calling yourself a writer or an author is incredibly low. Doing so doesn't qualify you to have an opinion on much of anything.

These people are just joining an outrage wave in an attempt to advertise, e.g. "How dare this project touch BOOK TITLE written by Me! (chain of hashtags)". They're not concerned at all about what they are breaking along the way.


I like this article about internet outrage being bad that seems to exist solely to use that conceit to try to drum up internet outrage


I think it's wise to take the concerns of the creative community seriously - after all their "labor of love" [1] matters immensely, without it LLMs are useless.

matters not how much the coder "loved" the project, or did yoga, or that they've not made money for years, after all most book authors aren't exactly raking in the money either.

also like many thinks in life, some tools/projects/startups etc just stop being needed/used and new ones/competitors take over. there's nothing to say that since tool X is using A.I. therefore it has to be adopted by one and all; smiles all around.

google has 'right to be forgotten', also looking into 'machine unlearning' and it's common for platforms to honor user's request to remove their data / close their account.

[1] From OP: destroying what had been a clear labor of love and a useful project


The thing with this project is that it had no conceivable way of threatening the original authors, financially or otherwise. The analyses it produced weren't replacements of the original works, and served as a writing tool, not as something that generated new content.

In this case, the shutdown seemed to hinge entirely on the irrational fear of anything with "AI" written on it and an unshakeable conviction that this clearly transformative use case wasn't fair use.


>> had no conceivable way of threatening the original authors, financially or otherwise

how so? what's inconceivable about it?

>> seemed to hinge entirely on the irrational fear

how are the authors "fearing without reason" or "illogically fearing" ?


> how so? what's inconceivable about it?

Authors make money through sales of their work. This tool was a writing aid that analyzed text and included some copyrighted works in its dataset. There was no way of retrieving these books in full, and the excerpts that were allegedly shown to the users were used in an analytical context, unlike the original works. So, this website couldn't replace ownership of the actual book for readers, and had no capacity to hurt their sales. Basically, the way the service used this data should be transformative enough as to not have any impact on the authors.

> how are the authors "fearing without reason" or "illogically fearing" ?

I called it "fear" because there was no strong argument on the authors' side as to why this tool is bad. I called it "illogical" because I think that it's no coincidence that this controversy only came up now, in 2023. Back in 2017 and onward, the existence of this tool didn't appear to generate pushback. My pet theory is that in 2023, now that we have good generative AI, the advancements have spawned an entire subset of people that view anything "AI" as inherently tainted and immoral. The original complaints seem to lack understanding of this tool and multiple people have conflated it with generative AI, despite it having nothing to do with that.


>> Basically, the way the service used this data

it's not about the past (2017...), the authors are concerned about how the dataset could be or is likely to be used from now on.

many tech projects these days have or are thinking about integrating third-party A.I. providers in their services, either to harness the power of their large datasets or their large user-base. I think it's great if authors/users opt-in to this, but likewise I agree with those that want out (opt-out).

>> I called it "fear" because there was no strong argument on the authors' side as to why this tool is bad

their argument doesn't have to be a peer-reviewed journal, it suffices to say "i don't want my books in your dataset"

>> I called it "illogical" because I think that it's no coincidence that this controversy only came up now, in 2023. Back in 2017 and onward, the existence of this tool didn't appear to generate pushback ...

6 years have passed since 2017, life moves on, it's natural for things to change e.g: the project's code, updating servers, partnerships, emergence of third-party tools/libs/services etc etc.


> the authors are concerned about how the dataset could be or is likely to be used from now on.

The service in question wasn't introducing anything new that'd appear to justify all the recent pushback. It feels like you continue generalizing your statements, while the discussion topic is about what made Prosecraft specifically so preposterous that it warranted the outrage.

> their argument doesn't have to be a peer-reviewed journal, it suffices to say "i don't want my books in your dataset"

It's kind of a blunt statement, but why should they have a say? For example, say I create a website where I publish technical analysis of famous literary works, including basic statistics about a book and a review. Should the authors be able to just take that down? This use is legally protected (as is creating a dataset), so allowing authors to restrict this use seems as arbitrary as allowing them to say that no person can ever bring their books into the country of Moldova or that no one over the age of 50 may read it.

> it's natural for things to change e.g: the project's code, updating servers, partnerships, emergence of third-party tools/libs/services etc

And yet, in this specific situation, all of this is conjecture. Nothing about the project changed in some significant way in 2023 that would warrant this. Further proving it is that the people that are against Prosecraft don't seem to bring up any specific changes or reasons for their stance, only that it is "AI".


as I said earlier "6 years passed" since 2017 and the arrival of A.I. is causing authors (to say the least) real concern. the industry/ecosystem around X is enough to change to affect those associated with X.

you offered a "pet theory" (as you say) further up, but not much by way of proof that their fear is irrational.

i think we've reached in impasse here, the points are re-cycling.


> the industry/ecosystem around X is enough to change to affect those associated with X.

I think that you keep trying to generalize this argument because there's nothing that's strongly damning about Prosecraft. If these authors have a concern in regards to newer AI tech, they should address these concerns to services that actually implement that tech, not a tool that has peacefully existed for years and is guilty of nothing but being vaguely associated with these newer developments. It's like if Ford released a new car that could go 1000km/h, and then I, in protest of its obvious lack of safety, go to a museum and smash up a Model T. If that's not irrational, I don't know what would be.

> you offered a "pet theory" (as you say) further up, but not much by way of proof that their fear is irrational.

There's no concrete proof because I didn't find anything that I could even try to disprove. If you read through the anti-Prosecraft posts, they lack substance in bringing up how exactly this service can financially harm authors. The only complaint seems to be that the authors don't want it - even though the use of their works appears legal and is perfectly consistent with what many other services have done in the past. Thus, my only conclusion was that at least some of the hate stemmed from Prosecraft just being vaguely "something with AI".


One thing that stuck out to me after reading the initial vitriolic Twitter thread from the author saying "How dare you!" is that the author is a lawyer by day, which I find interesting, since they seem to be very uninformed about copyright and fair use.


The concept is fabulous, and I do agree that there were some gut-kick reactions. Any tool that can help a writer progress their craft is a welcome addition, as far as I am concerned, but there are consequences that probably even the creator did not think about. Tools are never used the way the creator envisioned. Being "cool" is not a reason to embrace. I can see both sides, but I really did not understand how the tool worked, and until I do....I am going to struggle with which side of the fence to land.


open source it on IPFS and walk away, maybe call it something else and just drop the hashes on 4chan anonymously

this reaction deserves the inevitable, as his only problem was it being attributed to him


How can he do it anonymously when we already know he's the only one who has it?


because nobody has seen the code, but everyone has seen what it does and can think of multiple ways to do it with AI


Lie and claim to have been hacked.


> I know we live in an outrage era, where the immediate reaction is to turn the outrage meter up to 11. I’m certainly guilty of that at times myself.

I’ve learned to feel my outrage, while refraining from expressing it.

I will usually avoid using anything more than vague terms to express ire or rancor, and reserve my eloquence for positive commentary.

Some ITGs probably think that I’m a “wuss,” but I learned, long ago, not to poop where I sleep.


I'm not a lawyer, but neither is this Mike guy. I'm quite suspicious about how confident he is in stating that all of this is legally fine. From the creator himself:

"When I ran out of books on my own shelves, I looked to the internet for more text that I could analyze, and I used web crawlers to find more books."

I'm annoyed by the phrasing. Because you're obscuring that you pirated commercial books. A commercial book is not open or public data. You're supposed to pay for it to get access to it. This assumption that just because something can be found on the internet, it's cool to just take it and do whatever the hell you want with it, is worthy of push-back. Especially in the case of data that isn't public at all (not by intent anyway). This is not the same thing as scraping Wikipedia, which allows for it and has supportive licensing for it.

I believe that in the majority of cases where we're talking about entirely new use cases such as big data, deep analysis, AI training, etc we need to move to an opt-in model. Unless explicitly specified, you have no permission. The opposite situation is ridiculous.

I do agree that the mob-like type of criticism deserves criticism in itself. Trigger words like "AI", "crypto", "white techbro" converting adults into sadistic bullies is a sad thing to watch.


Let's split the discussion though.

The authors clearly didn't support the idea of the project, regardless of the source for the data.

How the developer acquired the data is a different discussion. Unless you have clear proof they pirated the content, why disparage them?


AI is going incredibly fast, but package management is going incredibly slow. I have a beefy GPU and I should have all this AI stuff at my fingertips, but package managers are not working along and I'm too tired to get into another dependency hell.

This is what's killing useful tools more than fear of AI.


as soon as we can find an economic paradigm that doesn't rely on taxing the trade of exclusive assets we can have a digital renaissance.

the thing with digital information is we can all copy it. the problem is that our economic paradigm doesn't deal with infinite-copying well.

so far the only 'solution' I see, is figuring out ways to restrict, limit, and restrict the copying capabilities of digital stuff. which is not a solution, but a rejection of the 'novel' possibilities enabled by digital technologies.

more cynically, it's not a rejection, but a capture. a few get to reap the huge benefits of digital copying, while the rest are relegated to collecting, trading arbitrary tokens and using them to pay for the permission to copy digital assets; which is completely dumb, I will die on this hill


Who'd have thought a bunch of authors would be offended by critique and/or analysis of their work by an AI? Seems petty, I guess that's because only the best works get heavily scrutinized, e.g. Shakespeare's plays and these are hardly that.


If its fair use to find stats on a book/corpus of text, isn't it also fair use to run a transformer/something similar, since really its just predicting/modelling what that piece of text would output next, given context?


Knowing the percent of adverbs in a book is not helpful to any writer. No writer who is worth any ink they write with thinks in percentages or thinks of any of the nonsense in that article.

P.S. Stop stealing shit from writers to train your thing.


I believe that if you are using my book as data/model/whatever for something that makes money, then I want a piece of the action.

If it's just a paragraph, fine go ahead. But the whole book? Give me money.


Feels like a nightmare. What about a book club passing around a copy of your book. Want to charge them licensing fees? Story time at a library for children. Licensing fees for them too? Review in a pay-walled newspaper. Fees, again? NYT derives value from all those reviews you know.


So you want money from people who read books and writes reviews? I don't think that's reasonable.


If I read a book and learn something from that I have updated a neural network in my brain. You don't get paid for that (I am not, as an example, sharing any of my income with the people who wrote my college books).

If I write a non-fiction kids book about a boy who finds out that he is a wizard and goes to a secret wizard school, I don't owe JK any copyright money.


At two extremes, say:

If I read your books and use them as a strong foundation from which to write stories (e.g. I don't do any worldbuilding, & just re-use settings/characters without my own introduction), it seems reasonable that my derivative would need some kind of license from you.

If I read your books and write book reviews about them (including small excerpts from the book), it seems unreasonable to require your explicit permission.

I think the website mentioned (which displays stats about a book, and a couple of excerpts) falls far closer to the latter case than the former case.


you want money if your book is used for literary analysis?

and even if you could get money for that, it would not be much. if the tool analyses a million books, and then the service makes one million in revenue, no more than $1 of revenue can be attributed to your book. and only a fraction of that would go to the book author because most of the revenue goes into developing and maintaining the tool. so you'd get 10c. you would make more money by spending your time writing actual prose that can be sold, instead of complaining about this particular tool.


No.


Going all the way back to the Luddites, there will be those who seek to misrepresent their opposition as being “afraid of progress”.

Much easier to find a friendly podcaster than engage with the actual difficulties with what you’re doing.


It looks like the author of the tool took the it down himself, voluntarily. If this even counts as "killing" the tool, the author did it, not fear of AI.


Quite apart from any content/copywrite issues, this wasn't a useful tool.

It was a badly-made tool that didn't identify anything valid or worthwhile. The measurement of passive voice is a particularly clear example: that's not what passive voice is at all, and anyone who understands passive voice knows that it shouldn't be frowned upon [1].

"Just add tech" is not a recipe for usefulness.

[1] http://www.lel.ed.ac.uk/~gpullum/passive_loathing.html


Similarly, saying that something is not true does not necessarily make it not true.


Yes, that's why I provided the very clear example of passive voice.


Right, but passive voice also does not render your assertions necessarily true, though it may be able to increase the likelihood of them being perceived as such.


> The Gizmodo article has a ridiculously wrong “fair use” analysis, saying “Fair Use does not, by any stretch of the imagination, allow you to use an author’s entire copyrighted work without permission as a part of a data training program that feeds into your own ‘AI algorithm.’” Except… it almost certainly does? Again, we’ve gone through this with the Google Book scanning case, and the courts said that you can absolutely do that because it’s transformative.

I could be wrong but I'm pretty sure Fair Use doesn't mean you can download a dump of Library Genesis and feed that into your system!


You are wrong.




Consider applying for YC's Winter 2026 batch! Applications are open till Nov 10

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: