Fear of AI just killed a useful tool (techdirt.com)
298 points by donohoe on Aug 9, 2023 | 297 comments


I found this article frustratingly vague on how prosecraft.io actually worked. As far as I can tell, the author scraped the web for books, including in-copyright books. Then he analyzed them with "classical" natural language processing techniques, rather than transformers or deep learning. He appears to have retained the books he scraped for future analysis. The site itself seems to use only snippets.
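(To make that concrete: the kind of "classical" statistics at issue is roughly the following. This is a hypothetical Python sketch of my own, not his actual code.)

    import re
    from collections import Counter

    def basic_prose_stats(text):
        # Crude word tokenization, typical of quick "classical" NLP.
        words = re.findall(r"[a-zA-Z']+", text.lower())
        counts = Counter(words)
        # Common heuristic: treat -ly words as adverbs. It overcounts
        # ("only", "family"), but this is the flavor of metric involved.
        ly_adverbs = sum(n for w, n in counts.items() if w.endswith("ly"))
        return {
            "word_count": len(words),
            "vocabulary_size": len(counts),
            "ly_adverb_share": ly_adverbs / max(len(words), 1),
        }

    print(basic_prose_stats("She ran quickly. He only spoke softly."))

Nothing in that output can reconstruct the book, which is why it looks so different from generative AI.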

However, the apology [0] says that the creator did not "intend" to participate in AI that can "create zero-effort impersonations of artists." I'm not sure if the wording is unintentionally vague, or if there is some way his project could be used in that way.

For what it's worth, the Computational Story Lab's hedonometer [1] seems to have largely out-of-copyright books from Project Gutenberg, plus the Harry Potter series.

[0]: https://blog.shaxpir.com/taking-down-prosecraft-io-37e189797...

[1]: https://hedonometer.org/books/v3/863/

Edit: Apparently he was working on an LLM project. https://twitter.com/stealcase/status/1688721685585809408. It's unclear whether he was planning to use the books he scraped (although as @stealcase points out, GPT-NeoX itself was trained on books that were pirated).


If he says he didn't do something, the pitchfork mob will simply tell each other that he is lying. They will do this in the most confused and twisted way possible, driven by a lack of understanding of what was happening combined with a need to stoke outrage and thereby advertise their own work.

If he says he didn't intend to do that thing, this is still compatible with a later update that he didn't do that thing, but immediately dampens the outrage machine. The reader who knows nothing about either side relaxes -- "No need for me to get worked up, because we won". Conveniently, saying he didn't intend to do the thing is also compatible with a later reveal that he was doing the thing (perhaps for later release, since he wasn't clearly doing the thing here).

Therefore, regardless of whether he was doing what he was accused of doing, this is the lowest energy response, and probably the default unless this was the hill he wanted to defend.


The best response, for us all collectively, is to always ignore everyone's opinion online. There is zero value in anything on reddit, twitter, facebook, the media these days.

Just ignore it. All of it. Outrage or not.

I see downvotes, but I mean it. You know who you listen to? Your friends. Your neighbours. Your local community. You listen to PEOPLE, not sockpuppets. You listen to legitimate human beings, not AI generated blather, or curated news stories, or groups working together to generate hate, outrage, to stoke anger, upset.

You listen to actual, real PEOPLE.

You want to go to reddit? Twitter? Anything? Fine. But treat it as 100% fiction, pure entertainment, and never let it affect YOU.


> Your neighbours. Your local community. You listen to PEOPLE, not sockpuppets. You listen to legitimate human beings, not AI generated blather, or curated news stories, or groups working together to generate hate, outrage, to stoke anger, upset.

You either have a significantly better social circle than I do or are glossing over a bunch of nuance.

Some of my family back east have been getting their brains rotted by Fox News and the like for literal decades now.

Some of those family members were also shaped by insular groups (their 'community') with fucked-up thinking.

As for the friends who I choose to associate with today, I indeed like and prefer their opinions (probably why we've self selected to be friends), but I assure you they have nonzero exposure to online communication.

I agree Twitter is shit and always has been, due to the lack of context, but I often prefer reddit to real-life conversations because I can actually take my time to organize my thoughts and even dig up sources. It also exposes me to more diverse viewpoints than I'd get solely in my physical circle of friends.

Now, I don't really use reddit after the most recent shenanigans, but here I am back on Hacker News like I was on Slashdot before reddit. I don't think I'd ever want to abstain completely from online communications or forums.


I said nothing about abstaining. In fact, I am posting!

Ignoring online comments, especially criticism, does not mean abstaining from posting. And note, context is important: note what I am replying to.

Simply put, on a medium where one person can appear as 1,000 through sockpuppets, where one person can rally 1,000 useful idiots with one disingenuous post, one cannot care what is said.

Ignore it.

We already have 30-year-old adults trying to discuss political nuance online, not realising that they may be piled on by a dozen 8-year-olds. People presume the person behind the text is real, that they are roughly their own age, or at least an adult, and that they are debating in good faith.

None of this is necessarily true, and in any large group of responses, the above chicanery is happening.

No one should care what a bunch of "people" on Twitter say.


I'm genuinely curious, why do you post here if you have this mindset?


This is not Twitter, with its tiny little snippets of text, which are useless for meaningful communication, and its culture which incites groupthink. This is not Reddit, with its hostile, hate-filled voting system and its peer-pressure-laden culture. This is not Facebook, literally designed to drive hate, and anger, and upset, to increase engagement.

This is Hacker News. It's not perfect, but it's far more palatable. And it's certainly not like any of the above.

Lastly, my advice still applies. When I detect hate here, I ignore it. When I detect peer pressure, I don't care.


Your two comments here could each fit in a tweet, and Reddit has basically the same voting mechanism as HN. There is also plenty of groupthink here too. That was basically the reason behind my comment: to see what distinction you are making. People here seem to often deride social media while pretending HN is not social media.

HN is effectively just Reddit with only a few subreddits and moderation that is stronger than average, but within the range you will find on certain subreddits.


And a quill and a ballpoint pen both write in ink.


Well, there is hate and then there is criticism, and lots of people confuse the two, to the point where they just label any critics as "haters". Hate is worth totally disregarding, but criticism, as long as it's genuine and offered in good faith, is probably valuable.


One of the problems is, if you're... I don't know, say, in your neighbourhood, or maybe at a small local gathering. You do something which deserves criticism, or a mild but polite rebuke.

You hear it a few times.

Online, you may hear 100k of the same, and most of the politeness and empathy is gone. It's all about scale, about reach. And the more of that there is, the less empathy there is.


I still remember the days when I saw Reddit as simple entertainment. In the last 10 years its reach has only grown. I identify with your sentiment, but have also become entrenched deeper inside the mindset of the online. The top comment tells the truth. The votes count. A judgement based on groupthink instead of individual discussion.

Many here don't agree, but I wonder if their attention to the beautiful divide between people has been obscured by groupthink.

4chan /b/ is such a toxic community that I never spent long there without remembering the words at the top that reflect its values: >The stories and information posted here are artistic works of fiction and falsehood. Only a fool would take anything posted here as fact.


> However, the apology [0] says that the creator did not "intend" to participate in AI that can "create zero-effort impersonations of artists." I'm not sure if the wording is unintentionally vague, or if there is some way his project could be used in that way.

This seems FUDdy. "Intend" isn't in the apology at all, and the wording that is there says clearly that generative AI came after prosecraft, so there's no way the tool could be used for it.

> It's unclear whether he was planning to use the books he scraped

This also seems unwarranted. The tweet about fine-tuning an LLM came 5-6 years after the guy made prosecraft; why suggest they might involve the same dataset?


I apologize for the quotes around intend. I wrote it without quotes, then forgot it was a paraphrase and added them back. Unfortunately, I cannot edit my comment to fix that.

I do think “intend” is a reasonable paraphrase of “never wanted to.”

(Edited to add) I don’t think prosecraft was a finished project, and he was definitely still working on his other tool for writers that incorporates some of the same analysis.

> The tweet about fine-tuning an LLM came 5-6 years after the guy made prosecraft; why suggest they might involve the same dataset?

The reason being that he had mentioned he was planning to use the scraped books for future analysis.


So you can edit your OP and this comment but can’t edit “intend”?


HN comments cannot be edited once two hours have passed.


Even Facebook's Llama was trained on books3, a dump of pirated books.


It's so mind-blowing to me that it made it past corporate legal. I don't get what defense there could be besides "lmao try and stop me, nerds"


"...but, but, that's illegal." Jenkins huffed out in an almost whisper.

"Jenkins, take out your wallet, and place it on the table. If you speak, I'll fire you on the spot. Do it." the CEO spat.

Jenkins trembled with rage but reached his right hand deftly into the inner folds of his suit, and produced his billfold. He waved it a bit and set it in front of him.

"Good, now push the package forward, and keep silent." the CEO growled at Jenkins.

Jenkins pushed his wallet out further from himself, and with a last shove pushed the wallet beyond his fingertips. Jenkins wiggled his fingers and sat back, raising his eyebrows at his boss's boss.

"Harold!" CEO Barcliff barked. "Take the package to logistics."

"Right away, sir." Harold nodded, retrieved the package and left the room.

"What the hell Barcliff!" Jenkins exploded.

"I suggest you call the card companies, and the DMV, for replacements." Barcliff stated in a deadpan low growl.

"Screw you Barcliff, call him back. I want my wallet. That's my life in there man!" Jenkins now yell-whined.

"Connely, Mirasu, Fender, Scotts, did any of you see this wallet Jenkins is whining about?" Barcliff almost sneered.

"No sir, I observed Jenkins deliver a package to Harold to be taken to logistics." Scotts spoke out before the others could speak. Nods all around though.

"You see Jenkins, illegal is only what you can prove. You can't prove shit. Just like they can't prove shit. Now get out of here before I fire your ass. Don't forget to say hi to the wife for me." Barcliff purred.


Fair use is basically the whole defense.


Fair use only applies to what you publish, so that your new work B can be seen as not infringing on the copyright of the original work A.

What about the part where a Meta employee used Meta computers to download copies of the books?


I am a bit confused about what's so outrageous about this tool. It seems that both the book authors and some of the people in the discussion here conflate rudimentary statistics about a book (counts of words of a certain kind) with the latest wave of generative AI. They are very different in both the value they provide and the risk they pose to book authors.

The tool that book authors got outraged about only provides basic metrics, not dissimilar from other metrics such as "page count", and can't be used to produce new content which could deprive the book authors of revenue.


If you read through the angry Twitter thread it's clear that almost everyone thinks that either a) the site is a pirate site that lets you download books or b) that the site lets you generate works in the style of an author. Neither of which is true, of course.

There are a handful (like < 3 people) who seem to understand what the site actually does who were still angry because the creator seems to have pirated the books. I actually don't know about the legality of something like that. Surely providing pirated books is illegal, but IDK if acquiring pirated books actually is.

I think it's clear though that most of the outrage would still be there even if the author had purchased each and every book.


> I actually don't know about the legality of something like that.

Techdirt's analysis of the legality seems correct to me. TL;DR is that it seems legal.


[flagged]


Having looked up what the word priapism means, I'm now trying to imagine what moral priapism would be, and I can't think of a definition that makes sense. What is it? Is there a simpler word you can use?


GPT4 sez: "This sentence uses figurative language, so it doesn't refer to literal meanings. "Fear" is a stressful emotion induced by perceived danger or threat. "Acute" suggests a sudden onset or intensity. "Moral priapism," a more abstract concept, suggests an abnormally extended or excessive fixated morality, much like the medical condition priapism refers to an unwanted, enduring erection.

Overall, the sentence means that fear can lead to an exaggerated or obsessive moral response. This implies that when someone is afraid, they might stick rigidly to their moral code or make moral judgments more extremely or rigidly than they would in a state of calmness."

seems fair

Edit: My gloss of the summary: "Fear creates a hardon for scapegoats among the pitchfork wielders."


Permanently aroused to look for any injustice.


If you want to do this kind of thing, let authors opt-in (or publishers).

Yes, it will take effort and probably go slow, but if the tool is really useful and amazing, it should be doable.

I suspect the authors are put off by a couple of things:

- the text of the works scanned seems like it may be from pirated sources. That poisons the project, no matter what it does with the scans, for many authors.

- the use of these scans in a commercial product

The article itself is clueless… it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.


> If you want to do this kind of thing, let authors opt-in (or publishers).

If it's fair use, why should you have to do that? The same copyright law protecting authors' ownership rights over their art also provides "fair use" to other people. Someone may disagree with current fair use laws (and I suspect many outraged here do not), but that's a broader issue not related to this particular tool. It just 100% seems like misdirected AI outrage.

> the text of the works scanned seems like it may be from pirated sources.

Do you have a source for this? I didn't see that mentioned in the article.


> Do you have a source for this? I didn't see that mentioned in the article.

The person who runs prosecraft says "I looked to the internet for more text that I could analyze, and I used web crawlers to find more books." [0]

I'm just inferring, but if they had, say, purchased each of these books, or borrowed them from the library, or only sourced from sites that ensure the copyright is satisfied, then they might have mentioned it.

(FWIW, the blog post says the other source for the 25K works was their personal library, so I'm assuming the bulk of the 25K come from the internet, though I know some people have prodigious personal libraries.)

[0] https://blog.shaxpir.com/taking-down-prosecraft-io-37e189797...


> If it's fair use, why should you have to do that?

You may not be legally required to do that, but it can be an excellent move that benefits you nonetheless.

Much like how Weird Al isn't legally required to get permission to make a parody of a popular song, but he does so anyway.

But in this case, I don't think you even need to invoke Fair Use. I think what he did simply isn't a copyright violation in the first place.

In reality, the legality of this was never the issue anyway. The issue was that doing this made the authors angry, and the dev didn't want that.


> also provide "fair use" to other people

"How much of someone else's work can I use without getting permission?

Under the fair use doctrine of the U.S. copyright statute, it is permissible to use limited portions of a work including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports."

https://www.copyright.gov/help/faq/faq-fairuse.html

Limited portions, not the entire work.


Quote from the OP:

> The Gizmodo article has a ridiculously wrong “fair use” analysis, saying “Fair Use does not, by any stretch of the imagination, allow you to use an author’s entire copyrighted work without permission as a part of a data training program that feeds into your own ‘AI algorithm.’” Except… it almost certainly does? Again, we’ve gone through this with the Google Book scanning case, and the courts said that you can absolutely do that because it’s transformative.

Relevant article: https://www.techdirt.com/2013/11/14/google-gets-total-victor...


That's ludicrous. It's counting words in a book. You can't copyright facts, and that is all the tool is doing. The pages that are reproduced are only excerpts, which falls squarely under fair use.

It's no different than you checking out the book from the library and counting all the words.


Copyright pertains to reproduction of the work. The statistics this tool provided are not reproductions at all. It did also provide quotes, which were not extensive and certainly not the entire work.


Limited portions can be reproduced in the derived work you are distributing. Summaries and statistics of the work are almost certainly fair use.


> The article itself is clueless… it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.

Going off of some of the tweets that initially whipped up the outrage about this… it's not like they were making a nuanced case about their concerns; they were basically just stomping their feet and shouting.


That's twitter generally.

If your engagement only reaches the level of twitter, you aren't really engaging at all.


So as long as that's all the engagement there is, we're free to ignore it and carry on, correct?


I would think so. If someone is shouting & stomping their feet in the public town square about my project, but I never go anywhere near the town square anyway, I don’t think I’m going to shut down my project. It’s just too bad the person who created this tool happened to walk through the town square.


I think you're fishing for a way to dismiss the concerns of the authors without understanding or addressing them, which is pointless.


> If you want to do this kind of thing, let authors opt-in (or publishers).

"This kind of thing" is factual information about the book, such as page or word count, ly-adverb count, etc. Small snippets, something permissible under copyright law today, that were heavily editorialized and commented on were displayed.

To suggest that counting words and pages is something that should not be allowed is silly.

> The article itself is clueless…

Says the person making stuff up to force a narrative.

The person doing this had the right to, and was very clearly within his rights under copyright law. Counting words is not a crime.


> The article itself is clueless… it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.

The authors' quotes speak for themselves. They very clearly and ignorantly claimed that this was an "AI training project" when it was nothing of the sort.


> it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.

https://twitter.com/scumbelievable/status/168915466478730444...

So the two authors who are gloating about having "killed that stupid fuckin AI thing" - I'm supposed to engage with their concerns? Please.


Statistical analysis is only useful if you have enough data to analyse, so there is in fact a threshold number of books to cross before the tool can even really exist. If you read his post, the initial goal was to get stats about typical word count, typical amount of passive speech, etc. Requiring opt-in for these broad statistics (through outrage only, since this project is CLEARLY legal in the United States) means that tools like this will never exist. Which seems net bad to me.

If you are saying it should be opt-in only for the pages analyzing specific books, like the ones the instigator of this outrage screen-shotted, well, that seems to fall squarely into the critical-analysis bucket, so that is also quite ridiculous.

I understand some folks being unhappy that a portion of the works were pirated, but it seems like most of the outraged would be outraged even if he personally purchased each and every ebook.

Also, if you read through the Twitter thread a lot of the authors (not 100%, but a LOT) are doing a really great job portraying themselves as "stoopid AI-fearful luddites". Many of them think the site is somehow like ChatGPT and they don't bother to dig any deeper, or really at all.


Yeah, the article represents the voice of the authors in two tweets, from authors apparently not notable enough to have a Wikipedia page. One I couldn't even find on Goodreads. It's obvious there's more to this than just the tweets presented. The article is unhelpful in this regard.


Jeff VanderMeer is not notable enough?


Personally, I have no idea who he is except some loud prick on twitter.


While I would agree in theory that a project like this would be best with opt-in, in reality that would just not work. Publishers would never opt-in to it, if they even respond to your requests at all.


Then don't do it?

Or, if you do it, do it privately and don't share it on the internet?

I'm not sure why this is a difficult idea; if asking for something and getting permission to do it is so difficult that 'would just not work. Publishers would never opt-in to it'

...then, it seems really obvious that even if you want to do it, can technically do it, and could maybe make a legal argument that doing it doesn't violate any laws...

...why would you do it? Why would you post about doing it?

Come on, that's literally being a selfish dick; spitting in people's faces and waving a 'too bad, you can't sue me' flag.

There are so many things, so many mannnny things that you could work on, why would you choose to pick something that you knew would upset people and you knew you wouldn't get permission to do if you asked?


Why ask permission to do something that doesn’t require permission? I see no more reason why an author should be upset about someone counting the words in their book & assigning sentiment than a builder should get upset about someone counting the # of bricks in a building and assigning subtle color shade differences to them. Neither the author nor the builder has lost anything by it.


Should I need the publisher's permission to write a review of a book? Personally, I find that idea abhorrent. This sounds like an interesting project, unambiguously protected under fair use doctrine, both as analysis and as transformative, and the authors got their knickers in a twist because they are scared of that which they do not understand.


Because copyright in fact is not that strict (Google Books does far more) and you don’t need to respect someone’s boundaries when they don’t have a legal right to those boundaries. Why should we sympathize with people who want far stricter control over the cultural commons?


Authors are not demigods, they don’t have a right to control the use of their works, only the reproduction.

When you publish a book you “consent” to the fact that people are going to take it apart, talk about it, review it, quote from it, and yes run statistics on it. If an author doesn’t want that to happen then they shouldn’t publish a book. Just keep it private, only distribute it to people you trust after they sign an NDA.

As far as anyone knows, no piracy has occurred. In the US you are allowed to scan books, index them, and post excerpts - it’s called Google Books and there was a big case that affirmed that it is legal. Downloading a book from a pirate website for the purpose of indexing by a computer program is not piracy, you have simply outsourced the scanning stage to someone else. It is only an issue if you download from some p2p protocol (such as a torrent) that also uploads and shares the book.


Because the authors were AI-fearful luddites. From "Book" to "Program that judges books" lies well beyond any argument that the use of the derivative work could supersede the original. It's such clear cut transformative use that the authors come across as grossly misinformed about copyright law as a whole.

Perhaps there is an argument for generative AI possibly superseding the original, in that people might start asking an AI to generate them stories "in the style of x" instead of buying the author's books, but this wasn't that. It was just some fun data analysis of books.


Summary: prosecraft.io counted word occurrences and presented statistics about them. I don't think you even need fair use for this, because this is something you are obviously allowed to do, without any permissions. This is not generative AI; this is old-school statistics.

And then it sometimes presented a page's worth of quoted text from a book, which should fall under fair use.

https://blog.shaxpir.com/taking-down-prosecraft-io-37e189797...


> I don't think you even need fair use for this

You shouldn't, at least for posting basic statistics. They're facts, not copyrightable.


> counted word occurrences and presented statistics about them. I don't think you even need fair use for this, because this is something you obviously are allowed to do, without any permissions

You're pretty much describing exactly what an LLM "learns" about text. I agree that it should obviously fall under fair use, but as the author of this article found out, there are quite a few who (very vocally) disagree.


I think there is a big difference in terms of data recovery, though. You can't take a compression algorithm, for example, and claim that it's "just some statistical analysis" when it can reproduce the original perfectly. Heck, even if it can reproduce it approximately, that's a lot different from what we see in this particular example, where the data could not be used to reproduce the text at all.
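(A toy Python illustration of that difference, with my own example text: a compressed blob round-trips to the exact original, while a bag of word counts is lossy by construction.)

    import zlib
    from collections import Counter

    text = "It was the best of times, it was the worst of times."

    # Compression: the original is fully recoverable from the artifact.
    blob = zlib.compress(text.encode())
    assert zlib.decompress(blob).decode() == text

    # Word-frequency statistics: order and structure are discarded, so
    # the original cannot be recovered from the artifact.
    stats = Counter(text.lower().split())
    print(stats.most_common(3))  # e.g. [('it', 2), ('was', 2), ('the', 2)]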


Generation of related text vs. analysis of human-understandable facts is very different in the minds of most people.

I think that using an LLM to get insights on the text should be OK; it's the generation part that scares them, probably rightly so.


Hrm. It seems like the authors are caught up in things like the "vividness" score and the "sentiment analysis" of the text; I guess because they're loosely related to AI?

But it seems like a bulk of the stats collected are things that I would find really useful. I've probably asked myself, "how many words are in this book" on 10+ separate occasions, both as a reader and as a writer.

It also seems like there were counts of things like adjectives, verbs, adverbs, passive verbs, etc -- stats that I might want to know about a novel.

The bulk of the service seems rather "boring" and non-AI. Unfortunate that the whole thing was taken down because of a few features. Hopefully it'll come back.


For this particular example, the tool doesn't seem like it's a big deal. It just analyzes works for data. I'm not sure how this would be any different from a literary critic doing the same thing manually.

In general, though, I think artists would be less hostile to technological innovations if the people imploring them to "figure out how to embrace the technology rather than fear it" weren't actively trying to destroy their livelihoods, almost always without the slightest interest in helping them figure out the new economic situation. The attitude is, "It's the reality now, deal with it," all while enjoying the job security and high salaries of tech jobs. You can see the same attitude displayed when it comes to piracy: "too bad, deal with it, I have a good job, I don't care if you don't anymore."

This stuff would be received far better by the creative community if AI companies were to, say, establish an artist sponsorship program, push for UBI, or otherwise show that they care even a tiny bit about the people they're making redundant.


I agree with you. There’s a pattern that I see a lot, of having:

1. large powerful players doing something not entirely helpful;

2. victims of that protesting that change vehemently; all that in vain because the players are powerful and have sheltered themselves from criticism, usually via lobbying;

3. regulatory capture or protests go after a smaller player, which is widely advertised to accuse 2. of going too far — even when the problem in 1. is still entirely there, and now ignored.

It's definitely the case with globalization (large conglomerates benefit, people protest, and a small artisan who started selling abroad is featured being victimized by tariffs), fossil fuels (the large oil extractor, the climate advocate, the farmer seeing fertilizer prices go up), immigration, American cultural hegemony, car dominance over cities, etc.

That pattern allows larger players still doing harm to wash their morals. I feel like we need better antibodies to say: No, this does not absolve them.


> if AI companies were to say, establish an artist sponsorship program, push for UBI

Sam Altman, for all his faults, is actually a massive proponent of UBI. I mean, that was one of the claimed objectives of Worldcoin (though he advocates for UBI in general: https://thewalrus.ca/will-universal-basic-income-save-us-fro... )


I am defining "push for UBI" as "actually do something to pressure the government" and not just state that a for profit business you've established is trying to accomplish that goal.


I will admit that I am mildly confused by this outrage, but it is X/twitter so the standards are different.

All that said, I remember doing basic text analysis in college and then sentiment analysis in my MBA class... Is the concern out there because of how the source material was acquired?
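(For context, the classical sentiment analysis taught in such classes is often just lexicon counting. A toy sketch of my own in Python, with an invented lexicon - nothing like a production tool:)

    # Toy lexicon-based sentiment scorer: average the valence of known
    # words and ignore the rest. The lexicon values here are invented.
    LEXICON = {"happy": 1.0, "love": 0.9, "sad": -0.8, "hate": -1.0}

    def sentiment(text):
        words = text.lower().split()
        scores = [LEXICON[w] for w in words if w in LEXICON]
        return sum(scores) / len(scores) if scores else 0.0

    print(sentiment("I love this happy story"))  # 0.95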


Fully agree.

Not an artist myself, but this basic assumption in tech that you can just take somebody's shit without informing them, without permission, without compensation, without basic due diligence, and then go do whatever the hell you want with it needs to stop.

For the artists' sake but also for tech's sake. This model can't work, it's a complete dead-end that will wipe out livelihoods and culture.

But I can assure you artists can/will be equally hypocritical themselves. Surely they've pirated things themselves, removed paywalls from articles, blocked ads, borrowed the neighbor's Netflix account.


I think it applies to many technologies other than generative AI. How many devs actually think about ethics nowadays? I think it's all lost in the big companies they work for, behind the excuse that "it is not their job to figure out how their work is being used".

Interestingly, I think most devs would think twice before being paid for designing a missile. But somehow they don't really seem to think about the impact of work that is not obviously a weapon. Social network, Stable Diffusion, ChatGPT, SpaceX... everything disruptive has the potential to be very bad (I see a lot more harmful use-cases for ChatGPT than legit ones, but maybe that's just me). But somehow engineers seem to believe that it is not their problem.


Absolutely, and I think the recent Oppenheimer movie was an excellent take on this exact subject. At some point, you don't get to throw up your hands and say, "technology is just neutral” and absolve yourself of any responsibility for what you’ve put into the world.


My summary of the case: Someone did statistical analysis of a bunch of texts and created a tool that evaluates your text according to the developed model. Writers accused him of plagiarizing/using the content of their works.


As an aside, this would be completely legal in Japan, as classification and statistical analysis are protected as fair use.

I wonder if similar language exists in other copyright systems, but I would imagine it is likely the opposite...


This was unambiguously fair use under American copyright law, too.


Are generative models also covered by that clause?

Because one thing about the generative models is that you could in theory get the model to recite copyrighted work, word by word.


> one thing about the generative models is that you could in theory get the model to recite copyrighted work, word by word.

Only if it's so badly trained as to be considered broken. It's not fair or smart to judge an entire method (LLMs) by some broken examples. I wouldn't claim "all cars leak oil" just because I saw a broken-down beater one day.


Something that we need to learn is that these brief outbreaks on social media burn themselves out pretty quickly. Everyone shouts for a bit and then moves on to the next bit of manufactured outrage.


I agree, people get angry at something for a day and then the timeline tells them to be angry at the next thing.

I always feel bad for people who cave to the mob; usually, if the mob is yelling at you, you're on the right track.


I’m with the artists on this one. Our obsession with converting everything into input for an algorithm that spits out an ill-defined number (what the hell is “vividness”?) needs to stop.

We already tried this with human communication and gave birth to the dystopian nightmare that is social media, why keep repeating our mistakes?


I kind of don't understand the issue - IANAL, so I'm not going to delve into the legality of things, but I think making automated book recommendations better absolutely fits the bill for 'transformative use' - as in, book recommendations are in no way a substitute product for books themselves.

And personally, I think book recommendations are an absolutely underserved market. If I liked a book, having the ability to find more like it would be an absolute godsend for connecting authors with people who would be interested in their works, resulting in many more potential sales for them.

I can't count how many times I have accidentally discovered an absolutely great book on Amazon with like 50 reviews, as well as other, objectively less recommendable books that have nevertheless made an impression on me.

Discovering these books is sort of a hobby of mine, and is the exact kind of activity an LLM would be a great help with.

Going further, if there was an LLM that could be asked for book recommendations for your particular tastes, it could also identify markets for books not yet written, and would give a hint to authors on what sort of books to write to find an audience.


> And personally, I think book recommendations are an absolutely underserved market

I haven't read about the industry in years, but isn't it the case that the job of "book recommendations" is essentially the publisher's job? They unironically try to sell you more than a book. An algorithm would threaten their worth.

(There are, of course, other useful functions like publishing and the irreplaceable editors, but neither requires the capital strength of marketing.)


Stop with the "this is good, I want more" Skinner-box model of happiness. Try some serendipity instead of being led by a generic algorithm.

I discovered absolutely great books by moving slowly along the shelves of a library or a bookshop.

And you need to read bad books to understand the great ones.


The whole point of this site is for people to express their opinions, and torginus took the time to write a thoughtful comment about LLMs and how they might help both authors and readers.

As for your point about serendipity, torginus never said that he didn't wander book stores and libraries looking for books he wouldn't have been previously exposed to.

Based on the post, I'm sure they understood the basics of reading a variety of books, both good and bad -- there is no need to get judgemental.


Are you seriously telling someone else how they should enjoy something?


That's silly. Humans review books all the time, using very similar words. Where's the outrage over that?

These are manufactured, stretched, overhyped objections. I believe it's all as the OP suggests: because the word AI is in there, not because anything illegal or immoral is going on. In fact it's a terribly useful tool, and once the mob cools off it'll likely return.


You are exactly modeling the chauvinistic Silicon Valley attitude that is causing the outrage in the general population to begin with.

“Our algorithms are pretty much the same as human art criticism, so put down the pitchforks you unenlightened scum” is up there with telling them to eat (a Stable Diffusion generated picture of) cake.


> You are exactly modeling the chauvinistic Silicon Valley attitude that is causing the outrage in the general population to begin with.

Just like the writers he talked to and got positive feedback? Everybody not agreeing with you represents "chauvinistic SV attitude"?


(Edited)

No, he didn't say anything about them. People side against their interests all the time; finding a few writers that like this is trivial. Are those people the majority opinion on this, or are we just trying to prove how wonderful this technology is?


I'm assuming you read the article.

Let's recap:

> I launched the prosecraft website in the summer of 2017, and I started showing it off to authors at writers conferences. The response was universally positive, and I incorporated the prosecraft analytic tools into the Shaxpir desktop application [...]

And he goes on mentioning that some authors even reached out to him to get their books added.

Unless you are accusing him of lying or unreasonably overstating the response he got ("universally positive") - and I really don't see any indication of that - a statement like "finding a few writers that like this is trivial" is not a good-faith engagement with this topic/conversation.


There's no way to quantify the sample size of writers based on his claims, so within the bubble of his experience I'm sure it's correct, but it's not useful to base an argument on it that writers at large are on board with this. And as for good-faith engagement, your response to the parent…

“Everybody not agreeing with you represents "chauvinistic SV attitude"?”

…wasn't in very good faith either, as it's unclear whether the writers share the same belief as some tech people that AI and humans doing stuff are the same, and use that idea to further a pro-AI agenda, as opposed to them just finding a useful tool to incorporate into their workflow regardless of the underlying technology or politics. Your response assumed the former and paints the parent poster as wrong based on your assumption. Some writers liking the tool, just like some artists liking Stable Diffusion, doesn't invalidate the original criticism or imply their ideology.

Indeed, my experience jibes with what he said. Many AI people I've seen comment are very much "adapt or die" when it comes to AI technology, suggesting that writers/artists must (even if begrudgingly) use these tools to stay competitive, and they see many datasets as fair game even when their authors are against their inclusion in said datasets, as with the author of this article.


There's no outrage in the general population. Just from a minority that is just as small as Silicon Valley.


Counting the ratio of nouns to verbs in a novel is an algorithm, and I think it's one of the most basic examples of what the thing in the article does, if I understood it correctly.

But I guess there would also be people up in arms about this.
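(For reference, that noun-to-verb counting really is a few lines with off-the-shelf tooling. A sketch of my own using NLTK, assuming its tokenizer and tagger data have been downloaded; nothing here is specific to prosecraft:)

    import nltk
    # One-time setup, if the models are not already present:
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

    def noun_verb_ratio(text):
        tags = nltk.pos_tag(nltk.word_tokenize(text))
        # Penn Treebank tags: nouns start with NN, verbs with VB.
        nouns = sum(1 for _, tag in tags if tag.startswith("NN"))
        verbs = sum(1 for _, tag in tags if tag.startswith("VB"))
        return nouns / max(verbs, 1)

    print(noun_verb_ratio("The old man fished alone in a skiff in the Gulf Stream."))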


Do you even know the meaning of chauvinism? Because this is literally the opposite of chauvinism. They aren't stating their view is superior; they just want it to exist.


People starved while it was suggested they eat cake. Not sure how that relates - are the rights around art crit not the same as AI crit?


Of course they were never actually told to eat cake. 1) The actual French quote referred to brioche, a type of bread, and it was mistranslated as "cake" because brioche wasn't common in English-speaking countries. 2) It was never an actual suggestion -- the French philosopher Rousseau was making a sarcastic remark suggesting that if the people didn't have bread, they should eat brioche (a fancier kind of bread) instead. But for some reason in pop culture this was falsely transformed into an actual suggestion by Marie Antoinette.


Glad to see someone else mention the falsehood of the original quote.

It's actually really fitting to see that (mis-)quote used in the context of this outrage since from reading through the original vitriolic Twitter thread it's clear that many of the most outraged are incorrect about what the product does.


It doesn't really matter to the idiom - what is understood by the reference is what it means.


Or talking about humans as "just another matrix", so how dare they not want to offer their artwork for new models.

This doesn't even touch on how developers decided to let people lose their jobs. People are angry because they are worried about losing their jobs.


Or AI-generated Soylent.


> That's silly. Humans review books all the time, using very similar words. Where's the outrage over that?

Easy: humans are not machines. "X does it all the time, so I should be able to do it" is never a valid conclusion. It depends on the situation.

> In fact it's a terribly useful tool, and once the mob cools off it'll likely return.

Maybe this tool in particular does not "abuse" the books. Maybe this tool in particular is terribly useful. But you can't blame authors and artists for taking a stance against those new algorithms that provably have the potential to automatically "steal" from their work. You can believe that asking ChatGPT to "write a novel in the style of X" is not abusing the copyright, that's fine. And the authors can answer that they fear it has the potential to break their source of revenue to a point where they won't want to publish anything anymore. And they are entitled to it. And maybe someday we come up with licenses that prevent the use as training data (how in the world could one conclude today that "it is most definitely fair use", given that this is a very new way of using IP material?).


That was the accusation, and it was misplaced here. So we agree, this is a smear campaign in this case, not a sensible reaction to a reasonable application of machine algorithms.

The idea that counting adverbs is stealing their work to the point they won't want to publish anymore is clearly FUD. As my remark made clear.


> The idea that counting adverbs is stealing their work to the point they won't want to publish anymore is clearly FUD.

I did not mean that; I am genuinely not sure if you rephrased my point to make it sound wrong or if you missed it.

My point was that, IMO, it does not matter to the authors whether counting adverbs is stealing their work or not. Probably if you counted them manually they would be fine (and most likely they were fine before generative AI).

What matters to them is that generative AI is trained from their copyrighted material, and they fear it (I would, too).

The day people stop reading my blog because they can just ask ChatGPT and will get something generated (partly) from my material without any kind of attribution, I can promise you I will stop my blog.


This project was not generative AI. Comments are saying this project, which is not at all similar to generative ai, seemed to be okay. But you keep replying to say essentially “but if it was generative ai then authors have a legitimate reason to be angry”.

There is no need to shoehorn that debate into this particular situation, and I see no merit in defending authors that had a knee jerk reaction to this project on the grounds that they have reasonable fears about other types of projects.


I think it is not completely off topic. Here is how I see it:

Engineers tend to globally think that LLMs are not really a problem for copyright holders. At least those who develop LLMs pretty clearly don't give a damn. And on top of that, it is in their interest to not be constrained by copyrights.

If this is my feeling (that engineers globally don't care about copyright holders), then it seems reasonable to me that non-engineers could feel the same. That sounds fair, doesn't it?

So those people start speaking up when they see a situation where they feel like "it is happening". And because they don't really know the technology, it is hard for them to know if this particular case is a problem or not. And they can't really trust engineers to tell them, because engineers built LLMs in the first place, and really it does not seem like they care about copyright holders.

Finally, engineers see this reaction from authors, and instead of trying to understand where they come from, they dismiss their opinion. Which probably will reinforce the feeling that engineers don't remotely understand the concerns of those people, and keep building their AI-powered laundering machines. Again, engineers working on those technologies in big companies have absolutely no interest in even considering that it is a problem. Because they get a big salary to help their big company get more profitable, even if it kills many jobs and is a net loss for society (because they benefit from that).


To rephrase in my own understanding of what you wrote:

1) Some engineers (or more broadly, software developers) do not respect copyright

2) Therefore you are reasonably skeptical of projects related to material under copyright.

3) It is not always obvious if a project is respectful of copyright.

Now, applying #1, #2, and #3, you believe they justify the outrage for this particular project.

I disagree, because outrage combined with a lack of understanding (#3) is pretty much my definition of a knee-jerk reaction and vastly counterproductive to the interests of copyright holders because it will make the dismissiveness you predict a self-fulfilling prophecy.


> you believe they justify the outrage for this particular project.

No, I believe it explains it.

> it will make the dismissiveness you predict a self-fulfilling prophecy.

That's the thing: both parties need to listen to each other. The problem here is not this particular project, but the fact that we are not addressing the bigger concern which is LLMs.

IMHO, it is completely useless to try to solve this particular case, because it will happen over and over again. We need to address the LLM issue.


If you want to be a pitchfork mob against generative AI, at least understand whether the AI is generative or not? Seems like a reasonably low bar. This was non-generative AI: it didn't produce content, it output metrics and labelled some existing content.


What makes you think that I don't understand whether AI is generative or not? What I said was that for artists who are complaining about their copyright being abused, it does not matter. 10 years ago they were not complaining, because AIs looking like ChatGPT (to users who see it as a black box) did not exist (or were not remotely as powerful).

And I understand that. It is not their job to learn how the black box works. What they see are "machine learning models" (which they probably call "AI" now), which are complete black boxes to them (and that's justified: engineers who train them also don't know exactly what they do, but rather test their model on some dataset and judge it from there). And those black boxes are being trained on their copyrighted work and have the potential to generate a ton of money which they will never see.

You can go and say "you guys should learn how the technology works instead of complaining", but let's be honest: probably you are not an expert in AI yourself, and anyway why would the artists have to care? It is a totally legit question that they have: "Why can engineers take my copyrighted work, run it through an algorithm that does stuff no algorithm has done in history at a scale never seen before, make money out of it, and not even consider that maybe they are abusing my IP?".

Before dismissing the artists, you should try to understand their point of view.


I would disagree. Just because you don't quite understand something doesn't mean your concerns are not worth consideration - consider the recent Zoom TOS issue. I doubt that many of us have a deep understanding of how that data is being used, or the internal guidelines that Zoom follows for its data use, and most people aren't lawyers specializing in IP law who know exactly how the law would treat Zoom if they were to accidentally (or "accidentally") leak IP. We just see that they are putting in a clause in their TOS to allow themselves to do so, remember our own heuristics of how LLMs have behaved in the past, and understandably start raising questions.

For all we know, Zoom's AI might be constrained to a framework which doesn't allow for such data leaks to occur, or its generative capabilities might be constrained in some other way. They're just demanding legal permission to do so, but that still rubs a lot of us the wrong way. Our concerns are still justified, even if Zoom never actually touches AI.

Artists lack heuristics as concrete as the technical crowd's. But they still have concerns that need addressing, and those concerns about the effects of AI should still be considered and respected. If the details of the situation don't match their concerns, care should be taken to explain how they don't match to the people in question, in a way that isn't looking down on them (admittedly, trying to be the calm voice is often a waste of time on the internet). That said, if you were to make an informational video which succinctly summarizes the technical details that are relevant to artists, it might become sufficiently popular to influence debate.


(to clarify, this is a response to skjoldr's comment)


> It is not their job to learn how the black box works

If you have not learned the basics of how something works, you have no right for your opinion on it to be considered valid. Period.

Invalid opinions do harm to democracy and endanger our way of life.


> you have no right for your opinion on it to be considered valid. Period.

That is so wrong it is actually dangerous. Do I need to understand how a nuclear bomb works for my opinion on it to be considered valid? Obviously not. I only need to understand the consequences of it. It does not matter at all how it works, if I am against the fact that it will kill a whole lot of people.

> Invalid opinions do harm to democracy and endanger our way of life.

And engineers have done much, much more to endanger most living animals (including humans) than authors and artists: technology is the reason for the mass extinction we are currently living through, and for the problems that are coming with climate change. Maybe it's important to start thinking about the consequences of what you do, not only the technicalities of how you do it. And maybe it's high time you started listening to people who are able to think about the consequences of what you do (maybe they understand that better than you do, ever thought of that?), even if they don't know how to do it.


You can of course have any opinion you want. But this is not just about the authors having an opinion. It's about them starting a harassment campaign based on faulty facts while making no attempt at verifying them.

If we work from the nuclear bomb analogy, you certainly don't need to be a nuclear physicist to protest nuclear bombs. You just need to have a reasonably correct high-level understanding of the impact of a nuclear bomb. But that's not what is happening here. This is more like storming the Belgian embassy to stop Belgium from using their nuclear arsenal to trigger a chain reaction in the atmosphere: totally detached from reality in every aspect.

As far as I can tell from your messages on this, you think that the harassment was entirely justified. Is that correct?


> totally detached from reality in every aspect.

I don't think it is totally detached from reality. I believe that engineers are generally pretty bad at realizing the impact technology will have on society. There are many concerns with generative AI in general: it can potentially "break the Internet" (by finishing the job of breaking search engines, which already struggle with SEO), or maybe democracy, who knows? Copyright is one such problem.

> you think that the harassment was entirely justified. Is that correct?

I honestly don't know how far it went. What I saw in the article is a few authors who wrote online that they wanted their book removed from that software. Not sure if it is closer to harassment or to lobbying.

What I see, however, is many comments of engineers who don't see the problem with copyright and who don't seem to understand why non-engineers may be against this technology, or why one would even think about forbidding a technology ("but technology is neutral"). My point is just that those engineers should maybe take a step back and try to reflect on that "technology is neutral" belief.


You cannot know the consequences of something if you do not know how it works. Case in point: nuclear reactors. If you do not know how they work, what are their potential dangers, and how they are mitigated by smart design, you do not have a moral right to protest against them. Simple as. Understanding the risks and consequences equals understanding the system in question. Always. This also applies to nuclear weapons, if you do not understand MAD and how they keep other powers in check, and you never had a true threat brief that would explain what exactly nukes are a deterrent against, you just aren't entitled to an opinion on them. Especially one as simple as "it can kill people so I don't want them". This is an invalid opinion, sorry.


If you literally have no idea what a nuclear bomb does, i.e. don't know that it explodes, releases massive amounts of heat, or can kill many tens of thousands of people at once, then no your opinion should NOT be considered valid.

Understanding the consequences of something is a PART of how it works. Since you understand that it can kill a whole lot of people then I'd say you have passed the incredibly low bar.

In this case most of the authors do not understand the consequences of the tool; they think it will generate convincing text that sounds like them, or that it is serving pirated copies of their books (sourcing that from the original Twitter thread that I unfortunately read a lot of).

This doesn't seem like the thread to debate whether technology is a good thing, but I can't help but call this assertion ridiculous. Technology is responsible for almost every single good thing in the world today.


> In this case most of the authors do not understand the consequences of the tool

Because you do? That's my point: engineers believe that because they have some understanding of how machine learning works (and in my experience, usually it is very limited...), they can conclude that they understand the consequences of it. Simple example: the Facebook "like" function, which was supposed to be positive ("oh nice, I got likes") but actually increases addiction and is mostly negative ("oh no, why did I not get likes?"). Clearly those who implemented the first likes had not realized what consequences they would have.

> Technology is responsible for almost every single good thing in the world today.

If you have a very limited view of the world, I guess it could be. I like trees, flowers, bees, birds, mountains, snow. Can you tell me which ones come from technology? Let me help you: most of them are threatened with dying this century because of technology. For most living species, every single technological improvement is bad news. To the point where it is now globally becoming bad news for humans, because it's quite likely that we will get into global instability, wars, and famines in the next few decades because of technology. Think about it when we start having billions of climate refugees, and think about how you were dismissing opinions contradicting your beliefs based on the fact that you understand some implementation detail.

But let's even ignore the fact that the next few decades will most likely get pretty bad for us. It is true that right now, we live longer, we have more food (and obesity problems), and we can cure many diseases that we could not in the past. Does that mean we are happier? Happier than whom? Vikings? Ancient Romans? Ancient Greeks? That question seems closer to history and philosophy... why does your opinion count, then? Are you a historian or a philosopher?


Because fair use allows transformation, and the output of their algorithm looks nothing like the input of the copyrighted work? For generative models it's more complicated, because they can actually reproduce large sections of a copyrighted work, so the transformation is less clear.


> Because fair use allows transformation and the output of their algorithm looks nothing like the input of the copyrighted work?

I feel like you miss the point of a law. You seem to read the law, and say "well, the law says X, new technology Y is compatible with it, so that's legal, everyone is happy". But that is wrong. The law reflects the society we want. Do we want a society that completely kills creative work because Big Tech found a loophole to launder their IP? I guess we all agree that we don't. It is not clear if LLM is that loophole, I agree. But you seriously have to take a step back and think about that. What if it does? Then we may have to redefine the meaning of "fair use".

Maybe this particular software was not a danger for those authors. But they don't know that. And given that most engineers talking about LLMs don't seem to remotely understand how one could be worried about them, I understand that the authors start speaking up wherever they can. Because clearly it does not seem like those who build those systems give a damn about copyright holders.


I think the actual issue is nuanced and complicated. It's fairly clear the tool in question, which was non-generative AI, is the kind of thing we want to allow under fair use. Whether we want to allow generative AI is more complex; I'd lean towards requiring a license because of non-deterministic duplication. Fair use is an important part of copyright law, and we should be very cautious about eroding it. For example, I like Green Day's transformation of the Scream Icon and think it was substantially different enough that it should be allowed. The courts agreed under the current transformation doctrine, but if we weaken protection for transformation we likely reverse rulings like that one as well.


They… quantify the number of adverbs and voices? I'm sorry, but have you ever read either a book or a review of one?


Yes, and they have 1000 times the spoilers and quotations and judgemental attitude of ... a summary of adverbs and voices.

So yes, I understand what a review is, thanks for the put-down, that certainly added something to the conversation.

I think we are in agreement: doing statistical analysis on written works is an entirely lesser thing than a simple review, and is harmless.
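
For the curious, here is a minimal sketch (in Python, with NLTK) of the kind of statistic at issue. It is purely illustrative, since the site's actual pipeline was never published:

    # Count adverbs via part-of-speech tagging. Illustrative only; not
    # necessarily the site's actual method.
    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    def adverb_ratio(text: str) -> float:
        """Fraction of tokens tagged as adverbs (Penn tags RB, RBR, RBS)."""
        tokens = nltk.word_tokenize(text)
        adverbs = [w for w, tag in nltk.pos_tag(tokens) if tag.startswith("RB")]
        return len(adverbs) / max(len(tokens), 1)

    print(adverb_ratio("She quickly and quietly closed the extremely old door."))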


"It's a mistake." OK, you could be right.

"Needs to stop." OK, you could be right on that one too. I don't think you are, but that's not the point.

Neither of those adds up to "it's currently illegal". (Whether it's actually illegal probably depends on the details of how he did what he did.)

Further, neither of those things adds up to "the howling mob should attack him until he stops". (Even if the "attacks" are purely online.) I am against "attack him with outrage dialed all the way up to 11 without actually understanding what his tool is and does". I am also against giving in to the outrage - it just shows the mob that baseless outrage attacks work.

You think it needs to stop? Fine. Persuade him that it needs to stop, and therefore that he should stop. Convince him - not with a mob screaming in outrage, but with reason.


As someone who has published two novels: The outrage over this site was stupid, ignorant and a demonstration of a witch-hunt that will help nobody.


If you’re an author of books intended for children, your texts are likely already being quantified to produce a reading level difficulty score:

https://metametricsinc.com/parents-and-students/lexile-for-p...


Honestly, this is the really offensive part of the article. Who cares whether or not it's legal; the idea that it's in any way, shape, or form useful is bafflingly laughable.

Not everything can be meaningfully quantified. Not everything needs to be.


Certainly something interesting is bound to come out of quantifying things? "Hm, this three act structure thing seems to work, I wonder why." "Children don't seem to understand texts which include these words, I wonder why."

Patterns rarely show themselves before we investigate.


> Certainly something interesting is bound to come out of quantifying things?

In science they call this trap P-hacking. Even data "scientists" know to be wary of overfitting. We're really good at finding patterns, but few of them actually mean anything.


>> Certainly something interesting is bound to come out of quantifying things?

> In science they call this trap P-hacking. Even data "scientists" know to be wary of overfitting. We're really good at finding patterns, but few of them actually mean anything.

Quantifying things is not always p-hacking. When people do experiments on novel materials or structures they quantify the data, take readings and record them, and then look for patterns. For example, measuring the electronic properties of a novel nanostructure or molecule.

When I think of p-hacking[1] I think of using the same static data and doing various data analyses over and over again until something potentially interesting is found, ignoring the risk of false positives as you do so (the toy simulation below makes this concrete).

[1] https://en.wikipedia.org/wiki/Data_dredging
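
A toy simulation: test enough arbitrary hypotheses against one static dataset and "significant" results appear by chance alone. Plain Python; all numbers are illustrative.

    import random

    random.seed(42)
    data = [random.gauss(0, 1) for _ in range(1000)]  # one fixed null dataset

    hits = 0
    for _ in range(100):
        # Each "hypothesis" is just a random subsample; none reflects a real effect.
        sample = random.sample(data, 30)
        mean = sum(sample) / len(sample)
        z = mean / (1 / 30 ** 0.5)  # crude z-test against true mean 0, sd 1
        if abs(z) > 1.96:           # nominal p < 0.05
            hits += 1

    print(hits, "of 100 null hypotheses look 'significant'")  # ~5 expected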


> Not everything can be meaningfully quantified. Not everything needs to be.

Ok, so who decides what's OK to analyze or not? Is there some obvious moral line I fail to see, that everyone would immediately agree on?

It seems the project was about analyzing books, not about producing new books. How is that hurting the authors?


What will hurt artists is, when in 10 years, all publishers are demanding that the vividness score (TM) be at least a 95% “because that’s what drives sales”.

Which is what will happen if the authors don’t proactively stop it from happening. Look at how the music industry has evolved over time.


How is this different from all the vampire novels that hit the shelves after the success of Twilight? Publishers have always preferred the money makers; just the measure changed.

Nowadays writers can at least publish their books without needing a publisher, and I think some like the help of the bad Silicon Valley stuff that made writing, publishing and interacting with readers easier.

I'm on your side if it's about automatic content creation and style copying, but text analysis is not the real danger. Especially when the usefulness of such statistics isn't even established.


> publish their books without the need of publishers

Except those are very likely to be me-too vampire novels. And lately, LLM-generated.

I'd argue that, on the contrary, the role of the publisher as a curator will only become more important in the future.


But publishers will have to deal with a lot more content thanks to LLMs.


Or it could help me find terser books I like. People will still have preferences, and if an author tries to pander only to the largest market segment, I'd argue that's on them.


I think it’s much more likely you would get the book equivalent of crap SEO sites spammed out to satisfy numerical measures of quality.


How is this different to the current process, other than feedback is slower (if forthcoming at all) and less specific?


> How is this different to the current process, other than feedback is slower (if forthcoming at all) and less specific?

Let me rephrase your question: "how is it different to the current process, other than <the fact that it is different>?" :-). I would say that the answer lies in the question.


Sounds as though your view of the AI is purely positive, in that case. That's fair enough. The answer for other people may well not lie in the question (e.g. for all the people who don't like this development), but it did for you!


Sorry I did not understand that :-).

My point was that it is different: when humans read a book, they don't train a machine learning model. They can't read as many books as a machine, at the same speed, and they can't remember nearly as much as what a machine can.

Humans and computers are fundamentally different, and it matters. You can't conclude that because something works for one, it will work for the other.


> Sorry I did not understand that :-)

You seemed to be saying that the differences I listed (quicker and more specific feedback) were the only differences. Those are both positive.

I was saying that some people may think there are negative differences as well.


Right. Yeah I did not express myself clearly, sorry :). You were saying "how is it different other than X and Y?", and I wanted to say that X and Y are already enough for me to consider them different.

I am actually on the side that LLMs are a big problem for copyright, and I don't want my code and blog posts to be used in their training dataset without my consent. To me, at this scale, it's not fair use. IMO it's a bit like if Facebook said that it is fair use to leverage metadata about their users, because "someone who sees you in a public space talking to a friend knows that you are talking with that person, and it is the same for Facebook on social media". My problem is not that Facebook knows that I sent a message to a friend now, but rather that they know who writes to whom and when, at scale.

Similarly my problem is not that somebody could read my blog post, learn from it, and write another blog post. My problem is that LLMs automatically train on all written material they want on the Internet, at scale, and without acknowledging that all that material has a lot of value (and is copyrighted).

I think fair use should somehow consider the scale.


the difference is that a machine analysis is necessarily limited and can't account for all the factors that make a text interesting. so it is possible that this analysis rejects texts that would not be rejected by a human.

it is objective but potentially biased. and it could even be discriminatory if the input for this tool isn't diverse enough. but these are issues that can go wrong with any use of technology, and we have seen many examples of that happening. however, i don't think it is problematic if writers use it to analyse their own texts in comparison. it is a serious issue, though, if publishers use it to decide what to accept.


Again, I don't particularly care about whether this is allowed to exist; I'm just here to laugh at the mindset that led to it being created. But sure, I can see this being used in harmful ways.

> It seems the project was about analyzing books, not about producing new books. How is that hurting the authors?

"Vivid books are really in this year, we're gonna have to ask that you aim for a Vividness(tm) of 85 or above."

"US books have 15% more adjectives, clearly this is proof of our superior detail-oriented work ethic!"

"What does the rise in Emotion(tm) have to say about the decline of society?"


So if I understand you correctly, you're saying that we should not create "metrics" for anything because said metrics could be misused by clueless people?


The analysis is cool. The problematic thing is what would have happened next, if this tool turned out to be any good.

Publishers rejecting manuscripts because "this year's trends show customers are looking for vividness in the 70+ percentile, your book is only at 55". Everything becoming the same style. If you thought Hemingway, Joyce or Nabokov had it bad with rejections, there'd be zero chance for actually innovative writing to break through the walls of The Algorithm.


Joyce should have had more rejections, but that’s just my personal opinion


> Not everything can be meaningfully quantified.

Sure, but written words _can_ be meaningfully quantified. We have been doing that for thousands of years: numerology and other mystical/religious practices, poem metrics, stylometry, cryptanalysis, and stroke counting, to name a few (see the sketch below for how little machinery stylometry needs).

> Not everything needs to be.

Why not?
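
On the stylometry point, a minimal sketch using only the Python standard library: represent a text by the relative frequencies of common function words, the idea behind classic authorship attribution such as Burrows' Delta. Illustrative only.

    from collections import Counter

    # Ten common English function words; real stylometry uses hundreds.
    FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "was", "but"]

    def style_vector(text: str) -> list[float]:
        """Relative frequency of each function word in the text."""
        words = text.lower().split()
        counts = Counter(words)
        total = max(len(words), 1)
        return [counts[w] / total for w in FUNCTION_WORDS]

    # Comparing two texts' vectors (e.g. by cosine or Manhattan distance)
    # is the basis of century-old authorship-attribution techniques.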


> Honestly, this is the really offensive part of the article.

I would argue that "Offensive" is either hyperbolic or you've used the wrong word.

> the idea that it's, in any way, shape, or form, useful is bafflingly laughable.

I don't know if it's useful because I never tried it. I might harbour my doubts but I'd like to find out. This is how I approach new things.


If you don't find it useful, don't use it. But why get outraged about something that others find useful? It's clearly a tool that other writers were positive and excited about. Why not let them have it? If you don't find those quantifications meaningful, so be it. You don't need to use it. Why force your opinion on others?


Simple. Just allow an opt-out for authors or publishers. Then only interested parties will be included in, and make use of, the service, like you want.


As the article stated, there is nothing either legally or morally wrong with what the site did, and many authors found it useful. Let us know when you come up with an actual counterargument based on reason instead of an appeal to emotion fallacy.


I probably agree, but how does this have any relevance to copyright? If the tool is bad but otherwise legal then it should just fail on its lack of merit.


There is a difference between a statistical analysis of a text, to categorize it by certain words or word groups, and training an AI model on that data to generate texts.

The latter creates massive competition for human writers; the former is just information for potential readers.


Both the former and the latter are information for writers. Neither creates massive competition for writers (not that there is any law against creating competition), just FUD and better tools for writers.


Things like GPT already create competition for authors, even using their names.

https://news.ycombinator.com/item?id=37042561

Pure text statistics won't do the same.


Wrt. your link, the same thing could have happened to this author if those spam books contained complete gibberish (and someone listed them on Amazon/Goodreads using the author's name). This isn't legitimate competition (i.e. books written by LLMs that rival the quality and style of the actual author). This is a failure of the selling platform to QC the books they are selling.


LLMs make the scams better just like they will make spam better.

If it's gibberish you know you got scammed, LLM texts look convincing so you don't know for sure.


I agree LLMs can make better spam. But good spam isn't real competition. It's not like anyone is debating whether they should buy the latest book from their favorite author or the latest book from their favorite author's clone LLM (which is known to have written some solid books). Again, this is an issue that needs to be solved by vendors (it seems like all they need is a system where authors get a copyright to their name, and can curate the list of titles published under their name).


That's partly true for established authors, but how many readers will get fooled into thinking those books are real? See these 11 Harry Potter books from China:

https://11points.com/11-amazing-fake-harry-potter-books-writ...

It's worse for new authors: they disappear among all the AI authors.

Publishers and readers will have to search a bigger haystack to find the needle.


I agree that it's probably not that useful, but to actually take offense? The outrage seems to misunderstand the law and the technology. If you think the numbers offer no meaning, then just ignore them. People produce bad tools every day and the world still turns.


I'm disappointed he went for "vividness" and not novelty. Judging text based on how uncommon its n-grams are / how much it differs from an LLM could be interesting for sure.

The better an LLM can complete your joke the worse it is, for instance. Important to have a good Letterman-MacDonald quotient.
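
Here is a rough sketch of one way such a novelty score could work, counting the fraction of a text's trigrams that never occur in a reference corpus. The corpus is an assumed input; with an LLM, the analogue would be perplexity, i.e. the model's surprise at your text.

    def trigrams(words):
        """Set of all 3-word sequences in a token list."""
        return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

    def novelty(text: str, reference_trigrams: set) -> float:
        """Fraction of the text's trigrams unseen in the reference corpus."""
        grams = trigrams(text.lower().split())
        if not grams:
            return 0.0
        unseen = sum(1 for g in grams if g not in reference_trigrams)
        return unseen / len(grams)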


> I’m with the artists on this one. Our obsession with converting everything into input for an algorithm that spits out an ill-defined number (what the hell is “vividness”?) needs to stop.

Usefulness is immaterial here.

Is he allowed to do this? Yes.

What's wrong with presenting a page count and word count, for example?


> I’m with the artists on this one. Our obsession with converting everything into input for an algorithm that spits out an ill-defined number (what the hell is “vividness”?) needs to stop.

Anyone who is with the artists should pass a law. Moral outrage is not law.


And promptly smack face first into the First Amendment. There is a reason they are going with moral outrage. Because they know they don't have the right.


The guy who wrote code is also an artist, and he is allowed to publish his book reports.


Well... We still can not agree whether...

Technology has to be protected from dumb people, or dumb people have to be protected from technology...


If anything, it’s the smart people that need to be protected from technology, because Silicon Valley is obsessed with pulling them down into a Harrison Bergeron nightmare where they’re absorbed into the same modeled probability distribution as the rest of the population to better sell them ads (outliers are bad for profits.)


What do you even mean by this comment? Have you considered the possibility that people are smart in ways that you are not considering, rather than just labeling it “dumb”?


Have you considered ... in ways that you are not considering...?

I am pretty confident they haven't. Sounds like you've set yourself up for a reverse "No true Scotsman" here ;)


Nice catch, thanks for pointing it out.


For this very particular project I agree the reaction seems exaggerated, even though it does walk the thin line of copyright infringement. But as it happens, it rides the wave of all the other AI projects which started small and then headed where we know. Because once your book is in the database of company X, you can bet safe money they will take it and push their "analysis" as much further as they like, because hey, you did not complain. And I believe you must defend yourself in order to keep the copyright.


The response to this undermines my ability to take the "backlash" against AI as anything other than innumerate mob idiocy. It's hard to prevent myself from being negatively polarized against the backlash when people in the backlash defend outrage against obviously innocuous things like this.


We repeat the mistakes because in the short term, someone finds it profitable, hence a prisoner's dilemma type situation.

If an AI tool was killed, I consider it a victory. That's because even if there are some small useful applications of AI, AI on the whole will certainly put most creatives out of business.

Instead, I propose the following: anyone who is interested in preventing AI from taking over their craft should join me in a coalition to ban AI from their own business. By placing a notice that your work is "100% AI FREE", you are doing something akin to the fair-trade/sustainably-sourced sticker on chocolate or other food products: you are letting consumers know that your work was made by a human, so that they can support you.

If enough people get in on this, and pledge to support only those creators who don't use AI, then we can make AI an unprofitable venture and hopefully kill it forever!

I already put a 100% AI FREE badge on my YouTube channel, which means that I will never use AI for writing scripts, editing videos, producing images, etc. Moreover, I also pledge to support other creators who pledge never to use AI, by buying their products over others!


Without trying to sound flippant - what do you define AI as? Things like autofocus in your video cameras or automatic gain control or noise cancellation in your audio pipeline could also be considered AI. Do you remove those too? What about the AI recommendation algorithm built into YouTube - how do you reconcile being AI free while still using that platform?


Yes, you are right, and I advocate the following: a detailed look at each of these technologies.

However, for practical purposes, a direct definition that encompasses every situation is not necessary, but can evolve. For now, I think we do not need a precise definition and we can start with the following: AI such as ChatGPT, LLMs, and image generation tools like DALL-E and others should be restricted.

As for YouTube's algorithm, I agree it is also dangerous. For now, I have restricted the use of direct content generation algorithms; that is, all content can reasonably be said to be human-generated in terms of writing, composition, etc.

In other words: AI that makes any creative decision in making content should be banned. Other algorithms should be carefully debated.


Banning automation technology because it could put workers out of business... isn't that the textbook definition of a luddite? Also, are you saying no creative people are using these technologies? It's not all "enter 1 prompt, get image, call it a day"; they are tools that can be and often are part of a complex chain. Creatives that don't want to use these tools are probably going to be superseded by creatives who do.

What's your take on generative fill in Photoshop?


I am a luddite. What's wrong with that? I don't believe that all technology is bad, but AI has reached a stage where the order of magnitude of the changes it can effect is too damaging for humanity. I do believe that AI has become advanced enough to pose such a risk to us.

Some creative people are using these technologies, and while it is quite human-guided NOW, at some point the guidance that humans put into it will lessen. That's not to say that AI will ever produce a work like Dostoevsky --- maybe it won't, but it WILL be enough to eliminate most creative jobs, and reduce them to, at most, being supervised by people who don't have much of a passion for creative works. And that's a shame, because it will remove the passion of creativity from society.

Generative fill: I don't use it, and that's part of my personal ban. It goes too far. I only use traditional editing techniques in my photography, which work with basically what is there.

Yes, you can say that photography has always been about manipulation, but basically, I have a personal line that I believe I can define sufficiently well, that is far behind the line of AI.


What about the youtube speech to text AI that creates automatic CC transcripts for the hearing impaired? What about the AI that translates transcripts and comments into other languages? (translation certainly makes use of creativity since not every word maps 1:1)


One can always say that AI has some positive uses like CC transcripts. And of course, I can't prevent the platform from making that. I only mean to say that personally, I will not use it in the basic process of video creation. If YouTube ever forces any sort of editing on my videos through AI, I will quit the platform.

But returning to the topic: even though AI has some benefits, I believe that AI in the long run will have negatives that FAR outweigh the positives, so I believe it still should be restricted.

As for translation, well, the AI transcription/translation sucks. I do attempt to put manual captions in my videos as much as I can though.


So far we have identified like 10 ways you and your audience are currently benefitting from AI, but you haven't mentioned any concrete way AI is harming you.

Also, what do you mean by "forces any sort of editing on my videos through AI". Do you mean like, changing the actual content of your videos?


Relying on Authors Guild, Inc. v. Google, Inc. in order to determine fair use for AI models, there are a few key aspects to consider. The outcome should not supersede, supplant or become a replacement for the original works, nor should it sell portions of them. It should also preferably enhance the sale of the original work, to the benefit of the copyright holder.

In this specific case regarding Prosecraft, all those criteria might be fulfilled, and it might be that under those specific conditions the use of any copyrighted work for the creation of AI models is fair (at least under US law).


it's sad he took the site down, it looks like a neat project. it seems to be fair use, so it really is just an issue of consent and keeping people happy. the issue is some people will always be fearful/miserable. should the rest of us be held back in exploring culture because they refuse to play?

anyway i'm sure there will be ten other similar sites by the end of the week...


“That book’s vividness score (TM) is 75% opposed to that other book’s vividness score (TM) which is only 50%! That’s, like, a 50% higher culture score (TM)!”


What if I prefer books with a 50% vividness, so I bought the book that was too vivid for me and wrote a bad review? Or are online reviews also considered bad Silicon Valley tech?

BTW, do you write by hand or on a typewriter, or do you use a word processor with formatting, a thesaurus, and spell and grammar checking?


With CLOUD and AI happily in the descriptions, with no private desktop version, and your email to sign up, please.


Doesn't seem like fear of AI, more just authors being petulant. Didn't we have the same thing with some hack Star Wars book author attacking the Internet Archive for daring to host a copy of his book, before the AI fad?


What I find so odd about all this stuff is the target is very rarely OpenAI/ChatGPT. I understand it can be a useful tool, but if your concern is that AI has scanned your books without your consent and can generate new content in your writing style, then OpenAI is who you should be complaining about.

Somehow the project with tens of billions of dollars in funding from Microsoft gets a free pass, but a two person passion project that makes no money gets viciously attacked and killed. The same thing happened with generative art. The open source tools and smaller projects got served with lawsuits, but somehow DALL-E was not included in those.

Open AI is who people should be targeting since they are the ones who have all the money and the politicians in their pockets to basically stomp out any competition. My real fear is not that people find creative uses for AI on a small scale, but that Microsoft/Open AI builds a centralized system that works on their terms where you are forced to play by their rules and they decide what is fact and fiction.

I haven’t read Zach Rosenberg before, but I put this prompt into ChatGPT, and sure enough, it generated what I presume to be writing in his style:

> Could you write two paragraphs in the style of Zach Rosenberg arguing in favor of shutting down a tool that uses AI to analyze the text of his books?

Did the author consent to Open AI scanning the text of his books to generate new text emulating his writing style? Where is the outrage over that?


> Somehow the project with tens of billions of dollars in funding from Microsoft gets a free pass, but a two person passion project that makes no money gets viciously attacked and killed. The same thing happened with generative art. The open source tools and smaller projects got served with lawsuits, but somehow DALL-E was not included in those.

Isn't it obvious? Bullies always go after easy targets. And nothing is more popular or loathsome than self-righteous causes for bullies.


This is an incredibly biased article, hinging entirely on the assumption that AI training is fair use.


"Fair use" only applies to instances of copying / redistributing. The hint is in the name: copy-right.

There's a notion, which seems to have taken off among creators who are paranoid about AI eating their livelihoods (which it might eat a chunk of), that copyright prevents people from doing anything with works they [legally] acquired other than personally read, listen, or watch it.

That's not how copyright, as it has existed in the past, works. You can do all the algorithmic processing of your ebook collection that you want. You might be able to display small portions of a book to others, depending on the situation.

Quoting one or two paragraphs out of an entire book seems like reasonably safe fair use, but that won't stop a copyright-maximalist creator (or their publisher) from suing you, and won't stop some copyright-maximalist judge from ruling against you, so it's probably best to minimize the amount of content from a book that you redisplay directly. But you can do all the analysis and statistics generation you want, and display those results to others.

It remains to be seen what judges will do with AI generation of works based on ingesting gigantic amounts of copyrighted work. The entire framework of copyright is going to be broken, and until Congress steps in and changes it, judges are going to go every which way. There's no bright line for 4-factor analysis; it's always been a gut-level "is this a reasonable use that doesn't impact commercial sales too much". There's no possible rational way to draw a line. AI models can generate a painting of a new subject only loosely in the style of a contemporary painter, which would not be copyright infringement, or it can generate a near-clone of an existing work with the right prompting, and depending on how clever the prompter is, a lot of intermediate stages of likeness. Who decides how close to an existing work is too close?


Say I make a tool where you can enter the title of a book, and get the full text of the book without paying for it. I assume we all agree that would be illegal, right?

Now say that instead of distributing that tool as an executable, I distributed it as a library. It would contain all the same books as the illegal executable above, but some developers would need to write an actual executable that would use the library. Obviously that executable would be illegal (same as the first one above), but would the library be illegal? I would think that we all reasonably agree that my library would be illegal, because I would essentially be selling the content of books for which I don't own the copyright.

Now what happens when I distribute a trained model that has the potential to generate illegal work? Is it suddenly legal for me to sell it, because some developers write legal executables with it? How is that different from my library example above?


Google Books already does exactly this. It has a library of the full text of millions of books. Users can search for a passage of text and google will display the paragraph where the passage is found.

https://books.google.com

example:

https://i.ibb.co/DCxJpHN/IMG-3143.jpg


Google does not provide the full book, does it? Exactly like they could provide a few seconds of a song, but not the song in its entirety.


No, they don't provide the full book, just a few sentences before and after your search prompt (same as Prosecraft).

In both cases, however, if you had the patience, you could search the last few words of the text returned from your prior query and slowly work your way through the entire book.


A few sentences? For most books I've seen, it's a few pages. Google will block you from retrieving more pages from the same book eventually. Using a VPN and a different account may get around one limit, but I experimented with multiple VPNs and browsers once, and although I was able to get a majority of a book's pages, after that google stopped showing me full previews of any of the remaining pages no matter where the request came from.


It shows you a few pages if you are previewing the book (i.e. "look inside"). But if you are using search, it will show you where your search query shows up in the book, no matter what page the search query is found on. This means you could theoretically search a book sentence by sentence, and it will eventually have shown you the entire book. I'm not claiming this is an efficient or practical way to game the system and read books, only that Google Books does contain the full copy of the book text and can reveal the contents of any passage. This is basically how Prosecraft works (at least what I glean from the article): it doesn't let you read a whole book, even though it may contain a representation of the full text.
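
To make the chaining idea concrete, a sketch assuming a hypothetical snippet_search(query) that returns a few sentences of context around a match. No such public API exists, and in practice rate limits and preview caps are designed to break exactly this loop:

    def reconstruct(seed: str, max_rounds: int = 10000) -> str:
        """Chain snippet queries, each seeded by the tail of the last result."""
        text = snippet_search(seed)  # hypothetical search endpoint
        for _ in range(max_rounds):
            tail = " ".join(text.split()[-8:])   # last few recovered words
            snippet = snippet_search(tail)
            idx = snippet.find(tail) if snippet else -1
            if idx == -1:
                break                            # nothing new came back
            new = snippet[idx + len(tail):]
            if not new:
                break
            text += new
        return text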


Sure. I really did not mean that specifically for Prosecraft. But the article questions why authors are attacking Prosecraft "because it does no harm". My answer is that authors don't (and can't, really) make that distinction on a per-case basis.

At this point what they see is that LLMs trained on their copyrighted material are able to generate similar material thanks to their copyrighted material that was used in the training (that is important!), and they see that they won't get paid for that. Of course they are scared, and they should be. And of course they will now start attacking everything that looks like it is using their copyrighted material as training data.

I really don't get why the engineering world does not get this: LLMs have the potential to ruin people's jobs, it is not clear at all that this is legit (IMO LLMs could not do it without the copyrighted material they used for training, therefore they are derivatives of the original work), and those people are rightfully scared.


I think most people who have thought about it understand the impact AI models seem destined to have on writing (and digital 2d art, soon music, and later other things). In addition to writers and voice artists panicking, see the Hollywood strikes, for instance, and what's currently happening in the corporate world to digital artists.

Copyright is not the correct tool to address it. In the U.S., the basis for copyright law is a clause in the constitution which allows Congress to pass laws to secure for limited time creators' exclusive rights to their creations, for the purpose of "promot[ing] the progress of science and useful arts."

If AI is going to displace nearly all of that content, at least content that's mostly entertainment and less educational (novels, music, images, eventually movies), doesn't that imply that AI is better at generating useful entertainment than humans are? Aren't hypothetical laws restricting training of AI on past works to protect the market for future human-created works a direct violation of the constitution, if AI is better at generating "useful" entertainment (entertainment that people will buy) than humans are?

To whatever extent AI can eventually displace nonfiction content in the sciences and liberal arts, which is more indisputably useful, the same argument will apply; however, it applies to fiction content first, because current AI models have more trouble with facts (trouble in the form of hallucinations, for instance). The threat in non-fiction areas of creativity is more distant than the threat in fictional areas of creativity which are not as bound to reality.


> I think most people who have thought about it understand the impact AI models seem destined to have on writing

Go back to the beginning of social media, and tell me that "most people who had thought about it had understood the impact social media would have on society". It is really not a given. And that is my criticism: we see from history that it is not straightforward to understand the impact of new technology, but we engineers keep making the same mistakes over and over again.

> Copyright is not the correct tool to address it

Maybe not, that's right. I don't think anyone disagrees. The issue - at least from the point of view of artists - is more that some people (including authors and artists) want the problem addressed, and others (including engineers) just want to make money with their new toy and don't care much about addressing the problem.

> doesn't that imply that AI is better at generating useful entertainment than humans are?

I don't think so, no. It is maybe economically more successful, but I think it is clear that what is good for the economy is not necessarily good for society.

> however, it applies to fiction content first

Well... that is ignoring all the black hat use-cases, going from phishing to political mass manipulation, I would say :-)


Why didn't you just say that, instead of posing a hypothetical about software that may itself contain full book text which can be used to display (in this case fair-use) passages to end users? lol

I think the disconnect between your point of view and mine is that I see "training an LLM on copyrighted text" the same as a person reading copyrighted text, which is perfectly legal. And I see violating copyright as a person or LLM reproducing copyrighted work (illegal). But using other works as inspiration for something novel shouldn't be considered illegal, whether a person or LLM produced the work. I would even be fine with literature being treated more like music, where reproducing the essence of a piece of work (i.e. doesn't have to be a word for word reproduction) is considered a violation. But if the LLM creates something completely new, how is that a derivative work / infringement?


> Why didn't you just say that, instead of

Because I was replying to a post that was talking about drawing the line for fair use. I just shared my view of how I see it. To me, OpenAI should be responsible for not giving copyrighted material to users if they are not allowed to do so. This means that they should be sued every single time someone manages to extract what is considered copyrighted material from their software. Because the authors never gave them that right.

Your Google Books example is different: the most obvious difference is that Google Books does not pretend that it is their content; they clearly say "here is a passage of this book".

> I see "training an LLM on copyrighted text" the same as a person reading copyrighted text

Yes, I think that is the main discussion point around LLMs. My point is that machines are not humans, and therefore they should not be blindly treated like humans. We should think about the consequence of the machines doing what they do, and decide whether that is legal or not in our society. Otherwise we would give machines the right to vote ("humans can vote, I don't see why machines couldn't").


> This means that OpenAI should be sued every single time someone manages to extract what is considered as copyrighted material from their software.

I agree! If GPT4 is outputting copyrighted material beyond what is considered fair-use (i.e. substantively more than what is provided by say, google books), I agree that is copyright infringement.

Indeed it is about the output, and making stuff available that people would otherwise have to pay for (or more precisely, enough of the copyrighted work that a person would have reason not to pay for the original work, causing a material loss to the original author) - that is a fineable violation imo.

Something else to think about... I work in biotech and have published articles in scientific journals on cellular and molecular level disease sequelae (such articles are also protected by copyright). Models trained on scientific literature are now being used for novel drug discovery and disease treatment pathways. These models are already outputting suggestions that seem very promising. Shall we also not provide these models access to the full corpus of scientific literature? It would significantly handicap these models to not have access to copyrighted scientific works. On one hand, some proportion of researchers will retain their jobs that would have otherwise been outsourced to LLMs (perhaps even myself). On the other hand, some amount of future patients will suffer or die from a disease that would have otherwise been cured.


That actually brings another point: if you train LLMs on scientific papers, at least in some domains it will make it easier to write a lot of papers. I am not an academic, but it is already my impression that there are a lot of low-quality papers out there. What if now many more get generated by LLMs? Won't that be a problem?


The low quality problem with primary research publications is not the writing but poor experimental design, misrepresenting experimental results, shoddy statistical analysis, and putting null results into file cabinets. Summarizing research findings isn't the crux of the problem, so if anything if an LLM can help the author perform a clearer and more concise writeup I'd see it as a net benefit.


One can use Photoshop to create heinous things which would be highly illegal to sell, or even show. Should we ban Photoshop? I think we agree, that no, that would be silly.

Let's assume I train a model on public domain texts only, zero copyrighted material has gone into it. Yet, technically and also practically, the model would have the ability to create illegal, copyright infringing content, given the right input. Should distribution of such a model be illegal? I think not.

Also, let's keep in mind that these models are not archives, that contain the original data verbatim. They are effectively lossy compression algorithms, that capture the essence. Some models are a few tens of GB in size, yet have been trained on TB worth of data, so clearly a significant data loss has occurred, so it's a bad analogy to your "keep a bunch of books as in a library" example.
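
Back-of-the-envelope, with assumed numbers (not the specs of any particular model):

    params = 7e9                       # assume a 7B-parameter model
    model_bytes = params * 2           # fp16: 2 bytes per parameter, ~14 GB
    corpus_bytes = 4.5e12              # assume ~4.5 TB of training text
    print(model_bytes / corpus_bytes)  # ~0.003: far too small for verbatim storage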


> One can use Photoshop to create heinous things which would be highly illegal to sell, or even show. Should we ban Photoshop? I think we agree, that no, that would be silly.

This is a different matter, just like saying "we don't make cars illegal even though cars can be used illegally". IMO it is off topic in the context of my post above (or you need to elaborate).

> Let's assume I train a model on public domain texts only, zero copyrighted material has gone into it.

Do you genuinely not see the difference, from the point of view of the copyright holders? If OpenAI could build a version of ChatGPT without using any copyrighted material at all, I agree with you: it should not be made illegal. But the illegal use of it should be. I guess we agree here, but to me that is quite different from training ChatGPT with copyrighted material. The first difference being... well... good luck training ChatGPT without copyrighted material (probably now it's too late anyway, because weights of models trained from copyrighted material are just all over the Internet).

> Also, let's keep in mind that these models are not archives, that contain the original data verbatim. They are effectively lossy compression algorithms, that capture the essence.

Say I creat excutable dat do som kind of loss cmpression tht captred essnce.

First, do you agree that the sentence above is "some kind of lossy compression that captures the essence"? If yes, would you consider it legal for me to use that algorithm on famous books and sell them under my name, or would you think that I abused the copyright of the original material?


A car has the potential to be used for illegal activities. In fact it happens every second, but that does not mean we go out of our way to make cars illegal.

That said, it is still an open question whether the content generated by an LLM is illegal. It only generates content that is inspired by some author's work; it is not recreating their work, only their style. Humans copy each other all the time in writing, music, art, you name it.


Either you completely missed my point, or you genuinely believe that my first executable is legal.

Which one is it?


I do not buy the premises. Your example states that it can provide the full text of any book. An LLM cannot do that. They can produce something in the same style and setting. When an actual human author mimics other writers' styles, it is not illegal, so why exactly should it be illegal for an author to use an LLM to do it?


> When an actual human author mimics other writers' styles, it is not illegal, so why exactly should it be illegal for an author to use an LLM to do it?

There is a fundamental difference of scale. Say I write a blog post about some technical thing I know. You read it, learn from it (and other sources), and then you write your own blog post with your understanding. You may link to my post (if you believe it is heavily inspired from there), or not. But reading my post, understanding the material to the point where you can write about it took time. If you are so fast that you can produce more posts than I do, and your posts are the same quality, then suddenly you become a better blogger than me.

An LLM can automate that. With an LLM, you as a blogger don't have to learn from my post, you can just have it generate the post (you don't even need to read it). In that case, you have put absolutely zero effort learning about the topic, you just had some algorithm copy-paste-and-modify my post. Don't you think that if you publish a blog post that you haven't even read yourself, probably it comes from material written by somebody else at some point?

In any case it is very different: in one case a human does it, in the other a machine does it. I don't understand why engineers think we can just transfer concepts between both. You as a human have rights (maybe you can vote), but your LLM does not. Why couldn't an LLM vote, if humans can? Precisely because it is different, and it creates new problems.

The fact that LLMs can automatically (and at scale) launder copyrighted material is a new problem. Therefore we need to think about that, and to think about whether or not this should be legal.

The day people stop reading my blog and instead use ChatGPT (which is trained using, among the ton of other copyrighted material, my blog), and as a result nobody reads my blog directly anymore, I can assure you that I will stop writing. If 99% of books are written by ChatGPT instead of actual authors, but based on books from actual authors, in such a way that the readers don't really make a difference and therefore go with ChatGPT (because it's cheaper), what do you think writers will do? If they can't live from writing books (because ChatGPT does not buy their books, and people now pay ChatGPT instead of buying the books), then they will just stop writing.

It is a risk that needs to be considered. And it's certainly not an engineering question.


I do not disagree with any of what you wrote. That is also an entirely different line of reasoning than your first argument.

That said, LLMs today cannot do this in a meaningful way. If an author cannot write a better book than ChatGPT, then that author would not be able to live off their writing anyway. And the authors that use ChatGPT to write a book, but still put the effort into fine-tuning it, will not be able to do this at scale. You also need someone to lay out the plot and the twists and turns if it is to be a full-length book.

Let's assume that in 10 years, LLMs are at the point where you cannot distinguish between a well-written book by an author and one generated entirely by an AI. Suppose we have two authors: one long dead, whose works are in the public domain, and a young one who is just starting. An LLM trained on the dead author's public-domain work can generate books that are just like the original works. But what if the young author writes in a similar way? Is it now legal or illegal to generate the same content? It's impossible to know if the young author's work has been used for training.

My take on it is that LLMs are pretty stupid. They cannot come up with new and novel things. So if a writer writes something that is different (i.e. new and novel), how do we protect that? We cannot prevent it from being used for training, so the next logical step is to protect it the same way as we protect technology with patents. But that comes with its own class of problems: if two people write the same way independently, only one can have the right to do it. That is not the solution either.

I do not have the answer, but I am certain that trying to ban LLMs, or dictating what and how is not the answer. Perhaps the authors that can write in a new and novel way, and knows how to use AI will proliferate because they embrace it.


You didn't respond to this

> Let's assume that in 10 years, LLMs are at the point where you cannot distinguish between a well-written book by an author and one generated entirely by an AI. Suppose we have two authors: one long dead, whose works are in the public domain, and a young one who is just starting. An LLM trained on the dead author's public-domain work can generate books that are just like the original works. But what if the young author writes in a similar way? Is it now legal or illegal to generate the same content? It's impossible to know if the young author's work has been used for training.

which is central to my reasoning. You could go the same way as software patents, but that is not preferable in any way.

---

>> I am certain that trying to ban LLMs, or dictating what and how is not the answer.

> I wouldn't ban LLMs because of copyright issues, though I would let authors choose whether their IP can be used for training or not.

> Why not? Just say that using for training is considered derivative work, and that's it. Now copyright owners just have to update their license to allow for training if they want to, and that's solved. Of course, Big Tech makes less money from that scenario.

Big Tech can train on everything that is "legal", and malicious actors can fine-tune with a specific author's works and then generate books. You will not be able to detect that, and the malicious actor can claim to have written it themselves. Then we're back to the starting point.


> We cannot prevent it from being used for training, so the next logical step is to protect it the same way as we protect technology with patents.

Why not? Just say that using for training is considered derivative work, and that's it. Now copyright owners just have to update their license to allow for training if they want to, and that's solved. Of course, Big Tech makes less money from that scenario.

> I am certain that trying to ban LLMs, or dictating what and how is not the answer.

I wouldn't ban LLMs because of copyright issues, though I would let authors choose whether their IP can be used for training or not.

However, copyright is only one issue with LLMs. All the black hat use-cases are a whole other category of issues. And I am of the opinion that technology is not neutral: IMO, it is perfectly fine for a society to ban a technology if it believes that it is globally doing more harm than good.


> Say I make a tool where you can enter the title of a book, and get the full text of the book without paying for it.

Let me introduce you to the Library of Babel[1].

But you need to know the hex! you complain. But that's basically how all of the "AI outputs copyrighted works!!!" gimmicks work. They're impractical unless you know exactly what you want it to reproduce. You can't just casually pick up a copy of Harry Potter like you would in a real library.

So is the Library of Babel illegal? What's the difference?

[1]: https://libraryofbabel.info/browse.cgi


Curiously that is a pretty close description of a normal neighborhood library.


Pretty close, and yet so different.

Don't you agree that I am allowed to buy a physical book and then resell it, but that I am not allowed to buy a physical book, make copies of it, and sell those copies?


The doctrine of first sale would never be allowed today if it hadn't already been enshrined in law over 100 years ago. Stakeholders would insist that creators wouldn't create if people could fund community libraries and thereby shrink the book market. They'd demand the same thing for physical books that they've implemented for almost all ebooks: a pay-per-lend system. The effect would be a substantially higher cost for each loan-out. Congress would go along, just as they went along with a new copyright term of min(120, life+70) years.

In the U.S., in 1909 when the doctrine of first sale was codified, the copyright term was 28 years + 28 years upon renewal, and copyright had to be registered. That would never be accepted by copyright interests today, either.


Great example! I think your library wouldn't be illegal itself, but distributing it would be illegal.


There's also the much more complex part of other jurisdictions. Japan already has largely said everything is fair game in AI.


The article cites the Google vs Authors Guild case (https://www.techdirt.com/2013/11/14/google-gets-total-victor...), which was a total victory for Google. This seems fairly conclusive to me that the textual analysis here is fair use.

> Similarly, Google Books is also transformative in the sense that it has transformed book text into data for purposes of substantive research, including data mining and text mining in new areas, thereby opening up new fields of research. Words in books are being used in a way they have not been used before. Google Books has created something new in the use of book text — the frequency of words and trends in their usage provide substantive information.

Furthermore, is this actually AI training? This just looks like stats based on heuristics to me, i.e., garden-variety sentiment analysis.


I think Google Books is cool, but "the frequency of words and trends in their usage provid[ing] substantive information" pre-dates Google Books by a long time. For example, there's a collection of word frequencies in the complete works of John Keats from 1917 [1]. Manually tabulated, too!

[1] https://catalog.hathitrust.org/Record/001023999


Sure, but the Google Books case is a massive and well-funded court case that was a pretty resounding victory for Google and textual analysis of copyrighted works in general, so anybody arguing that this is obviously copyright violation needs to explain why the Google Books case isn’t relevant.


IANAL. But, what's your consideration such that you think this usage of the books wouldn't be "fair use"?

e.g. in the US https://en.wikipedia.org/wiki/Fair_use factors of fair use are "purpose/character of the work", "nature of the copyrighted work", "amount/substantiality of the copyrighted work", "effect on the market for the copyrighted work".

The website shows a few statistics computed from a book, and a few excerpts from the book.

I'd think a consideration of those fair use factors favours the website: e.g. you're not going to look at those statistics/excerpts instead of reading the book. The website only shows a small portion of the book. The website's intention is to be educational.


I think the site had a better claim by far than AI training does.


There is no “AI” in the project. It literally just seems to do Bayesian sentiment analysis of books. It does NOT then mine that data to score OTHER books. Please actually read the article next time before commenting
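
For anyone unfamiliar with the technique, a minimal sketch of per-passage sentiment scoring, using NLTK's lexicon-based VADER analyzer as a stand-in (the article doesn't say exactly which method the site used):

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    sia = SentimentIntensityAnalyzer()

    for passage in ["The sun rose over a golden meadow.",
                    "The room was cold, dark, and silent."]:
        # compound ranges from -1 (most negative) to +1 (most positive)
        print(sia.polarity_scores(passage)["compound"], passage)

Run over a sliding window of passages, scores like these are enough to chart which parts of a book are "darker" without generating a word of text.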


I did in fact read the article. The author isn’t concerned about the book project. The author wants to use it as a springboard for complaining about AI fear, and his argument fundamentally hinges on the claim that AI training is fair use. He’s claiming the book project is comparable to AI training.

I think the book project has a better claim to fair use than AI training does, although excerpting entire pages is maybe too far (but the word highlighting does feel like commentary).


If I read the book, take notes about the number of times $x happened and later post them to my website, are you fine with that?

if I automate that process with a script that uses regular expressions, are you fine with that?

if I add a neural network to my script and it spits out the same numbers, are you still fine with that?

hopefully you see the madness in your argument.


I’m thinking specifically of generative AI’s fair use claims. I thought the book site (and all of your extension examples) was clearly fair use.


Why it wouldn't be? Indexing is fair use. I think where it gets murky with fair use would be if AI could actually plagiarize the book, but other than that, it should be fair use.


Indexing doesn't replace the book, it's just a different way to read it. The index is useless without the book itself.


Neither does knowing which parts of the book are darker or happier.


Unfortunately, it seems the majority of HN clings to "AI is fair use!" the same way GIFs (of movie and TV snippets) are perceived to be fair use (but simply are not, in most cases).


"Majority" and "clings to" seem entirely arbitrary and pejorative. Why not make your actual point instead of just beating up a straw man?


And the other half of HN loses their mind when AI is brought up, completely missing the point made quite clearly in the OP that _no_AI_was_involved_in_any_of_this_.

The AI hysteria on this site is ridiculous.


The AI goalposts keep moving! No 2023 AI was used here; maybe some 1993 AI. ChatGPT will be considered "just stats" in 2033.


Why do you feel that the Google vs Authors Guild ruling cited in the article isn’t relevant?


There is no AI training here. Did you read the article?


I honestly don't understand how people can refuse the idea that some parts of a book most certainly can be reviewed with the help of a computer. There wasn't a "how good is this book" score, because a computer might not be able to tell that yet, but I don't understand the issue with looking at the number of adverbs in a book with the help of a computer.
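
A crude version of that adverb count is only a few lines with an off-the-shelf part-of-speech tagger. A sketch using NLTK, purely for illustration; the site never documented its actual method:

    import nltk

    # One-time model downloads for tokenization and POS tagging.
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    def adverb_percentage(text):
        tokens = nltk.word_tokenize(text)
        tagged = nltk.pos_tag(tokens)
        # Penn Treebank adverb tags: RB, RBR (comparative), RBS (superlative).
        adverbs = [word for word, tag in tagged if tag.startswith("RB")]
        return 100.0 * len(adverbs) / len(tokens) if tokens else 0.0

    print(adverb_percentage("He ran quickly and very quietly away."))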


Software people are intensely susceptible to the McNamara Fallacy:

“But when the McNamara discipline is applied too literally, the first step is to measure whatever can be easily measured. The second step is to disregard that which can't easily be measured or given a quantitative value. The third step is to presume that what can't be measured easily really isn't important. The fourth step is to say that what can't be easily measured really doesn't exist. This is suicide.”

— Daniel Yankelovich, "Interpreting the New Life Styles", Sales Management (1971)

https://en.m.wikipedia.org/wiki/McNamara_fallacy

This is often justified as an attempt to be objective, but I think the actual reason is that it’s hard to deal with unquantifiable properties.


Is there a counter fallacy along the lines of "if we can't measure everything, we shouldn't measure anything"? It can still be an interesting, fun and informative exercise while keeping in mind that it may be limited.


> I don't understand the issue in looking at the number of adverbs in a book with the help of a computer.

I think that authors see an issue with their IP being fed to an algorithmic black box that will certainly make a lot of money while the authors won't see any of it, and about which engineers say "don't worry, it is not stealing your IP, just let us make money the way we want and shut up".

Could OpenAI prove that their algorithms are not stealing IP? I think the answer is easy: they just can't. They don't even formally know what their algorithm can and cannot do.

If I was an author or an artist, I would most definitely want a new kind of legal way to make sure that if I don't want my work to be used as training data, then it most definitely isn't. Of course that's not really possible, because who can audit the algorithmic black box and see if my work was used in the training?


How was the tool in question, the tool the book authors were outraged about, "stealing their IP"? It's also unclear how it would make any money at all -- indeed, the tool's author even said it made no money.


I believe that the authors are not fighting against that tool in particular. They are fighting about generative AIs being trained with their material without their consent.

Which is totally legit to me. Maybe this tool is just collateral damage in a much bigger debate, I don't know. The fact is that in the bigger debate, it does feel like engineers don't seem to care much about the artists. Why would the artists care about individual engineers and individual use-cases?


Someone should recreate the website and change the word "AI" to "underpaid worker" and everything would be fine.


"crowdsourced"


I just think it's really funny that a third or so of the article is the author struggling to figure out why this would be useful to anyone.

> scanned and analyzed a whole bunch of books and would let you call up really useful data on books. [...] Frankly, all of that sounds amazing. And amazingly useful. Even more amazing is that he built it, and it worked. It would produce useful analysis of books.

> This is all quite interesting. It’s also the kind of thing that data scientists do on all kinds of work for useful purposes. Smith built Prosecraft into Shaxpir, again, making it a more useful tool.

Author's general illiteracy aside, he's really giving the game away here. I can't even think about the ethical implications of the project, because why would I care to count the number of adverbs and instances of passive voice in all books ever, and why would you need a state-of-the-art LLM-powered AI to do it?


The ethical implication of indexing books?

Also where do you read the author of the website was using LLM?


he is not even using an LLM, which is kind of the point. the tool is lumped in and judged as an AI tool when it is just simple statistics that anyone who has studied some math can build with a little bit of effort.


Imagine seeing the trends in books throughout a certain time period, wars, etc. Or larger trends over the history of all written works. Or all kinds of other neat and useful information that can impact decisions we make today. Do you think all analysis of things you aren't personally interested in is useless?


I hate "did you even read my comment" replies, but again, you're pitching me on "trends throughout a time period", "larger trends throughout history" and "all kinds of other useful information that can impact decisions".

I don't think a person without interest in literary analysis would make this comment. On the contrary, I'm specifically calling out the building of nonsensical tools with little practical use, because the author of the tool does not engage with literary works beyond the most superficial layer. And because I don't believe quantitative analysis of art is beneficial to culture and society, but that's a broader subject.


I generally appreciate TechDirt but this is a very weak argument, the logic is inverted. That some people who are opposed to LLMs were wrong to attack a tool that has nothing to do with LLMs does not allow us to conclude they are wrong about LLMs.


The tool looks like it was useful to a certain kind of person. If it would actually make money, I would gladly (and probably easily) replicate it because I don't really care that much if Internet randos hate me. But I don't have a good idea for how to keep the return / effort ratio high on this.

I don't need consent for a lot of this, and I probably wouldn't bother. If I made a "List of books with terrible sentences" I wouldn't ask for opt-in or even bother contacting the authors. I will just make the list and quote the sentence.

The law and public opinion are on my side, though I only need the former.


Copyright issues aside.

My personal opinion is that these tools are mostly useful to suck the "soul" out of a book. They give you templates and stuff and useful statistics to help you go to the lowest common denominator.

The problem is more visible in the movie industry, where they have had script templates for a hundred years now (actual time interval pulled out of a*), but it's starting to show up in books too.

For those happy to "consume" Netflix series and Marvel movies that are indistinguishable from each other except maybe filmed with different actors, it should be fine.

If you want originality in your entertainment it's sad news.


I wonder about a parallel for paintings. What if there was an analysis stating exactly the brushes the painter used, the number of strokes, the exact pigments, etc? Would that, in your opinion, "suck the 'soul' out of a painting"?

I could see this as a brilliant learning tool. A tool to provide deep insight into something that would be very challenging to quantify personally. I think all this would make future authors better, not worse.


The cave paintings in France have been studied this way, starting with Leroi-Gourhan's work and then accelerating with the use of computers. It's definitely shed some important light on the artists who made them tens of thousands of years ago, and I don't think it made the paintings any less wonderful.


> Would that, in your opinion, "suck the 'soul' out of a painting"?

Out of the analyzed painting, no.

> A tool to provide deep insight into something that would be very challenging to quantify personally.

Yeah, and it will prevent the would-be creator from developing a personal style. No matter what medium they create with.


Does an artist not see thousands of artworks as they develop their style? Do artists not experiment with hundreds of styles as they home in on something they want to be theirs? Is the internet, a tool that lets them see even more artworks, preventing creators from developing their personal style?

As I understand it, we've seen an explosion of different dancing styles as various apps let users share their craft with others - a quicker evolution of a greater variety of dancing. And I'm confident dancers use technology like playing videos at slower speeds to analyze others' movements.


You're right, but there's a large difference between your brain doing the analysis and a template or computer wizard telling you what to "create".

[Link picked at random] https://www.studiobinder.com/blog/three-act-structure/

Do you really think this did wonders for movie originality?


I've heard the phrase "you have to know the rules to break them" (or something similar). The next generation of creators may look at that and choose to change their story telling on purpose, or exaggerate the curves, or something else ... that's what creativity is about - trying variations.

Just because Hollywood is pumping out movies you find unpleasant doesn't mean art is dead.

You and I probably agree more than this conversation suggests. You're right that people can and likely will stick to patterns that are successful, and that will mean there may be more copy-cat cookie-cutter stuff out there. But I strongly suspect that it will allow other creators to rocket past old mistakes or common tropes towards something novel.


The bar for calling yourself a writer or an author is incredibly low. Doing so doesn't qualify you to have an opinion on much of anything.

These people are just joining an outrage wave in an attempt to advertise, e.g. "How dare this project touch BOOK TITLE written by Me! (chain of hashtags)". They're not concerned at all about what they are breaking along the way.


I like this article about internet outrage being bad that seems to exist solely to use that conceit to try to drum up internet outrage


I think it's wise to take the concerns of the creative community seriously - after all, their "labor of love" [1] matters immensely; without it, LLMs are useless.

matters not how much the coder "loved" the project, or did yoga, or that they've not made money for years; after all, most book authors aren't exactly raking in the money either.

also, like many things in life, some tools/projects/startups etc just stop being needed/used and new ones/competitors take over. there's nothing to say that since tool X is using A.I. it therefore has to be adopted by one and all, smiles all around.

google has the 'right to be forgotten', is also looking into 'machine unlearning', and it's common for platforms to honor users' requests to remove their data / close their accounts.

[1] From OP: destroying what had been a clear labor of love and a useful project


The thing with this project is that it had no conceivable way of threatening the original authors, financially or otherwise. The analyses it produced weren't replacements of the original works, and served as a writing tool, not as something that generated new content.

In this case, the shutdown seemed to hinge entirely on the irrational fear of anything with "AI" written on it and an unshakeable conviction that this clearly transformative use case wasn't fair use.


>> had no conceivable way of threatening the original authors, financially or otherwise

how so? what's inconceivable about it?

>> seemed to hinge entirely on the irrational fear

how are the authors "fearing without reason" or "illogically fearing" ?


> how so? what's inconceivable about it?

Authors make money through sales of their work. This tool was a writing aid that analyzed text and included some copyrighted works in its dataset. There was no way of retrieving these books in full, and the excerpts that were allegedly shown to the users were used in an analytical context, unlike the original works. So, this website couldn't replace ownership of the actual book for readers, and had no capacity to hurt their sales. Basically, the way the service used this data should be transformative enough as to not have any impact on the authors.

> how are the authors "fearing without reason" or "illogically fearing" ?

I called it "fear" because there was no strong argument on the authors' side as to why this tool is bad. I called it "illogical" because I think that it's no coincidence that this controversy only came up now, in 2023. Back in 2017 and onward, the existence of this tool didn't appear to generate pushback. My pet theory is that in 2023, now that we have good generative AI, the advancements have spawned an entire subset of people that view anything "AI" as inherently tainted and immoral. The original complaints seem to lack understanding of this tool and multiple people have conflated it with generative AI, despite it having nothing to do with that.


>> Basically, the way the service used this data

it's not about the past (2017...); the authors are concerned about how the dataset could be, or is likely to be, used from now on.

many tech projects these days have integrated or are thinking about integrating third-party A.I. providers in their services, either to harness the power of their large datasets or their large user-base. I think it's great if authors/users opt in to this, but likewise I agree with those that want out (opt-out).

>> I called it "fear" because there was no strong argument on the authors' side as to why this tool is bad

their argument doesn't have to be a peer-reviewed journal, it suffices to say "i don't want my books in your dataset"

>> I called it "illogical" because I think that it's no coincidence that this controversy only came up now, in 2023. Back in 2017 and onward, the existence of this tool didn't appear to generate pushback ...

6 years have passed since 2017, life moves on, it's natural for things to change e.g: the project's code, updating servers, partnerships, emergence of third-party tools/libs/services etc etc.


> the authors are concerned about how the dataset could be or is likely to be used from now on.

The service in question wasn't introducing anything new that'd appear to justify all the recent pushback. It feels like you continue generalizing your statements, while the discussion topic is about what made Prosecraft specifically so preposterous that it warranted the outrage.

> their argument doesn't have to be a peer-reviewed journal, it suffices to say "i don't want my books in your dataset"

It's kind of a blunt statement, but why should they have a say? For example, say I create a website where I publish technical analysis of famous literary works, including basic statistics about a book and a review. Should the authors be able to just take that down? This use is legally protected (as is creating a dataset), so allowing authors to restrict this use seems as arbitrary as allowing them to say that no person can ever bring their books into the country of Moldova or that no one over the age of 50 may read it.

> it's natural for things to change e.g: the project's code, updating servers, partnerships, emergence of third-party tools/libs/services etc

And yet, in this specific situation, all of this is conjecture. Nothing about the project changed in some significant way in 2023 that would warrant this. Further proof: the people who are against Prosecraft don't seem to bring up any specific changes or reasons for their stance, only that it is "AI".


as I said earlier, "6 years have passed" since 2017, and the arrival of A.I. is causing authors (to say the least) real concern. a change in the industry/ecosystem around X is enough to affect those associated with X.

you offered a "pet theory" (as you say) further up, but not much by way of proof that their fear is irrational.

i think we've reached an impasse here; the points are re-cycling.


> the industry/ecosystem around X is enough to change to affect those associated with X.

I think that you keep trying to generalize this argument because there's nothing that's strongly damning about Prosecraft. If these authors have concerns regarding newer AI tech, they should address those concerns to services that actually implement that tech, not a tool that has peacefully existed for years and is guilty of nothing but being vaguely associated with these newer developments. It's like if Ford released a new car that could go 1000km/h, and then I, in protest of its obvious lack of safety, went to a museum and smashed up a Model T. If that's not irrational, I don't know what would be.

> you offered a "pet theory" (as you say) further up, but not much by way of proof that their fear is irrational.

There's no concrete proof because I didn't find anything that I could even try to disprove. If you read through the anti-Prosecraft posts, they lack substance in bringing up how exactly this service can financially harm authors. The only complaint seems to be that the authors don't want it - even though the use of their works appears legal and is perfectly consistent with what many other services have done in the past. Thus, my only conclusion was that at least some of the hate stemmed from Prosecraft just being vaguely "something with AI".


One thing that stuck out to me after reading the initial vitriolic Twitter thread from the author saying "How dare you!" is that the author is a lawyer by day, which I find interesting, since they seem to be very uninformed about copyright and fair use.


The concept is fabulous, and I do agree that there were some gut-kick reactions. Any tool that can help a writer progress their craft is a welcome addition, as far as I am concerned, but there are consequences that probably even the creator did not think about. Tools are never used the way the creator envisioned. Being "cool" is not a reason to embrace it. I can see both sides, but I really did not understand how the tool worked, and until I do, I am going to struggle with which side of the fence to land on.


open source it on IPFS and walk away, maybe call it something else and just drop the hashes on 4chan anonymously

this reaction deserves the inevitable, as his only problem was it being attributed to him


How can he do it anonymously when we already know he's the only one who has it?


because nobody has seen the code, but everyone has seen what it does and can think of multiple ways to do it with AI


Lie and claim to have been hacked.


> I know we live in an outrage era, where the immediate reaction is to turn the outrage meter up to 11. I’m certainly guilty of that at times myself.

I’ve learned to feel my outrage, while refraining from expressing it.

I will usually avoid using anything more than vague terms to express ire or rancor, and reserve my eloquence for positive commentary.

Some ITGs probably think that I’m a “wuss,” but I learned, long ago, not to poop where I sleep.


I'm not a lawyer, but neither is this Mike guy. I'm quite suspicious about how confident he is in stating that all of this is legally fine. From the creator himself:

"When I ran out of books on my own shelves, I looked to the internet for more text that I could analyze, and I used web crawlers to find more books."

I'm annoyed by the phrasing, because it obscures that you pirated commercial books. A commercial book is not open or public data; you're supposed to pay for access to it. This assumption that just because something can be found on the internet, it's cool to just take it and do whatever the hell you want with it, is worthy of push-back, especially in the case of data that isn't public at all (not by intent, anyway). This is not the same thing as scraping Wikipedia, which allows for it and has supportive licensing for it.

I believe that in the majority of cases where we're talking about entirely new use cases such as big data, deep analysis, AI training, etc., we need to move to an opt-in model. Unless explicitly specified, you have no permission. The opposite situation is ridiculous.

I do agree that the mob-like type of criticism deserves criticism in itself. Trigger words like "AI", "crypto", "white techbro" converting adults into sadistic bullies is a sad thing to watch.


Let's split the discussion though.

The authors clearly didn't support the idea of the project, regardless of the source for the data.

How the developer acquired the data is a different discussion. Unless you have clear proof they pirated the content, why disparage them?


AI is going incredibly fast, but package management is going incredibly slow. I have a beefy GPU and I should have all this AI stuff at my fingertips, but package managers are not playing along and I'm too tired to get into another dependency hell.

This is what's killing useful tools more than fear of AI.


as soon as we can find an economic paradigm that doesn't rely on taxing the trade of exclusive assets we can have a digital renaissance.

the thing with digital information is we can all copy it. the problem is that our economic paradigm doesn't deal with infinite-copying well.

so far the only 'solution' I see is figuring out ways to restrict and limit the copying capabilities of digital stuff. which is not a solution, but a rejection of the 'novel' possibilities enabled by digital technologies.

more cynically, it's not a rejection, but a capture. a few get to reap the huge benefits of digital copying, while the rest are relegated to collecting, trading arbitrary tokens and using them to pay for the permission to copy digital assets; which is completely dumb, I will die on this hill


Who'd have thought a bunch of authors would be offended by critique and/or analysis of their work by an AI? Seems petty. I guess that's because only the best works get heavily scrutinized, e.g. Shakespeare's plays, and these are hardly that.


If it's fair use to find stats on a book/corpus of text, isn't it also fair use to run a transformer/something similar, since really it's just predicting/modelling what that piece of text would output next, given context?


Knowing the percent of adverbs in a book is not helpful to any writer. No writer who is worth any ink they write with thinks in percentages or thinks of any of the nonsense in that article.

P.S. Stop stealing shit from writers to train your thing.


I believe that if you are using my book as data/model/whatever for something that makes money, then I want a piece of the action.

If it's just a paragraph, fine go ahead. But the whole book? Give me money.


Feels like a nightmare. What about a book club passing around a copy of your book? Want to charge them licensing fees? Story time at a library for children. Licensing fees for them too? Review in a pay-walled newspaper. Fees, again? The NYT derives value from all those reviews, you know.


So you want money from people who read books and write reviews? I don't think that's reasonable.


If I read a book and learn something from it, I have updated a neural network in my brain. You don't get paid for that (I am not, as an example, sharing any of my income with the people who wrote my college books).

If I write my own kids' book about a boy who finds out that he is a wizard and goes to a secret wizard school, I don't owe JK any copyright money.


At two extremes, say:

If I read your books and use them as a strong foundation from which to write stories (e.g. I don't do any worldbuilding, and just re-use your settings/characters without introducing my own), it seems reasonable that my derivative would need some kind of license from you.

If I read your books and write book reviews about them (including small excerpts from the book), it seems unreasonable to require your explicit permission.

I think the website mentioned (which displays stats about a book, and a couple of excerpts) falls far closer to the latter case than the former case.


you want money if your book is used for literary analysis?

and even if you could get money for that, it would not be much. if the tool analyses a million books, and the service makes $1 million in revenue, no more than $1 of revenue can be attributed to your book. and only a fraction of that would go to the book author, because most of the revenue goes into developing and maintaining the tool. so you'd get 10c. you would make more money by spending your time writing actual prose that can be sold, instead of complaining about this particular tool.
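
the back-of-the-envelope arithmetic, with the toy numbers above:

    # Toy numbers from the comment above, not real data.
    revenue = 1_000_000            # service revenue, $
    books = 1_000_000              # books analyzed
    per_book = revenue / books     # $1.00 of revenue attributable per book
    author_cut = 0.10 * per_book   # if ~10% survives development/hosting costs
    print(f"${per_book:.2f} per book, ${author_cut:.2f} to the author")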


No.


Going all the way back to the Luddites, there will be those who seek to misrepresent their opposition as being “afraid of progress”.

Much easier to find a friendly podcaster than engage with the actual difficulties with what you’re doing.


It looks like the author of the tool took it down himself, voluntarily. If this even counts as "killing" the tool, the author did it, not fear of AI.


Quite apart from any content/copyright issues, this wasn't a useful tool.

It was a badly-made tool that didn't identify anything valid or worthwhile. The measurement of passive voice is a particularly clear example: that's not what passive voice is at all, and anyone who understands passive voice knows that it shouldn't be frowned upon [1].

"Just add tech" is not a recipe for usefulness.

[1] http://www.lel.ed.ac.uk/~gpullum/passive_loathing.html
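
For context on why automated passive-voice counts go wrong: the usual heuristic just looks for a form of "to be" followed by a past participle. A sketch of that naive heuristic (using NLTK tags; an illustration of the approach being criticized, not the tool's actual code), which can't distinguish statives like "the door was closed" from true passives, and misses get-passives like "he got fired":

    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    BE = {"am", "is", "are", "was", "were", "be", "been", "being"}

    def looks_passive(sentence):
        """Naive check: a form of 'to be', then (optional adverbs,) a VBN."""
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for i, (word, _) in enumerate(tagged):
            if word.lower() in BE:
                for _later, tag in tagged[i + 1:]:
                    if tag == "VBN":
                        return True
                    if not tag.startswith("RB"):  # allow intervening adverbs
                        break
        return False

    print(looks_passive("The ball was thrown by the boy."))  # True
    print(looks_passive("The boy threw the ball."))          # False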


Similarly, saying that something is not true does not necessarily make it not true.


Yes, that's why I provided the very clear example of passive voice.


Right, but passive voice also does not render your assertions necessarily true, though it may be able to increase the likelihood of them being perceived as such.


> The Gizmodo article has a ridiculously wrong “fair use” analysis, saying “Fair Use does not, by any stretch of the imagination, allow you to use an author’s entire copyrighted work without permission as a part of a data training program that feeds into your own ‘AI algorithm.’” Except… it almost certainly does? Again, we’ve gone through this with the Google Book scanning case, and the courts said that you can absolutely do that because it’s transformative.

I could be wrong but I'm pretty sure Fair Use doesn't mean you can download a dump of Library Genesis and feed that into your system!


You are wrong.



