
>For the 16 plaintiffs, the complaint indicates that they used ChatGPT, as well as other internet services like Reddit, and expected that their digital interactions would not be incorporated into an AI model.

I don't expect this lawsuit to lead anywhere. But if it does, I hope it leads to some clear laws regarding data privacy and how TOS is binding. The recent ruling regarding web scraping makes the case against OpenAI a lot weaker. [1] Scraping publicly available data is legal. People's consent wasn't needed for their data to be used; there was an implicit assumption the moment the data was published to the public, as on Reddit or YouTube.

I keep seeing this idea recur in the suit:

>Plaintiff ... is concerned that Defendants have taken her skills and expertise, as reflected in [their] online contributions, and incorporated it into Products that could someday result in [their] professional obsolescence ...

Anyone is able to file a suit; I wish people would stop assuming that a news report automatically means it has merit.

1. https://www.natlawreview.com/article/hiq-and-linkedin-reach-...



> But if it does, I hope it leads to some clear laws regarding data privacy and how TOS is binding.

One of the "I wonder where this will go" things with the Reddit and Twitter exoduses to ActivityPub-based systems is that it is trivial for something to federate with them and slurp data without any TOS interposed.

The TOSes for these systems are typically based on what can be pushed to them, not what can be read (possibly multiple federation hops downstream).
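To make that concrete, here's a minimal sketch, assuming a Mastodon instance that serves its public timeline without authentication (many do by default); the instance URL and limit are just example choices, not a recommendation:

    # Hedged sketch: pulling public posts from a Mastodon instance's public
    # timeline endpoint without logging in or accepting any TOS. Whether this
    # works depends on the instance's configuration; mastodon.social is only
    # an example here.
    import json
    import urllib.request

    INSTANCE = "https://mastodon.social"

    def fetch_public_timeline(limit=5):
        # Fetch recent public posts with a bare, unauthenticated request.
        url = f"{INSTANCE}/api/v1/timelines/public?limit={limit}"
        req = urllib.request.Request(url, headers={"User-Agent": "example-reader/0.1"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    if __name__ == "__main__":
        for status in fetch_public_timeline():
            print(status["account"]["acct"], "-", status.get("url"))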


Article titles which specify the amount the plaintiffs are claiming are a good indicator of poor journalism.

You can usually disregard such articles as you can expect biased/incomplete reporting.

Lawsuit claim amounts have zero bearing on reality. An amount must be specified in any complaint, but lawyers just always specify massive amounts without justification.

Any reporting on this amount indicates ignorance of the system or intentional dishonesty.


Also note that the damages typically can't be adjusted up, only down.


Regardless of access rights to the data, I've yet to read a compelling argument why LLMs are even derivative works. You can't identify your Reddit comment in a ChatGPT conversation. How is it any different than a human learning English by reading Reddit? That human wouldn't be violating copyright every time they said a phrase that was repeated by hundreds of Redditors.

My favorite LLM analogy so far is the "lossy jpeg of the web." Within that metaphor, I don't see how anyone can claim copyright on the basis of a pixel they contributed that doesn't even show up in the lossy jpeg. They can't point to it.


I've been thinking of the output as fanfiction/fan art. It shares many of the same complications regarding the ownership of ideas, commercial intent of writing, competition, and copyright. Fanfiction is generally a protected form of expression, but requires the work to be "transformative". Unlike with parodies and criticisms, fanfiction can be much harder to distinguish from original work. From that perspective, a large amount of the output of LLMs is so generic that it's not possible to attribute it to one person. It's like trying to find the original author of "Once upon a time".

https://theinnisherald.com/the-other-once-upon-a-times-a-his...


Fanfiction isn't as protected as many people think it is.

https://en.wikipedia.org/wiki/Legal_issues_with_fan_fiction

Fanfiction and fan art also tend to run afoul of the infrequently (but occasionally) litigated part of copyright - copyright of fictional characters.

https://en.wikipedia.org/wiki/Copyright_protection_for_ficti...

I came across this with the Eleanor lawsuits - https://www.caranddriver.com/news/a42233053/shelby-estate-wi... - and while I believe that in that instance Eleanor falls on the "this shouldn't have been copyrightable" side (it took a bit to get there), the question is "what protects the representation of Darth Vader?"

In general it tends to be ignored and tacitly encouraged... but it isn't protected.


It's more like a mirror-house of human thought. It can create countless arrangements and even execute tasks.


> Plaintiff ... is concerned that Defendants have taken her skills and expertise, as reflected in [their] online contributions, and incorporated it into Products that could someday result in [their] professional obsolescence ...

It's been a bit surreal seeing modern-day Luddites come out of the woodwork, basically coming up with any ethical/legal argument they can as a thinly veiled way of saying "I don't want to be automated!"

Not commenting on whether or not they are right per se, but it's weird seeing history repeat itself.


I don't think it's a matter of right or wrong - these are people who are behaving completely rationally given their context.

(I should caveat that I think if they get what they want, we all lose in a big way. Not that I think this is going anywhere)

We're coming up on the outer bounds of our systems of incentives. Capitalism, as a system, is designed to solve for scarcity, both in terms of resources and in terms of skill and effort. Unfortunately, one of the core mechanisms it operates on is that it's all-or-nothing. You MUST find a scarcity to solve or you divorce yourself from the flow of capital (and starve / become homeless as a result).

Thus, artificial scarcity. It's easy to spot in places like manufacturing (planned obsolescence), IP (drug/software/etc. patents), and so forth. I think this is just the rest of humanity both catching on and being caught up with. Two years ago, everyone thought they had a moat by virtue of being human. That's no longer a given.

One hopes that we'll collectively notice the rot in the foundation before the house falls over (and, critically, figure out how to act on it. We have a real problem with collective action these days that may well put us all in the ground).


As far as I remember, the Luddites were smart and not against all technology; they were just protecting their jobs. And they were ultimately right.

Why? Except for the longshoremen in the US getting compensation and an early retirement due to the introduction of containers, I know of exactly 0 (ZERO!) mass professional reconversions after a technological revolution.

Look at deindustrialization in the US, UK, Western Europe.

When this happens, the affected people are basically thrown in the trash heap for the rest of their lives.

Frequently their kids and grandkids, too.


Stables became gas stations. Nintendo used to be a toymaker.

Businesses change and adapt. Workers too — but people often don’t like change, so many choose to stay behind. Should we cater to them?

I used to do a lot of work which is now mostly automated. Things like sysadmin work, spinning up instances and configuring them manually, maintaining them. I reconverted and learned Terraform, AWS, etc. when they became popular.

Should I have gotten help from the government to instead stick to old style sysadmin work?


> Should I have gotten help from the government to instead stick to old style sysadmin work?

I don't think anyone beyond a few marginal voices are calling for a ban on job automation. What they seem to prefer is that, if they are to be automated out of a job, they should be compensated for their copyrighted works having been used in the process of doing so.

Regardless, at the very least people who are being automated should get some government support. Not everyone can easily retrain.


Suppose you're a weaver. It's hard, fiddly work, and you have to get your timing and your tension just right to make quality material. Now, there are mechanised looms that can do the job faster (though the quality's not great: they could still do with some improvement, in your opinion). From this efficiency gain, who should reap the profits?

Suppose you're a farmer. You've been working on your tractors for decades, and have even shown the nice folk at John Deere how you do it. Now they've built your improvements into the mass-produced models, and they say you can't work on your tractors any more. Who should reap the profits?

Suppose you're a writer. You've spent a long time reading and writing, producing essays and articles and books and poems and plays, honing your craft. You've got quite a few choice phrases and figures of speech in your back pocket, for when you want to give a particular impression. Now, there is a great big statistical model that can vomit your coinages (mixed in with others') all over the page, about any topic, in mere minutes. Who should reap the profits?

Suppose you're a visual artist. You enjoy spending your time making depictions of fantasy scenes: you have a vivid imagination, and so you can make a living illustrating book covers and the like. You put your portfolio online, because why not? It doesn't hurt you, it makes others happy, and maybe it gets you an extra gig or two, now and then. Except now, there's a great big latent diffusion model. Plug in “Trending on Artstation by Greg Rutkowski”, and it will spit out detailed fantasy scenes, photorealistic people, the works. Nothing particularly novel, but there was so much creativity and diversity in your artwork that few have the eye to notice the machine's subtle unoriginality. Who should reap the profits?


I've answered this before. The container revolution split some of the resulting profits with those whose livelihoods were destroyed, the longshoremen.

"You build a dam that destroys 10000 homes, who should reap the profits?"


It's a good answer, but it raises further questions:

• Should we be destroying people's homes to build dams without their consent?

• In general, are people being compensated when these things happen to them? i.e., while it might be nice, does this actually happen?

The Luddites (the real ones, not the mythological bastardisation of them) continue to be sympathetic characters.


> • In general, are people being compensated when these things happen to them? i.e., while it might be nice, does this actually happen?

The famous: "it depends" :-)

AI most likely falls under: "they should be", IMHO.


I don't think we should cater to Luddites, but (and it's a big but) if we automate enough jobs out of existence it's essentially undeniable that we will need systemic changes to avoid becoming a completely dystopian society.


But as the corollary to that, I know of zero successfully stopped technological revolutions. You can't put the genie back in the bottle, and there is no way to stop progress, aside from a one-world authoritarian government that forcibly stops as much of it as they can. But even that would only be marginally effective. Progress would eventually resume.


Yes, you do: there are technological revolutions that were stopped, and it worked for centuries.

Tokugawa Japan, Qing China, and many other places, including in Europe.

A one-world authoritarian government is too extreme, though.

My point is that we're reaching a point where people need to be compensated. We can't just destroy their lives, collect all the money in two bank accounts, and call it a day.


Bingo.

That's the real flaw in Luddite thinking -- you can destroy the machines, but you can't destroy the idea of them.


In this case I think it's a little different. People are saying that they don't want to have their own productive or creative output used to undermine their own standard of living. That's not the same as simply not wanting to have your job automated away by someone else's business innovation.


To make ChatGPT analogous to coal-mining automation, it would have to be able to automate what it does without learning from sources online.

To make coal-mining automation analogous to ChatGPT, the machinery would have had to use something the coal miner did to learn how to automate their work. I'm imagining a camera watching all of the miner's work, after which the machine can immediately do it, but better.

I agree it is a tad different, but just as someone's coal mining is out in the open for anyone in the tunnel to see, anything you write unprotected online is public and fair game, I think?


The lawsuit is far more nuanced than you're letting on. There are several aspects that come into play:

* Was it published publicly? This is basically defined in the courts as "if you make an unauthenticated web request, does the data come back?" (see the sketch below). This is where scraping comes in: if you make the data available without authentication you can't enforce your TOS, because you can't validate that people actually even accepted the TOS to begin with.

* Is the data able to be copyrighted? This is where things are interesting: facts cannot be copyrighted, which is why a lot of scrapers are able to reuse data (things like weather, sports scores, even "for hire" notices can be considered factual).

* If it would typically be considered covered by copyright, does fair use come into play?

* Are there any other laws that come into play? For example, GDPR, CCPA, or other privacy laws can still add restrictions to how data is collected and used (this is complicated by the various jurisdictions as well)

* Was the work done with the data transformative enough to allow it to bypass copyright protections? This goes back to when Google was scanning books. Because they were making a search engine, not a library, their search tool was considered transformative enough to allow them to continue.

It's not enough to say "because it's on the internet, it's fair game for everyone to use". This is a really complicated area where things are evolving rapidly, and there's a lot of intersecting law (and case law) that comes into play.
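To make the first bullet concrete, here's a minimal sketch of the "unauthenticated request" test. example.com is a placeholder, and a real check would need to handle redirects to login pages, soft 200s, paywall interstitials, and so on:

    import urllib.error
    import urllib.request

    def is_publicly_accessible(url: str) -> bool:
        # "Published publicly" test: does the content come back on a bare,
        # unauthenticated request?
        req = urllib.request.Request(url, headers={"User-Agent": "example-bot/0.1"})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.status == 200
        except urllib.error.HTTPError:
            # 401/403 are the typical "authentication required" signals.
            return False

    print(is_publicly_accessible("https://example.com/"))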


I agree that there is additional nuance, but so far public data scraping has very clearly been ruled legal. It's possible that at the time of scraping, copyrighted data was incorporated into the training data because it hadn't been taken down by the host platform yet. But in my opinion, the core idea proposed by the suit, that private data was used intentionally, is not true. The GPT-4 browsing plugin is equivalent to web scraping.

And another complication is that OpenAI is not exposing any static data. A response is generated only after prompting. I'd argue that LLMs are closer to calculators than databases in function. The amount of new information that can be added is also limited; it is not a continuous learning/training architecture.

I do hope this leads to more clear laws regarding data privacy, but I can't imagine the allegations of "intercepting communications", violating CFAA, or violating unfair competition law will hold.


My point is that you have to treat the method of collecting the data and the usage of the data as separate legal questions. Scraping is legal. What you do with the data that you scrape, though, is a whole other question.

To put it another way, it's legal for me to go to the library and borrow a DVD or a book of poems. That doesn't give me the right to publish the poems again under my own name. Whether I find the poems by scraping, borrowing the book from a library, or even just reading them off of a wall, I don't get ownership rights to that data.

The same logic applies to a lot of other laws around data. If you collect data on individuals there are a bunch of laws that come up around it, and many of them don't really concern themselves with how you got the data so much as how you use it. The fact that it was scraped doesn't grant any special legal rights.


What you describe misrepresents how LLMs/neural networks and the underlying math work; your analogy does not apply. There's no static data in the networks. The output of LLMs is much closer to parodies and fanfiction. In that case, you very clearly own the copyright to the new work you make.


That's weird, since my comment literally said nothing about LLMs. I was simply pointing out that making scraping legal doesn't invalidate any of the other data laws that were out there, and gave one example.

You keep making the claim that because it was scraped people can do whatever they want, as scraping is legal. That is the only thing I'm arguing against, because that is a gross misinterpretation of how the case that made scraping legal was decided. LLMs aren't relevant to that point (which is exactly what I keep saying: the method of collection doesn't magically change the legality of it).

That being said, you're still wrong. The US Copyright Office has said that the outputs of LLMs are the outputs of algorithms and are not creative works. Therefore you can't "own the copyright to the new work you make" because the work itself can't be copyrighted at all. No one can own the output of an LLM.

Also, just because it seems you want to be wrong on every level: it is absolutely possible for a neural network to repeat data from its training set. This is a well-known problem in the field.
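For what it's worth, a crude way to check for that kind of regurgitation is n-gram overlap between model output and a known document. A minimal sketch, with placeholder strings standing in for the training document and the model output, and an arbitrary n-gram size:

    def ngrams(text, n=8):
        # Set of all n-word sequences in the text.
        words = text.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def overlap_ratio(generated, corpus, n=8):
        # Fraction of the generated text's n-grams that appear verbatim
        # in the corpus; high values suggest memorization.
        gen = ngrams(generated, n)
        return len(gen & ngrams(corpus, n)) / len(gen) if gen else 0.0

    training_doc = "..."  # stand-in for a document from the training set
    model_output = "..."  # stand-in for output given a prefix prompt

    print(f"verbatim 8-gram overlap: {overlap_ratio(model_output, training_doc):.1%}")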


I see your perspective better now. The LinkedIn case was specifically regarding the CFAA and is relevant to the original suit against OpenAI and web scraping, but I now see you weren't discussing that. The copyright limit you mention is related to completely automated generations; it's not as clear when a human uses the tool. The UK assigns the copyright to the user/custodian of the AI. Neural network models can repeat data, but it requires a certain frequency in the training set, and the output is still probabilistic. The complication comes from the fact that there is no "copying" when training a model. Fundamentally, I think we disagree on how data-use laws apply in this situation. I appreciate you discussing this with me; it did help clear up some misunderstandings I had.

https://www.bloomberglaw.com/external/document/XDDQ1PNK00000...


Even if they were exposing static data, how would that be different than a search engine? Google has been scraping the web for two decades, indexing even explicitly copyrighted content, and then making money by selling ads next to snippets from that content. If you're going to make the case that an LLM is violating copyright, then surely you must also assert that Google is too, because it's the same concept, but Google is actually surfacing exact text from the copyrighted material.


By putting something on a public-facing website, it's generally agreed that (absent a robots.txt to the contrary) you intend it to appear in web search results, and you're granting your visitors a public, limited, semi-transferable, revocable license to request, download, and view your site.

That doesn't mean you grant a license to produce derivative works other than search indexes. Legally, it's different. (Germany codifies these as separate "moral rights": Urheberpersönlichkeitsrecht.)
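For reference, the conventional (though not legally binding) opt-out mechanism already distinguishes crawlers by user agent. A minimal sketch using Python's standard library; the bot names and URLs here are placeholders, not real crawlers:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # A site owner can grant a search indexer access while refusing a
    # (hypothetical) AI-training crawler, or vice versa.
    print(rp.can_fetch("ExampleSearchBot", "https://example.com/essay.html"))
    print(rp.can_fetch("ExampleTrainingBot", "https://example.com/essay.html"))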


These things are just not going to go anywhere, a big reason being that AI is part of a technological race. If AI research gets constrained in the US, progress will happen in China. Since that can't be allowed to happen, this won't go anywhere.


I tend to agree with you, but I also recognize I could be unrealistically optimistic. This is the legal system we're talking about. I wouldn't expect every court case to be decided fairly, nor would I expect any new laws and regulations to necessarily be sensible. Frankly, my biggest worry at this point is that regulatory capture by the first-mover AI companies will stop me from purchasing more than one GPU.

I'm not too worried about copyright issues because regardless of whatever happens with upcoming case law and legislation, any regulation against the input data will be totally unenforceable. It's nearly impossible to detect whether or not an LLM was trained on some corpus of data (although maybe there is some "trap street" equivalent that could retroactively catch an LLM trained on data it wasn't allowed to read). And even if the weights of a model are found to be in violation of some copyright, it's still not enforceable to forbid them, because they're just a bag of numbers that can be torrented and used to surreptitiously power all sorts of black boxes. That's why I'm much more worried about legislative restrictions on hardware purchases.
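A sketch of what that "trap street" equivalent might look like: plant a high-entropy canary string in published text, then later probe whether a model can complete it. query_model here is a hypothetical stand-in for whatever completion API you'd test against:

    import secrets

    def make_canary(prefix="canary"):
        # High-entropy string that can't plausibly be guessed,
        # only memorized from training data.
        return f"{prefix}-{secrets.token_hex(16)}"

    canary = make_canary()
    planted_text = f"My favorite passphrase is {canary}."  # publish this somewhere public

    def was_trained_on(query_model, prompt, canary):
        # If the model reproduces the exact canary, that's strong evidence
        # the planted text was in its training set.
        return canary in query_model(prompt)

    # Later: was_trained_on(query_model, "My favorite passphrase is", canary)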


> I hope it leads to some clear laws regarding data privacy and how TOS is binding

I hope it leads to more people realizing that a TOS doesn't override their individual rights and that the legal system works to support them.


One individual right is the right to sign away other rights in exchange for products and services.


There are limits to that -- to signing away rights. In the US, you can't sign yourself into slavery. You can't sell the right to have someone kill you.

There's sort of an exception for military service, but even soldiers have access to military courts.


Can you point to where that "right" is codified in law?


Common law of contracts dictates that you can commit to performing certain services in exchange for the counter-party performing certain services. For example, you provide money, viewing data, and permission to run DRM and proprietary code on your property (e.g. set-top boxes or smart TVs) to Netflix in exchange for obtaining access to their library of TV shows and movies.

It's codified in the fact that saying you'll do something means you're socially obligated to do it, and legally obligated if you receive something in return.


You still haven't said where it's written that all rights can be signed away. I know for a fact that you can't waive tenant rights when signing a lease, for example. We also don't allow people to sign over so many rights that they're considered slaves, as slavery has been made illegal. I also can't sign away my right to not be sexually harassed: if a company makes me sign something saying that they can sexually harass me, they will still end up losing in court. The US has also limited the ability of NDAs to cover discussions about labor practices, so there's another right we can't sign away.

It seems to me there are a ton of counterexamples to this "right" you speak of. So many that it doesn't seem like it really exists.


It is open knowledge that ~0% of people read any TOS. While ignorance is no defense for breaking laws or rules, ~0% is compelling evidence in and of itself that the process is completely broken.


> People's consent wasn't needed for their data to be used; there was an implicit assumption the moment the data was published to the public, as on Reddit or YouTube.

The same argument could be used to defend ubiquitous face recognition in the street, though ("when going out in the street, there's an implicit assumption that your presence in a public place is public"), but I'd really like it if we could not have that…

There's a case to be made that corporations gathering data and training artificial intelligence don't need to have the same rights as people: when I go out in the street or publish something on Reddit, I'm implicitly allowing other people to read my comments, but not allowing corporations to monetize them. (GDPR and the like already make this kind of distinction for personal information, by the way, so we could totally extend it to any kind of online activity.)


It becomes harder and harder to pretend that this level of data scraping and disregard for consumer privacy is acceptable when things like GDPR exist.

Just because I posted something on Reddit because I thought it was funny doesn't mean I implicitly gave anybody permission to take that post and profit from it. You're doing a disservice to consumers by acting like it's their fault for being exploited.


The fundamental issue in that situation isn't about profit; it's about the definition of what is considered publicly accessible and what consent that implies.

I disagree with you on whether it should count as being exploited. I don't see fanfiction writers or professional impersonators as inherently exploitative. I understand that some people would disagree because there is a difference in scale. But using technology to mimic and, in some sense, replace human effort is the reason it is useful.

I believe this will shift how and why people value organic media. The standard of what makes content "good" will rise in the long term. When Stable Diffusion first came out, I compared the generated art to elevator music. I feel the same way about the output of LLMs. I might feel differently in a few years if models keep getting better at the rate they currently have been, but that's not likely.

I agree that people should have more control over how their data is used, and I'd love to see this suit lead to stricter laws.



