This is not a suddenly emerging "gotcha". This risk has *always* existed. It exi...

larodi · 2025-05-02T06:45:42 1746168342

> "if your business value is your codebase, it's hard to build a business whilst literally giving it away".

perhaps then it comes as no surprise that some very outspoken open-source proponents do not open-source their core business components. I can understand they do it in order to exist as a company, as busineses, but I don't understand why they have to endure being shamed for staying closed-source, while all their stack is open. many such companies exist.

and let's add to this all the fact that everything released in 2025 as open-source gets slurped by LLMs for training. so, essentially, you feed bro's models breakfast with your open-source, like it or not. in the very near future we'll be able to perhaps prompt a 'retell' of Redis which is both the same software, but written so that it is not.

in essence there seems to be now little reason to open-source anything at all, and particularly if this is a core-business logic. open-source if you can outpace everybody else (with tech), or else you shouldn't care about the revenue.

miki123211 · 2025-05-02T17:47:52 1746208072

A sufficiently capable LLM might be good enough to do cleanroom design on its own, with little to no human assistance. That would destroy the entire idea of copyright as it exists for software.

You need one agent that can write a complete specification of any piece of software, either just by using it and inferring how it works, or by reverse engineering if not prohibited by the license. You then have a lawyer in the middle (human or LLM) review it, removing any bits that are copyrighted. You then need another agent that can re-implement that spec. You just made a perfectly legal clone.

Cleanroom design is a well-established precedent in the US, and has been used before, just with teams of humans instead of LLMs.

I think some companies will be completely unaffected by this, as either the behavior of their code can't easily be infered just from API calls, or because their value actually lies in users / data / business relationships, not the code directly. Stripe would be my go-to example, you can't just reverse-engineer all their optimizations and antifraud models just by getting a developer API key and calling their API endpoints. They also have a lot of relationships with banks and other institutions, which are arguably just as important to their business as their code. Instagram, Uber or Amazon also fall into this bucket somewhat.

dcow · 2025-05-03T06:25:18 1746253518

Humans can also do that, they’re just slower and presumably more expensive.

eek2121 · 2025-05-03T19:55:15 1746302115

Except that LLMs themselves are close to being killed off for the lack of clean room implementations themselves, at least here in the U.S.

palata · 2025-05-04T10:30:48 1746354648

One can hope, but I don't think it will happen.

bloomburger · 2025-05-03T21:09:04 1746306544

??????

palata · 2025-05-02T07:06:22 1746169582

I agree with the fact that LLMs are big open-source laundering machines, and that is a problem.

I mostly see it as a problem for copyleft licences. Permissive don't protect the users in the first place, so...

pbronez · 2025-05-02T11:20:28 1746184828

So who’s gonna sue an AI company asserting that all code they produce is GPL due to being trained on GPL code?

growse · 2025-05-02T13:06:13 1746191173

How is training a model on GPL code and then having it write code any different to having a human read GPL code and then write code?

Unless there's a specific copyright claim over a specific piece of code that was copied and published, it's hard to see how the GPL has any relevance.

bccdee · 2025-05-02T16:01:50 1746201710

Because, unlike humans, LLMs reliably reproduce exact excerpts from their training data. It's very easy to get image generation models to spit out screenshots from movies.

growse · 2025-05-03T08:02:33 1746259353

That doesn't mean that all of the output from an LLM trained on GPL code is a derivative work (and therefore GPL'd too).

bccdee · 2025-05-05T13:03:34 1746450214

A model that provably engages in systematic, difficult-to-detect plagiarism must itself be considered plagiaristic.

palata · 2025-05-02T16:29:15 1746203355

I see that argument over and over, and I don't understand how people can consider it makes sense.

"My clipboard learned the code, just like a human would. So it should be fine to copy-paste anything and call it my own".

"How is killing a human any different to killing a computer?"

"If humans can vote, why couldn't computers vote as well?"

Can we start at "humans are not computers", maybe?

growse · 2025-05-02T19:33:54 1746214434

> Can we start at "humans are not computers", maybe?

Sure. So it stands to reason that "computers" are not bound by human laws. So an LLM that finds a piece of copyright data out there on the internet, downloads it, and republishes it has not broken any law? It certainly can't be prosecuted.

My original point was that copyright protections are about (amongst other things) protecting distribution and derivative works rights. I'm not seeing a coherent argument that feeding a copyrighted work (that you obtained legally) into a machine is breaching anyone's copyright.

palata · 2025-05-02T22:54:27 1746226467

> So an LLM that finds a piece of copyright data out there on the internet, downloads it, and republishes it has not broken any law?

Are you even trying? A gun that kills a person has not broken any law? It certainly can't be prosecuted.

> I'm not seeing a coherent argument that feeding a copyrighted work (that you obtained legally) into a machine is breaching anyone's copyright.

So you don't see how having an automated blackbox that takes copyrighted material as an input and provides a competing alternative that can't be proven to come from the input goes against the idea of copyright protections?

growse · 2025-05-03T08:05:34 1746259534

> So you don't see how having an automated blackbox that takes copyrighted material as an input and provides a competing alternative that can't be proven to come from the input goes against the idea of copyright protections?

Semantically, this is the same as a human reading all of Tom Clancy and then writing a fast-paced action/war/tension novel.

Is that in breach of copyright?

Copyright protects the expression of an idea. Not the idea.

palata · 2025-05-03T14:15:19 1746281719

> Copyright protects the expression of an idea. Not the idea.

Copyright laws were written before LLMs. Because a new technology can completely bypass the law doesn't mean that it is okay.

If I write a novel, I deserve credit for it and I deserve the right to sell it and to prevent somebody else from selling it in their name. If I was allowed to just copy any book and sell it, I could sell it for much cheaper because I didn't spend a year writing it. And the author would be screwed because people would buy my version (cheaper) and would possibly never even hear of the original author (say if my process of copying everything is good enough and I make a "Netflix of stolen books").

Now if I take the book, have it automatically translated by a program and sell it in my name, that's also illegal, right? Even though it may be harder to detect: say I translate a Spanish book to Mandarin, someone would need to realise that I "stole" the Spanish book. But we wouldn't want this to be legal, would we?

An LLM does that in a way that is much harder to detect. In the era of LLMs, if I write a technical blog, nobody will ever see it because they will get the information from the LLM that trained on my blog. If I open source code, nobody will ever see it if they can just ask their LLM to write an entire program that does the same thing. But chances are that the LLM couldn't have done it without having trained on my code. So the LLM is "stealing" my work.

You could say "the solution is to not open source anything", but that's not enough: art (movie, books, paintings, ...) fundamentally has to be shown and can therefore be trained on. LLMs bring us towards a point where open source, source available or proprietary, none of those concepts will matter: if you manage to train your LLM on that code (even proprietary code that was illegally leaked), you'll have essentially stolen it in a way that may be impossible to detect.

How in the world does it sound like it is a desirable future?

rank0 · 2025-05-03T00:51:08 1746233468

> A gun that kills a person has not broken any law? It certainly can't be prosecuted.

Yeah dude…its an inanimate object.

palata · 2025-05-03T14:16:30 1746281790

Maybe I need to explain it: my point is that the one responsible is the human behind the gun... or behind the LLM. The argument that "an LLM cannot do anything illegal because it is not a human" is nonsense: it is operated by a human.

palata · 2025-05-02T13:04:04 1746191044

I feel like nobody cares. It sucks, I know. Like climate change, biodiversity loss, the energy crisis.

Feels like we're pretty much screwed. Doesn't mean it's not a problem.

motorest · 2025-05-02T15:22:16 1746199336

> I agree with the fact that LLMs are big open-source laundering machines, and that is a problem.

Why do you believe this is a problem? I mean, to believe that you first need to believe that having access to the source code is somehow a problem.

> I mostly see it as a problem for copyleft licences.

Nonsense.

At most, the problem lies in people ignoring what rights a FLOSS license grants to end users, and then feigning surprise when end users use their software just as the FLOSS license intended.

Also a telltale sign is the fact that these blind criticisms single out very precise corporations. Apparently they have absolutely no issue if any other cloud provider sells managed services. They single out AWS but completely ignore the fact that the organization behind ValKey includes the likes of Google, Ericsson, and even Oracle of all things. Somehow only AWS is the problem.

palata · 2025-05-02T16:26:30 1746203190

> I mean, to believe that you first need to believe that having access to the source code is somehow a problem.

How in the world did you get there from what I said? Open source code has a licence that says what the copyright owner allows or not. LLMs are laundering machine in the sense that they allow anybody to just ignore licences and copyright in all code (even proprietary code: if you manage to train on the code of Windows without getting caught, you're good).

> At most, the problem lies in people ignoring what rights a FLOSS license grants to end users

Once it's been used to train an LLM, there is no right anymore. The licence, copyright, all that is worthless.

> Also a telltale sign is the fact that these blind criticisms [...]

No clue what you are talking about here.

motorest · 2025-05-02T17:08:28 1746205708

> LLMs are laundering machine in the sense that they allow anybody to just ignore licences and copyright in all code (...)

No. Having access to the code does that. You only need a single determined engineer to do that. I mean, do you believe that until the inception of LLMs the world was completely unaware of the whole concept of reverse engineering stuff?

> Once it's been used to train an LLM, there is no right anymore.

Nonsense. You do not lose your rights to your work just because someone used a glorified template engine to write something similar. In fact, your whole blend of comment conveys a complete lack of experience using LLMs in coding applications, because all major assistant coding services do enforce copyright filters even when asking questions.

palata · 2025-05-02T22:47:01 1746226021

> do you believe that until the inception of LLMs the world was completely unaware of the whole concept of reverse engineering stuff?

The scale makes all the difference! A single determined engineer, in their whole life, cannot remotely read all the code that goes into the training phase. How in the world can you believe it is the same thing?

> Nonsense. You do not lose your rights to your work just because [...]

It is only nonsense if you don't try to understand what I'm saying. What I am saying is that if it is impossible to prove that the LLM was trained with copyrighted material, then the copyright doesn't matter.

But maybe your single determined engineer can reverse engineer any trained LLM and extract the copyright code that was used in the training?

necovek · 2025-05-02T12:51:46 1746190306

AI companies have not shied away from slurping anything available with complete disregard for the licensing of the material.

As such, code available is as much a boon to them as open source.

palata · 2025-05-03T14:17:08 1746281828

And leaked proprietary code.

vin047 · 2025-05-02T12:26:18 1746188778

This is exactly what the AGPL was made to combat against. But open source devs still choose more permissive licenses first - presumably to attract corporate clients to use their product (and because devs are suckers to large corporate interests)

palata · 2025-05-02T13:05:14 1746191114

This. They choose a permissive licence, proudly advertise it ("use us instead of our competitor because they are copyleft and we are not"), and then come whining when other competitors benefit from the very fact that they chose a permissive licence.

_msw_ · 2025-05-02T15:23:48 1746199428

There are different FOSS communities that hold different values. I come from the copyleft camp because I want to advance Software Freedom objectives for end-users. Others are more interested in advancing software developer freedom, and they find the obligations that are designed to advance end-user rights are unduly burdensome to the software developer. Articles like the one on the FreeBSD website [1] explain why they take a different position

I choose to believe that both of these sub-communities of the larger FOSS community are principled in their beliefs. I don’t see whining from FreeBSD folks about competitors, or for-profit companies using all the permissions they give with their choice of license.

[1] https://docs.freebsd.org/en/articles/bsdl-gpl/

palata · 2025-05-02T16:33:00 1746203580

> I don’t see whining from FreeBSD folks about competitors

Sure! Then that's all good! I have nothing against the use of permissive licences (though I am on the copyleft camp too, obviously). Or put it in the public domain.

My problem is with those who do and then whine about it.

_msw_ · 2025-05-02T23:12:04 1746227524

Yeah, that really grinds my gears too.

It especially bugs me when company blogs call out “abuse” when they only exist as a company because others gave them the permissions needed to build a business on software they did not author themselves!

palata · 2025-05-02T07:04:25 1746169465

> This risk has always existed.

Doesn't mean it was not a problem.

growse · 2025-05-02T07:33:25 1746171205

The solution to the old problem of "what if someone uses my code to compete with me" is "don't open source your code".

This isn't complicated. It's trade secrets 101.

I'm being disingenuous though. Of course the bait-and-switch merchants know this, they're just banking on getting enough product momentum off the free labour of others before they cash in. That's the plan all along.

plufz · 2025-05-02T11:13:48 1746184428

I think that is a little unfair. I don't know anything about the companies behind Redis and Elastic. But another possibility is that they want to make a good open source product and create some sort of business around it and find it difficult to make a waterproof moat. I'm sure that there are many other open source companies with the same basic strategy that are just more lucky, e.g. don't get AWS as competition.