RAG and LLMs are not the same thing, but 'Agents' incorporate both. Maybe we cou...

martin-t · 2025-12-08T01:44:49 1765158289

> if 1K people have done similar things and the AI learns from that, well, I don't think credit is something that should apply.

I think it should.

Sure, if you make a small amount of money and divide it among the 1000 people who deserve credit due to their work being used to create ("train") the model, it might be too small to bother.

But if actual AGI is achieved, then it has nearly infinite value. If said AGI is built on top of the work of the 1000 people, then almost infinity divided by 1000 is still a lot of money.

Of course, the real numbers are way larger: LLMs were trained on the work of at least 100M, but perhaps over a billion, people. But the value they provide over a long enough timespan is also claimed to be astronomical (evidenced by the valuations of those companies). It's not just their employees who deserve a cut but everyone whose work was used to train them.

> Some people might consider this the OSS dream

I see the opposite. Code that was public but protected by copyleft can now be reused in private/proprietary software. All you need to do is push it through enough matmuls and some nonlinearities.

sholain · 2025-12-08T11:51:31 1765194691

- I don't think it's even reasonable to suggest that 1000 people all coming up with variations of some arbitrary bit of code either deserve credit - or certainly 'financial remuneration' because they wrote some arbitrary piece of code.

That scenario is already today very well accepted legally and morally etc as public domain.

- Copyleft is not OSS, it's a tiny variation of it, which is both highly ideological and impractical. Less than 2% of OSS projects are copyleft. It's a legit perspective obviously, but it hasn't been representative for 20 years.

Whatever we do with AI, we already have a basic understanding of public domain, at least we can start from there.

martin-t · 2025-12-10T01:47:37 1765331257

> I don't think it's even reasonable to suggest that 1000 people all coming up with variations of some arbitrary bit of code either deserve credit

There's 8B people on the planet, probably ~100M can code to some degree[0]. Something only 1k people write is actually pretty rare.

Where would you draw the line? How many out of how many?

If I take a leaked bit of Google or MS or, god forbid, Oracle code and manage to find a variation of each small block in a few other projects, does it mean I can legally take the leaked code and use it for free?

Do you even realize to what lengths the tech companies went just a few years ago to protect their IP? People who ever even glanced at leaked code were prohibited from working on open source reimplementations.

> That scenario is already today very well accepted legally and morally etc as public domain.

1) Public domain is a legal concept, it has 0 relevance to morality.

2) Can you explain how you think this works? Can a person's work just automatically become public domain somehow by being too common?

> Copyleft is not OSS, it's a tiny variation of it, which is both highly ideological and impractical.

This sentence seems highly ideological. Linux is GPL, in fact, probably most SW on my non-work computer is GPL. It is very practical and works much better than commercial alternatives for me.

> Less than 2% of OSS projects are copyleft.

Where did you get this number? Using search engines, I get 20-30%.

[0]: It's the number of GitHub users. Though there are reportedly only ~25M professional SW devs, many more people can code but don't do so professionally.

sholain · 2025-12-10T07:12:09 1765350729

+ Once again: 1K people coming up with some arbitrary bit of content is already understood in basically every legal regime in the world as 'public domain'.

"Can you explain how you think this works? Can a person's work just automatically become public domain somehow by being too common?"

Please ask ChatGPT for the breakdown but start with this: if someone writes something and does not copyright it, it's already in the 'public domain' and what the other 999 people do does not matter. Moreover, a lot of things are not copyrightable in the first place.

FYI I've worked at Fortune 50 Tech Companies, with 'Legal' and I know how sensitive they are - this is not a concern for them.

It's not a concern for anyone.

'One Person' reproduction -> now that is definitely a concern. That's what this is all about.

+ For OSS I think the 20% number may come from those that are explicitly licensed. Out of 'all repos' it's a very tiny amount; of those that have specific licensing details it's closer to 20%. You can verify this yourself just by cruising repos. The breakdown could be different for popular projects, but in the context of AI and IP rights we're more concerned about 'small entities' being overstepped, as the more institutional entities may have recourse and protections.

I think the way this will play out is if LLMs are producing material that could be considered infringing, then they'll get sued. If they don't - they won't.

And that's it.

It's why they don't release the training data - it's full of stuff that is in a legal grey area.