I just tested it myself on a random c file I created in the middle of a rust pro...

shadowgovt · on Oct 16, 2022

Searching for the function names in his libraries, I'm seeing some 32,000 hits.

I suspect he has a different problem which (thanks to Microsoft) is now a problem he has to care about: his code probably shows up in one or more repos copy-pasted with improper LGPL attribution. There'd be no way for Copilot to know that had happened, and it would have mixed in the code.

(As a side note: understanding why an ML engine outputs a particular result is still an open area of research AFAIK.)

andreareina · on Oct 17, 2022

"It's too hard" isn't a valid reason for me to not follow laws and/or social norms. This is a predictable result and was predicted by many people; "oops we didn't know" is neither credible nor acceptable.

Spivak · on Oct 17, 2022

It’s not “oops we didn’t know” it’s, “someone published a project under a permissive license which included this code.”

If your standard is “Github should have an oracle to the US court system and predict what the outcome of a lawsuit alleging copyright infringement for a given snippet of code would be” then it is literally impossible for anyone to use any open source code ever because it might contain infringing code.

There is no chain of custody for this kind of thing which is what it would require.

vincnetas · on Oct 17, 2022

This reminds me my 4 year old daughter. She often comes from kindergarten with new toys. When i ask here, where did she get it, she tells that her friend gave this as a gift to her. When i dig deeper and ask around, i turns out that the friend who were gifting her things were not real owners of the gift. I see why i could be difficult for children to understand concept of ownership and that you should not gift things to others that are not your own.

So in this case copilot just looks at the situation like that someone gifted me this, and does not question if the person gifting was the real owner of the gift.

Spivak · on Oct 17, 2022

> and does not question if the person gifting was the real owner of the gift

If you can figure out a method of determining whether someone owns the code that doesn't involve, "try suing in court for copyright infringement and see if it sticks" then we're kinda stuck. Because just because a codebase contains an exact or similar snippet from another codebase doesn't mean that snippet reaches the threshold of copyrightable work. Or the reverse being that just because two code snippets look wildly different doesn't mean it's not infringement and detecting that automatically is solving the halting problem.

The thing you want for software to actually solve this is chain of custody which we don't have. If you require everyone assume everyone else could be lying or mistaken about infringement then using any open source project for anything becomes legal hot water.

In fact when you upload code to Github you grant them a license to do things like "display it" which you can't do if you don't actually own the copyright or have a license so even before the code is ever slurrped into Copilot the same exact legal situation arises as to wether Github is legally allowed to host the code at all. Can you imagine if when you uploaded code to Github you had to sign a document saying you owned the code and indemnifying Microsoft against any lawsuit alleging infringement o boy people would not enjoy that.

vincnetas · on Oct 18, 2022

I'll flip it around. If you can't figure out if the code is properly copyrighted, and can't afford to face consequences, don't use it.

polymatter · on Oct 17, 2022

Exactly, the chain of custody is absolutely required for this to be legal because no oracle can exist. It must be able to attribute exactly who contributed the suspect code. It must be able to handle the edge case where some humans might publish code without permission.

Either that or we effectively get rid of software copyright as copilot can be used (or even claim to be used) to launder code of license restrictions. Eg No I didn't copy your code, I used copilot and it copied your code so I did nothing wrong.

concordDance · on Oct 17, 2022

This takes place with or without copilot. The problem would be people copying code and releasing it under a different license.

roer · on Oct 17, 2022

Right, so we need a system for when a dev goes and grabs code-snippets from blogs and open-source freely licensed projects on e.g. github in which they can say that the code is from so-and-so source? So like a way to distribute and inherit git blame?

schwartzworld · on Oct 17, 2022

If someone created an AI for making movies, and it started spitting out star wars and marvel stuff, you can bet them saying "we trained it on other materials that violate copywriter" wouldn't be enough. They are banking on most devs not knowing, caring or having the ability to follow through on this.

cowtools · on Oct 17, 2022

I am going to make a robot that burns your house down. You might think this is unethical, but what you expect me to do? Implement an oracle to the US court system?

You might think it's unreasonable to build such a house-burning robot, but you have to realize that I actually designed it as a lawn-mowing robot. The robot will simply not value your life or property because its utility function is descended from my own, so may burn your house down in the regular course of its duties (if it decides to use a powerful laser to trim the grass to the exact nanometer). Sorry neighbor.

What do you expect me to do? NOT build this robot? How dare you stand in the way of progress!

ianbutler · on Oct 16, 2022

Yeah that's a mess, but that's way too much legal baggage for me, an otherwise innocent end user, to want to take on. Especially when I personally tend to try and monetize a lot of my work.

I understand there's no way for the model to know, but it's really on Microsoft then to ensure no private, or poorly licensed or proprietary code is included in the training set. That sounds like a very tall order, but I think they're going to have to otherwise they're eventually going to run into legal problems with someone who has enough money to make it hurt for them.

twaw · on Oct 17, 2022

Open source code is open source when it license is obeyed only. When license is not obeyed, e.g. copyright notice is no reproduced, it should be treated as private code, except for dual-licensed code.

kwhitefoot · on Oct 17, 2022

Of course the model can know if the code is repeated in multiple repositories with different licences. The people who maintain copilot simply don't care to make it do so.

shadowgovt · on Oct 16, 2022

Agreed. Silver lining: MS is now heavily incentivized to invest in solutions for an open research problem.

enragedcacti · on Oct 16, 2022

Expanding on that, even if Microsoft sees the error of their ways and retrains copilot against permissively licensed source or with explicit opt-in, it may get trained on proprietary code the old version of copilot inserted into a permissively licensed project.

You would have to just hope that you can take down every instance of your code and keep it down, all while copilot keeps making more instances for the next version to train on and plagiarize.

scrame · on Oct 17, 2022

> Microsoft

> sees the error of their ways

You must be new here.

mattigames · on Oct 17, 2022

It doesn't matter that there is not way for copilot to know what happened, doing something illegal because hundreads of people did it before it's never a valid excuse under the rule of law; nor it is "I didn't know it was illegal". Regardless if it's copying code without permission or jaywalking.

neop1x · on Oct 17, 2022

>> his code probably shows up in one or more repos copy-pasted with improper LGPL attribution

That is why Copilot should have always been opt-in (explicitly ask original authors to provide their code to copilot training). Instead, they are simply stealing the code of others.

manholio · on Oct 17, 2022

> his code probably shows up in one or more repos copy-pasted with improper LGPL attribution.

Can Copilot prove that and link to the source LGPL code whenever it reproduces more than half a line of code from such a source?

Because without that clear attribution trail, nobody in their right mind would contaminate their codebase with possibly stolen code. Hell, some bad actor might purposefully publish a proprietary base full of stolen LGPL code, and run scanners on other products until they get a Copilot "bite". When that happens and you get sued, good luck finding the original open source code both you and your aggressor derive from.

cerved · on Oct 17, 2022

> There'd be no way for Copilot to know that had happened

All lines are associated to a commit, which has author/commit date. A reasonable guess as to which snippet was made first can be done

vintermann · on Oct 17, 2022

Well yes, there'd be no way for the copilot model, as currently specified and trained, to know.

But it IS possible to train a model for that. In fact, I believe ML models can be fantastic "code archaeologists", giving us insights into not just direct copying, but inspiration and idioms as well. They don't just have the code, they have commit histories with timestamps.

A causal fact which these models could incorporate, is that we know data from the past wasn't influenced by data from the future. I believe that is a lever to pry open a lot of wondrous discoveries, and I can't wait until a model with this causal assumption is let loose on Spotify's catalog, and we get a computer's perspective on who influenced who.

But in the meantime, discovering where copy-pasted code originated should be a lot easier.

pca006132 · on Oct 17, 2022

Ah, a plagiarism checker that can understand simple code transformation and find the original source? Sounds like a good idea for patent trolls and I have no idea about how/if copyright laws can be apply in this case. Does copying the idea but not copying the code verbatim constitutes copyright violation?

vintermann · on Oct 17, 2022

The patent troll version of the algorithm needs the victim's bank balance as input too. In fact that's probably all it needs.

It would be much more valuable for people who care about the truth.

chiefalchemist · on Oct 16, 2022

I thought the same thing. But then shouldn't CP look at things it's not supposed to use and see if that's happened? How is that any different than you committing your API to Platform X and shortly thereafter Platform X reaches out to you...because GH let them know?

armchairhacker · on Oct 17, 2022

This is exactly what I was thinking. It's still a legal headache for Microsoft but it's not like they're just blatantly ignoring the license

kitsune_ · on Oct 17, 2022

There'd be no way for Copilot to know that had happened? What? YT uses Content ID. GH could set up a similar program for OSS.

kbelder · on Oct 17, 2022

>GH could set up a similar program for OSS.

What a nightmare.

I'd say that constant code copying is massively pervasive, with no regard to licensing, always has been, and that's not really a bad thing, and attempts to stop it are going to be far more harmful than helpful.

bamboozled · on Oct 17, 2022

So I’m working on a side project and it’s hosted on GitHub, does this mean the code (which I consider precious and my own) can just be stolen and injected into my competitors codebase who is using co-pilot?

If this is the case, I can imagine people migrating of GitHub very quickly. I can also imagine some pretty nice lawsuits opening up.

concordDance · on Oct 17, 2022

A good rule of thumb is if you're worried about code being copied you really shouldn't put it on github. Even if most large companies respect copyright, that small studio in Russia certainly won't.

bamboozled · on Oct 17, 2022

How would the small studio access my private repo ?

frankacter · on Oct 18, 2022

They wouldn't. If you have it in a private repository and you are the only one with access to it, you'll likely not run into this issue.

From other comments, this developers "private" code was found in 30k+ public repositories with public attribution which is what created this issue.

Presumably your private code is not also present or leaked to any public repositories.

bamboozled · on Oct 20, 2022

The, this brings up a very, very interesting question, people have stolen code or leaked it into public repositories, and Microsoft are building a product that references that code.

naikrovek · on Oct 16, 2022

what proprietary code? the guy on Twitter is seeing his own GPL code bring reproduced. nothing proprietary there.

do you have the "don't reproduce code verbatim" preference set?

ianbutler · on Oct 16, 2022

Sorry it would likely be more correct to say "improperly licensed" code and not proprietary. Still for someone like me, the possibility of having LGPL, or any GPL licensed code generated in their project is a solid no thanks. I know others may think differently but those are toxic licenses to me.

Not to mention this code wasn't public so it's kind of moot, having someone's private code be generated into my project is very bad.

As to the option, I do not, I wasn't even aware of the option, but it's pretty silly to me that's not on by default, or even really an option. That should probably be enabled with no way to toggle it without editing the extension.

naikrovek · on Oct 17, 2022

[flagged]

ianbutler · on Oct 17, 2022

You said, "don't reproduce code verbatim" as the option. Maybe you should use the actual name of the option if you want people to know what you're talking about. The "with public code blocked" option I am aware of, and is distinctly different from what you said. Don't be a dick, and if you want to be a dick, be right the first time or piss off.

naikrovek · on Oct 18, 2022

well unlike you I am human. I say things differently at different times.

webstrand · on Oct 16, 2022

He owns the copyright to the code, and the code is not in the public domain, therefore it is proprietary code.

yjftsjthsd-h · on Oct 16, 2022

That's not how anybody uses the word proprietary when dealing with software licensing. It's a term of art that stands in contrast to open source licenses.

ianbutler · on Oct 16, 2022

For the record, I don't typically think in terms of the open source community.

I grant that if most people are using it one way here I was likely wrong for the way it is typically used by the normal open source community, I followed up with a reply saying it would likely be more correct for me to have said "improperly licensed" to be included in the training set.

Still it being private means it probably shouldn't be in the training set anyway regardless of license, because in the future, truly proprietary code could be included, or code without any license which reserves all right to the creator.

naikrovek · on Oct 17, 2022

that is not what proprietary means and you know it.

IncRnd · on Oct 16, 2022

The code is not GPL but is copyrighted in his name.

aeyes · on Oct 17, 2022

It is also allowed to be used under LGPL terms: https://github.com/DrTimothyAldenDavis/SuiteSparse/blob/mast...

But that doesn't make it any better.

IncRnd · on Oct 17, 2022

Yes, but I was pointing that out to my parent poster, who erronesouly said the code was GPL licensed.

naikrovek · on Oct 17, 2022

some of it is lgpl and some of it is gpl. code contributed by others is licensed differently.

IncRnd · on Oct 17, 2022

The license for other code, even when in the same repo, doesn't apply to the license on the code that is under discussion here.

naikrovek · on Oct 17, 2022

GPL'd code has a copyright owner.

those two things exist at the same time.

try reading a licence now and again!

IncRnd · on Oct 17, 2022

Slow down there with the snarky comments. I never said GPL'd code doesn't have a copyright owner.

twaw · on Oct 17, 2022

Any code is proprietary by default. GNU GPL license lifts some restrictions, in excange of more code, but it doesnt work when license is broken. Look at cases about GPL violation.

Copiloot doesn't obey GPL license, so they need to obtain written permission and pay license fees to be able to use code in their product.

IncRnd · on Oct 17, 2022

Thank you for that, though it seems you replied to the wrong comment. That doesn't have to do with what I wrote.