It does address it, although not that clearly. This happens all the time with news media. They will post a picture and say they got permission from X, but X didn't actually own the copyright in the first place. That doesn't make any of it okay, but it does mean that the organization has legal cover in this case, and the worst that will happen is that they'll have to take the content down. In GitHub's case, if that same code snippet is found in other repos that have different licensing, then it's difficult to prove who owns the copyright; it's a legal issue between the original copyright owner and the person who redistributed the work. The owner can submit a DMCA takedown notice for the other repos. But it's pretty unlikely GitHub gets into any legal trouble as long as they can prove that they got the snippet from someone else.
That code seems to appear in thousands of repositories on GitHub, and I’m sure some of them haven’t copied the license.
The vast majority of people who would use a matrix transform function they got from code completion (or from a GitHub or Stack Overflow search) probably don’t care what the license is. They’ll just paste in the code. To many developers, publicly viewable code is in the public domain. Copilot just shortens the search by a few seconds.
Microsoft should try to do better (I’m not sure how), but the sad fact is that trying to enforce a license on a code fragment is like dropping dollar bills on the sidewalk with a note pinned to them saying “do not buy candy with this dollar”.
What’s the most GitHub could reasonably be expected to do? If multiple licenses are found for the same code, then maybe it should be flagged for review, or the most restrictive license applied.
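A rough sketch of what that flagging could look like, in Python. Everything here is made up for illustration: the `RESTRICTIVENESS` ranking, the `fingerprint` and `find_license_conflicts` helpers, and the assumption that hashing whitespace-normalized text is enough to match copies. None of it is anything GitHub is known to do.

```python
import hashlib
from collections import defaultdict

# Hypothetical ordering from least to most restrictive (an assumption, not legal advice).
RESTRICTIVENESS = {
    "Unlicense": 0,
    "MIT": 1,
    "Apache-2.0": 2,
    "LGPL-3.0": 3,
    "GPL-3.0": 4,
    "AGPL-3.0": 5,
}


def fingerprint(snippet: str) -> str:
    """Hash a snippet after collapsing whitespace, so reformatted copies still match."""
    normalized = " ".join(snippet.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_license_conflicts(files):
    """Given (repo, license_id, snippet) tuples, return snippets seen under more than one license.

    The result maps each fingerprint to (sorted license list, most restrictive license).
    Unknown licenses are treated as the most restrictive, to err on the side of caution.
    """
    licenses_by_print = defaultdict(set)
    for _repo, license_id, snippet in files:
        licenses_by_print[fingerprint(snippet)].add(license_id)

    unknown_rank = len(RESTRICTIVENESS)
    conflicts = {}
    for print_, licenses in licenses_by_print.items():
        if len(licenses) > 1:
            strictest = max(licenses, key=lambda lic: RESTRICTIVENESS.get(lic, unknown_rank))
            conflicts[print_] = (sorted(licenses), strictest)
    return conflicts
```

Feeding it the same function from an MIT repo and a GPL-3.0 repo would flag that snippet and pick GPL-3.0 as the license to apply. It says nothing about who actually wrote the code first.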
Do we want that though? I personally believe copyright as implemented today is harmful. The fact that code largely is able to dodge this could be seen as arguing we should be laxer with copyright, rather than arguing for strict enforcement of copyright on code.
That would only work if the original was uploaded to GitHub before the copies. Like, somebody could copy from GitLab or BitBucket. And git histories don’t always help if they’re not copied over.
But copyright law doesn't really care about how you prevent infringement, just that it doesn't happen. Isn't it up to GitHub to come up with a way to do it, or otherwise not do it at all?
GitHub just needs to show they have taken reasonable precautions, and if a conflict is identified, that they remediate it without undue delay.
It’s not a binary choice between doing it perfectly and not doing it at all. The law looks at intent, and so doesn’t punish mistakes or errors so long as you aren’t being malicious, reckless, or negligent.
> No provider or user of an interactive computer service shall be treated as the publisher or speaker of any information provided by another information content provider
So the act of hosting copyrighted content is not actually a copyright violation for GitHub. They're not obligated to preemptively determine who the original copyright owner of some piece of code is, as they're not the judge of that in the first place. Even if you complain that someone stole your code, how is GitHub supposed to know who's lying? Copyright is a legal issue between the copyright holder and the copyright infringer. So the only thing GitHub is required to do is to respond to DMCA takedown notices.
Yes. GitHub can get away with "oh well, we're all learning" because if the code is violating copyright, it's the user who is infringing directly by publishing it, not GitHub via Copilot. Either the user would have to bring a case against GitHub demonstrating liability (good luck) or the copyright holder would have to bring a case against GitHub demonstrating copyright violation (again, good luck). Otherwise this is entirely between the copyright holder and the Copilot user, legally speaking.
Of course, if someone does manage to set a precedent that including copyrighted works in AI training data without an explicit license is infringement, GitHub Copilot would be screwed and would at best have to start over with a blank slate if it can't be grandfathered. But this would affect almost all products based on the recent advancements in AI, and they're backed by fairly large companies (after all, GitHub is owned by Microsoft, a lot of the other AI stuff traces back to Alphabet, and there are a lot of startups funded by huge and influential VC companies). Given the US's history of business-friendly legislation, I doubt we'll see copyright laws being enforced against training data unless someone upsets Disney.
Do you think that as part of this GitHub discovered that essentially everyone was in violation of copyright? That copyright of material without public knowledge or review (which exists in music, but not most code) is basically unenforceable?
Then they decided to wade in and build a house of cards where the cards are everyone else’s code, just waiting for the grenade pin puller and we’ve potentially witnessed the moment?
That’s the only thing that makes sense to me here. They don’t care because opening the issue will bring down everyone else with them.
Yeah, so if a news agency publishes a picture without knowing where it came from, the originator can sue them for violating copyright.
There is no “I don’t know who owns the IP” defense: the image has a copyright, a person owns that copyright, and publishing the image without licensing or purchasing the copyright is a violation. The fine is something like $100k per offense for a business.
FWIW, this in consequence means you can't legally use Copilot without becoming liable for copyright violations, because it's essentially a black box and you have no insight into where the code it generates originated. Even if it isn't a 1-to-1 copy, it might be a "derivative work".
This is why I'm gnashing my teeth whenever I hear companies being fine with their employees using Copilot for public-facing code. In terms of liability, this is like going back from package managers to copying code snippets off blogs and forum posts.
Why this restriction on public-facing code? Are you OK with Copilot being used for "private"/closed source code? I get that it would be less likely to be noticed if the code is not published, but (if I understand right) it is even worse for license reasons.
I don't advocate people use Copilot for anything but hobby toy projects.
I have lower expectations of the rigor with which companies police their internal codebases, though. Seeing Copilot banned for internal use too is a pleasant surprise. Companies tend to be a lot more "liberal" in what kind of legal liabilities they accept for their internal tooling in my experience.
Turn the parties in this argument around and see if you think it still holds.
J. Random Hacker acquires and uses a copy of some of GitHub's or Microsoft's source. When sued, the defense says that the code was not taken directly from GH/MS, just copied from a newsgroup where it had been posted. Does this get J. off the hook?
Was J. using automated methods based on false claims of ownership by the newsgroup posters, with no direct knowledge of the violation? If so, J. should not be punished.
I may be misinformed but my understanding of copyright is that it protects the 'expression' of something (like an algorithm or recipe) so someone can rewrite a copyrighted chunk of code into another language and be free of the original copyright, while also able to assert their own copyright on their new expression.
If that is true then one way to get around copyright restrictions on existing code is to create a new language.
Fascinating idea. Copilot could do the translations internally and also work towards widening the pool of suggestions to all languages instead of just the individual language a user is using (but then again, they might be writing in the "new" language already).