> Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”
I mean, in humans it's just referred to as 'experience', 'training', or 'creativity'. Unless your experience is job-only, all the code you write is based on some source you can't attribute combined with your own mental routine of "i've been given this problem and need to emit code to solve it". In fact, you might regularly violate copyright every time you write the same 3 lines of code that solve some common language workaround or problem. Maybe the solution is CoPilot accompanying each generation with a URL containing all of the run's weights and traces so that a court can unlock the URL upon court order to investigate copyright infringement.
> If someone violated the copyright of a song by sampling too much of it and released it in the public domain (or failed to claim it at all), and you take the entire sample from them, would that hold up in a legal setting? I doubt it.
In general you're not liable for this. While you still will likely have to go to court with the original copyright holder's work, all the damages you pay can be attributed to whoever defrauded or misrepresented ownership over that work. (I am not your lawyer)
> > Is it a valid defense against copyright infringement to say “we don’t know where we got it, maybe someone else copied it from you first?”
> I mean, in humans it's just referred to as 'experience', 'training', or 'creativity'. Unless your experience is job-only, all the code you write is based on some source you can't attribute combined with your own mental routine of "i've been given this problem and need to emit code to solve it". In fact, you might regularly violate copyright every time you write the same 3 lines of code that solve some common language workaround or problem.
Aren't you moving the goal posts? This is not 3 lines, but instead is 1 to 1 reproducing a complex function that definitely has enough invention height to be copyright able.
With high probability, what's happened here is this code is an important piece of code-infrastucture in that it's copied into a fair number of places. Which means humans are copying it without attribution or downstream of someone who did while relevant license is not propagated anywhere near as reliably.
It doesn't change licensing issue but it does mean people are already copying and using copyrighted code without respecting original license and no AI involved.
There should be a way to reverse engineer code LLMs to see which core bits of memorized code they build on. Another complex option is a combination of provenance tracking and semantic hashing on all functions in code used for training. Another option (non-technical) is a rethinking of IP.
>With high probability, what's happened here is this code is an important piece of code-infrastucture in that it's copied into a fair number of places. Which means humans are copying it without attribution or downstream of someone who did while relevant license is not propagated anywhere near as reliably.
The original poster said it was in a private repository.
>It doesn't change licensing issue but it does mean people are already copying and using copyrighted code without respecting original license and no AI involved.
I don't get the argument. Many people are copying/pirating MS windows/MS office. What do you think MS would say to a company they caught with unlicensed copies and they used the excuse "the PCs came preinstalled with Windows and we didn't check if there was a valid license"?
The training set for C was algol and a bunch of other languages.
AI could be used to create languages based on design criteria and constraints like C was, but it does bring up the question of why one of the constraints should be character encodings from human languages if the final generated language would never be used by humans...
I mainly think it's funny watching all of these Rand'ian objectivists reusing ever excuse used by every craftsman that was excised from working life...machines need a machinist, they don't have souls or creativity, etc.
Industry always saw open source as a way to cut cost. ML trained from open source has the capability to eliminate a giant sink of labor cost. They will use it to do so. Then they will use all of the arguments that people have parroted on this site for years to excuse it.
I'm a pessimist about the outcomes of this and other trends along with any potential responses to them.
The problem here is that copilot explots a loophole that allows it to produce derivative works without license. Copilot is not sophisticated enough to structure source code generally- it is overtrained. What is an overtrained neural network but memcpy?
the problem isn't even that this technology will eventually replace programmers: the problem is that it produces parts of the training set VERBATIM, sans copyright.
No, I am pretty optimistic that we will quickly come to a solution when we start using this to void all microsoft/github copyright.
I mean, in humans it's just referred to as 'experience', 'training', or 'creativity'. Unless your experience is job-only, all the code you write is based on some source you can't attribute combined with your own mental routine of "i've been given this problem and need to emit code to solve it". In fact, you might regularly violate copyright every time you write the same 3 lines of code that solve some common language workaround or problem. Maybe the solution is CoPilot accompanying each generation with a URL containing all of the run's weights and traces so that a court can unlock the URL upon court order to investigate copyright infringement.
> If someone violated the copyright of a song by sampling too much of it and released it in the public domain (or failed to claim it at all), and you take the entire sample from them, would that hold up in a legal setting? I doubt it.
In general you're not liable for this. While you still will likely have to go to court with the original copyright holder's work, all the damages you pay can be attributed to whoever defrauded or misrepresented ownership over that work. (I am not your lawyer)