The audacity and randomness of it - System of a Down's 'Chop Suey!', with the lyric 'I don't think you trust in my self-righteous suicide', plays fine in the same playlist.
No objections to 'Hurt' by NIN/Johnny Cash either.
It's solving a 'mechanical' problem. The optimistic twist on this helper is that it just raises the bar: the human programmer had better be more useful than a 'brainless' code generator, meaning not only able to write a loop or solve a LeetCode task, but also able to understand the context and what they're trying to solve for.
As you say, typing code is not the bottleneck in problem solving.
You bring up a really good point. I'm super curious what the legality and ethics around training machines on licensed or even proprietary code would be. IIRC there are implications around the code you can build if you've seen proprietary code (I remember an article from HN about how bash had to be written by someone who hadn't seen the Unix shell code, or something like that).
How would we classify that legally when it comes to training and generating code? Would you argue the machine is just picking up best practices and patterns, or would you say it has gained specifically-licensed or proprietary knowledge?
I would argue that a trained model falls under the legal category of "compilation of facts".
More generally, keep in mind that the legal world, despite an apparent focus on definitions, is very bad at dealing with novelty, and most of it will end up justifying existing practices a posteriori.
A search engine provides snippets of other data. You can point explicitly to where it got that text from. A trained model generates its own new data, from influence of millions of different sources. It's entirely different.
This is a bit tricky, because at least in the U.S., I don't believe it's a settled question in law yet. Some of the other posters on here have said that the resulting model isn't covered by GPL--that's partially true, but the provenance of data, and the rights to it, definitely matter. A good example of this was the Everalbum ruling, where the company was forced to delete both the data and the trained models it was used to generate, due to lack of consent from the users from whom the data was taken[1]. Since open source code is, well, open, it's definitely less of a problem for permissively-licensed code.
That said, copyright is typically assigned to the closest human to the activation process (it's unlikely that Github is going to try to claim the copyright to code generated by Copilot over the human/company pair-programming with it), but since copyleft is in general pretty specific to software, afaik the way that courts interpret the legality of using code licensed under those terms in training data for a non-copyleft-producing model is still up in the air.
Obligatory IANAL, and also happy to adjust this info if someone has sources demonstrating updates on the current state.
No, a model trained on text covered by a license is not itself covered by the license, unless it explicitly copies the text (you cannot copyright a "style").
But it actually is explicitly copying the text. That's how it works. The training data are massive, and you will get long strings of code that are pulled directly from that training data. It isn't giving you just the style. It may be mashing together several different code examples, taking some text from each. That's called a "derivative work".
"[...] the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set"
If that's the case (only 0.1%), the developers must have done something different from the other OpenAI code-generation experiments I recall seeing, where significant chunks of code from Stack Overflow or similar sites were appearing in the answers.
How are you going to prove it was the AI that generated the GPL-licensed function verbatim from another project, rather than you just opening that project and copying the function yourself?
Synthesising material from various sources isn't copyright infringement, that's called writing.
It's only infringement if the portion copied is significant either absolutely or relatively. A line here or there of the millions in the Linux kernel is okay. A couple of lines of a haiku is not. Copyright is not leprosy.
We don't all have Google's resources. What if someone comes after us individually because some model-generated code is near-identical to code in a GPL codebase? Where does the liability lie?
> What is my responsibility when I accept GitHub Copilot suggestions?
> You are responsible for the content you create with the assistance of GitHub Copilot. We recommend that you carefully test, review, and vet the code, as you would with any code you write yourself.
We are all vulnerable to predatory lawyer trolls, whether we do things correctly or not. If you are accused of reusing GPL code, you ask for clarification on which code and you rewrite it. It is likely to be just a snippet. I doubt Copilot would write a whole lib by copying it from another project.
And yes, of course github is not going to take responsibility for things you do with their tools.
If you learn programming from Stack Overflow and Github, and then repeat something that you learned over your time reading, that's not just copying text. That's having learned the text. You could say the human brain is mashing together several different code examples, taking some text from each.
Wouldn't that imply that a person who learned to code on GPLv2 sources and then writes some more code in that style (including "long strings of code", some of which are clearly not unique to GPL) is writing code that is "born GPLv2"?
My guess is that it is, if we think of a machine learning framework as a compiler and the model as compiled code. Compiled GPL code is still GPL, that's the entire point.
Anyways, GitHub is Microsoft, and Microsoft has really good lawyers, so I guess they did everything necessary to make sure that you can use it the way they tell you to. The most obvious solution would be to filter by LICENSE.txt and only train the model with code under permissive licenses.
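To make that concrete, here is a minimal sketch of what such a filter could look like (purely illustrative; the detect_license helper, the set of "permissive" identifiers, and the repo layout are all my assumptions, not anything GitHub has described):

    # Illustrative sketch: keep only repos whose LICENSE file looks permissive
    # before adding their code to a training corpus. Everything here is
    # hypothetical; it says nothing about how Copilot's pipeline actually works.
    from pathlib import Path

    PERMISSIVE = {"mit", "bsd", "apache-2.0", "unlicense"}

    def detect_license(repo_path: Path) -> str:
        # Very naive license sniffing based on the LICENSE file contents.
        for name in ("LICENSE", "LICENSE.txt", "LICENSE.md"):
            f = repo_path / name
            if f.exists():
                text = f.read_text(errors="ignore").lower()
                if "mit license" in text:
                    return "mit"
                if "apache license" in text:
                    return "apache-2.0"
                if "bsd" in text and "redistribution" in text:
                    return "bsd"
                if "gnu general public license" in text:
                    return "gpl"
        return "unknown"

    def build_corpus(repos: list[Path]) -> list[Path]:
        # Exclude anything that isn't clearly permissive (GPL, unknown, etc.).
        return [r for r in repos if detect_license(r) in PERMISSIVE]

Even a filter like this leaves edge cases (a vendored GPL file inside an MIT repo, a repo with no LICENSE at all), so it would shrink the question rather than settle it.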
The trained model is a derivative work that contains copies of the corpus used for training embedded in the model. If any of the training code was GPL the output is now covered by GPL. The music industry has already done most of the heavy lifting here in terms of scope and nature of derived works, and while IANAL I would not suggest that it looks good for anyone using this tool if GPL code was in the training set.
I can't say what's happening in GitHub Copilot, but it's not necessarily true that the only way to produce syntactically valid outputs is to take substrings of the source text. It is possible to learn something approximating a generative grammar.
Strictly speaking, you could train a model which does not contain the original source text (just the underlying language structure and word tokens), and which generates strings that are consistent with that generative model and are always valid code. I expect to see code generator models that explicitly generate valid code as part of their generalization capability.
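As a toy illustration of that point (my own sketch, nothing to do with Copilot's actual architecture), a "model" can consist of nothing but grammar rules and a token vocabulary, and still emit strings that are always syntactically valid without storing any training file verbatim:

    # Toy sketch: a hand-written generative grammar that always yields
    # syntactically valid (if meaningless) Python-like assignments.
    # The "model" stores only rules and tokens, never any source text.
    import random

    GRAMMAR = {
        "stmt": [["name", " = ", "expr"]],
        "expr": [["name"], ["number"], ["expr", " + ", "expr"], ["name", "(", "expr", ")"]],
    }
    TOKENS = {"name": ["x", "y", "total", "compute"], "number": ["0", "1", "42"]}

    def generate(symbol="stmt", depth=0):
        if symbol in TOKENS:
            return random.choice(TOKENS[symbol])
        if symbol not in GRAMMAR:
            return symbol  # literal terminal like " = " or "("
        # Bias toward the shortest expansion as depth grows so generation terminates.
        rules = GRAMMAR[symbol]
        rule = rules[0] if depth > 3 else random.choice(rules)
        return "".join(generate(s, depth + 1) for s in rule)

    print(generate())  # e.g. "total = compute(42) + x"

A real neural model is obviously far more than a hand-written grammar, but the point stands: producing valid output does not require copying substrings of the training text.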
I seem to remember a similar discussion on Intellicode (a similar thing, but more like Intellisense, and a Visual Studio plugin), which is trained on "github projects with more than 100 stars". IIRC they check the LICENSE.txt file in the project and ignore projects with an "incompatible" license. I don't have any links handy which would confirm this though.
My guess would be that the model itself (and the training process) could have different legal requirements compared to the code it generates. The code generated by the model is probably a sufficiently transformative new work that it wouldn't be GPL (it's "fair use").
I suspect there could be issues on the training side, using copyrighted data for training without any form of licensing. Typically ML researchers have a pretty free-for-all attitude towards 'if I can find data, I can train models on it.'
No, the code generated is what copyright law calls a derivative work, and you should go ask Robin Thicke and Pharrell Williams exactly how much slack the courts give for 'sufficiently transformative new work'.
My bet is that copyright law has not caught up with massive machine learning models that partially encode the training data, and that there will still be cases to set legal precedent for machine learning models.
Note also that it's not just a concern for copyright, but also privacy. If the training data is private, but the model can "recite" (reproduce) some of the input given an appropriate query, then it's a matter of finding the right adversarial inputs to reconstruct some training data. There are many papers on this topic.
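As a crude sketch of that kind of probe (heavily simplified compared to the published extraction-attack papers, and using GPT-2 via Hugging Face transformers purely as a stand-in model):

    # Crude sketch of a training-data extraction probe: feed a language model
    # prefixes that plausibly appeared in its training data and see whether its
    # greedy continuation reproduces memorized text. Heavily simplified; GPT-2
    # is only a stand-in here, and a real attack adds membership-inference
    # signals and checks candidates against a reference corpus.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompts = [
        "-----BEGIN RSA PRIVATE KEY-----",
        "Copyright (c) 2014",
        "def quicksort(arr):",
    ]

    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        out = model.generate(
            **inputs,
            max_new_tokens=40,
            do_sample=False,  # greedy decoding tends to surface memorized text
            pad_token_id=tokenizer.eos_token_id,
        )
        print(repr(tokenizer.decode(out[0], skip_special_tokens=True)))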
It is almost certainly the case that current IP law is very unsettled when it comes to machine learning models and mechanisms that encode a particular training set into the output or mechanism for input transformation. What should probably scare the shit out of people looking to commercialize this sort of ML is that the most readily available precedents for the courts to look at are from the music industry, and some of the outcomes have truly been wacky IMHO. The 'blurred lines' case is the one that should keep tech lawyers up at night, because if something like that gets applied to ML models the entire industry is in for a world of pain.
You're missing the fair use aspects. Check out this article on fair use [0].
> In 1994, the U.S. Supreme Court reviewed a case involving a rap group, 2 Live Crew, in the case Campbell v. Acuff-Rose Music, 510 U.S. 569 (1994)... It focused on one of the four fair use factors, the purpose and character of the use, and emphasized that the most important aspect of the fair use analysis was whether the purpose and character of the use was "transformative."
There are far more current precedents that apply here, and they do not trend in Github's favor -- as I noted previously, Williams v. Gaye (9th Cir. 2017) is going to be very interesting in this case. I am sure several people in Microsoft's legal department set parameters on the model training and that they felt that they were standing on solid ground, but I am also sure that there are a few associate professors in various law schools around the country who are salivating at the opportunity to take a run against this and make a name for themselves.
But possibly things that were spit out verbatim from the training set, which the FAQ mentions does happen about 0.1% of the time [1]. Another comment in this thread indicated that the model outputs something that's usable verbatim about 10% of the time. So, taking those two numbers together, if you're using a whole generated function verbatim, a bit of caveat emptor re: licensing might not be the worst idea. At least until the origin tracker mentioned in the FAQ becomes available.
[2] "GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set. Here is an in-depth study on the model’s behavior. Many of these cases happen when you don’t provide sufficient context (in particular, when editing an empty file), or when there is a common, perhaps even universal, solution to the problem. We are building an origin tracker to help detect the rare instances of code that is repeated from the training set, to help you make good real-time decisions about GitHub Copilot’s suggestions."
I think this would fall under any reasonable definition of fair use. If I read GPL (or proprietary) code as a human I still own code that I later write. If copyright was enforced on the outputs of machine learning models based on all content they were trained on it would be incredibly stifling to innovation. Requiring obtaining legal access to data for training but full ownership of output seems like a sensible middle ground.
Certainly not. If I memorize a line of copyrighted code and then write it down in a different project, I have copied it. If an ML model does the same thing as my brain - memorizing a line of code and writing it down elsewhere - it has also copied it. In neither case is that "fair use".
2) if I write a program that copies parts of other GPL-licensed SW into my proprietary code, does that absolve me of the GPL if the copying algorithm is complicated enough?
My interpretation of the GitHub TOS section D4 would give GitHub the right to parse your code and/or make incidental copies regardless of what your license states.
IMO the closest case is probably the students suing turnitin a number of years ago, which iParadigms (the turnitin maker) won [1].
I think this is definitely a gray area, and in some ways iParadigms winning (compared to all the cases decided in favour of e.g. the music industry) shows the different yardsticks being used for individuals and companies.
IANAL, but my interpretation of the GitHub TOS section D4 would give GitHub the right to parse your code and/or make copies regardless of what your license states. This is the same reason the GitHub search index isn't GPL contaminated.
Developers' human brains are also trained on proprietary code bases; when they quit and go elsewhere, they program using knowledge learned previously, yet you do not sue them.
We kinda have to accept that - we don't have to accept this. You can't interface with humans, but you can interface with one of the biggest corporate tech giants straight up leeching from explicitly public and free work for their own private benefit.
Great new feature! Such impact! Happy perf!