
This isn't the same situation at all. The copying of code doesn't seem to be for a limited or transformative purpose. Fair use might cover parody or commentary & criticism but not limitless replication.


They are not replicating the code at all. They are training a neural network. The neural network then learns from the code and synthesises new code.

It's no different from a human programmer reading code, learning from it, and using that experience to write new code. Somewhere in your head there is code that someone else wrote. And it's not infringing anybody's copyright for those memories to exist in your head.
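
To be concrete about what "training" means here: the model is optimized to predict the next token of code given the tokens before it; the files themselves aren't stored anywhere. A minimal sketch of that objective, assuming a toy PyTorch model and a stand-in corpus (nothing here is GitHub's actual setup):

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  # Stand-in corpus -- the real training set is millions of repos.
  corpus = b"def add(a, b):\n    return a + b\n"
  data = torch.tensor(list(corpus), dtype=torch.long)

  CTX = 8  # context window, in bytes

  class TinyLM(nn.Module):
      def __init__(self, vocab=256, dim=64):
          super().__init__()
          self.embed = nn.Embedding(vocab, dim)
          self.head = nn.Linear(dim * CTX, vocab)
      def forward(self, x):  # x: (batch, CTX) byte ids
          return self.head(self.embed(x).flatten(1))

  model = TinyLM()
  opt = torch.optim.Adam(model.parameters(), lr=1e-3)

  # Each training example: CTX bytes of context -> the next byte.
  xs = torch.stack([data[i:i + CTX] for i in range(len(data) - CTX)])
  ys = data[CTX:]

  for step in range(500):
      loss = F.cross_entropy(model(xs), ys)  # next-token objective
      opt.zero_grad()
      loss.backward()
      opt.step()

What the weights end up holding are statistics over the corpus, not the corpus itself - which is why I'd call it learning rather than replication.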


We can't yet equate ML systems with human beings. Maybe one day. But for now, it's probably better to compare this to a compiler being fed licensed code: the compiled output is still subject to the license, regardless of how fancy the compiler is.

Also, a human being who reproduces licensed code from memory - because they once read that code - would be committing a license violation. The line between a derivative work and an authentically new original creation is not a well-defined one. This is why we still have human arbiters of these decisions rather than formal definitions of them. This happens all the time in music, for example.


If avoiding copyright violations were as simple as saying "I remembered it", then I don't think things like clean-room reverse engineering would ever be legally necessary [1].

[1] https://en.wikipedia.org/wiki/Clean_room_design


It is replication - maybe not of a single piece of code, but creating a synthesis is still copying. For example, constructing a single piece of code out of three pieces of code from your co-workers is still replication.

Your argument would have some merit if something were created instead of merely assembled, but there is no new algorithm being created. That is not what is happening here.

On the one hand, you call this copying covered by fair use. On the other hand, you say this is creating new code. You can't have it both ways.


> Your argument would have some merit if something were created instead of merely assembled, but there is no new algorithm being created. That is not what is happening here.

If you're going to set such a high standard for ML tools like this, I think you need to justify why it shouldn't apply to humans too.

When a human programmer who has read copyrighted code at some point in their life writes new code that is not a "new algorithm", are they in violation of the copyrights of every piece of code they've ever read that was remotely similar in any respect to the new work?

I mean, I hope not!

> On the one hand, you call this copying covered by fair use. On the other hand, you say this is creating new code. You can't have it both ways.

I'm not a lawyer, but this actually sounds very close to the "transformative" criterion under fair use. Elements of existing code in the training set are being synthesized into new code for a new application.

I assume there's no off-the-shelf precedent for this, but given the similarity with how human programmers learn and apply knowledge, it doesn't seem crazy to think this might be ruled as legitimate fair use. I'd guess it would come down to how willing the ML system is to suggest snippets that are both verbatim and highly non-generic.


From https://docs.github.com/en/github/copilot/research-recitatio...: "Once, GitHub Copilot suggested starting an empty file with something it had even seen more than a whopping 700,000 different times during training -- that was the GNU General Public License."

On the same page is an image showing Copilot, in real time, adding the text of the famous Python poem, The Zen of Python. See https://docs.github.com/assets/images/help/copilot/resources... for a link directly to Copilot doing this.
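
For anyone who hasn't seen it: the poem ships with the Python interpreter itself, which gives a sense of how often it appears verbatim in public code - exactly the kind of text a model sees frequently enough to memorize word for word.

  # The Zen of Python ships with CPython; this single import
  # prints the poem verbatim.
  import this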

You are making arguments about what you read instead of objectively observing how Copilot operates. Just because GH wrote that Copilot synthesizes new code doesn't mean that it writes new code the way a human writes code. That is not what is happening here. It is replicating code. Even in the best case, Copilot is creating derivative works from code where GH is not the copyright owner.


> You are making arguments about what you read instead of objectively observing how Copilot operates.

Of course I am. We are both participating in a speculative discussion of how copyright law should handle ML code synthesis. I think this is really clear from the context, and it seems obvious to me that this product will not be able to move beyond the technical preview stage if it continues to make a habit of copying distinctive code and comments verbatim, so that scenario isn't really interesting to me. Github seems to agree (from the page on recitation that you linked):

> This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.

> But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.

> The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.

> This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.
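
GitHub doesn't spell out how that duplication search works, but a minimal sketch of one way it could: index overlapping k-grams ("shingles") of the training corpus, then flag any suggestion that shares one. All names below are hypothetical - this is not GitHub's implementation:

  from collections import defaultdict

  K = 8  # shingle length in tokens; a real system would tune this

  def shingles(text, k=K):
      toks = text.split()
      return {" ".join(toks[i:i + k]) for i in range(len(toks) - k + 1)}

  def build_index(training_files):  # {path: source text}
      index = defaultdict(set)
      for path, src in training_files.items():
          for sh in shingles(src):
              index[sh].add(path)
      return index

  def quoted_from(suggestion, index):
      """Return the training files a suggestion overlaps with, if any."""
      hits = set()
      for sh in shingles(suggestion):
          hits |= index.get(sh, set())
      return hits

A production version would hash the shingles and normalize whitespace and identifiers, but the point is that flagging verbatim overlap is tractable.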

The arguments you've made here would seem to apply equally well to a version of Copilot hardened against "recitation", hence my reply.

> Even in the best case, Copilot is creating derivative works from code where GH is not the copyright owner.

It would be convenient for your argument(s) if it were settled legal fact that ML-synthesized code is a derivative work, but that seems far from obvious to me (in fact, I would disagree), and you haven't articulated a real argument to that effect yourself. It has also definitely not been decided by any legal entity capable of establishing precedent.

And, again, if this is what you believe then I'm not sure how the work of human programmers is supposed to be any different in the eyes of copyright law.


> Of course I am. We are both participating in a speculative discussion of how copyright law should handle ML code synthesis. I think this is really clear from the context, and it seems obvious to me that this product will not be able to move beyond the technical preview stage if it continues to make a habit of copying distinctive code and comments verbatim, so that scenario isn't really interesting to me. Github seems to agree (from the page on recitation that you linked):

No. We both aren't. I am discussing how Copilot operates, from the perspective of a user concerned about legal ramifications. I backed that concern up with specific factual quotes and animated images from GitHub, where GitHub unequivocally demonstrated how Copilot copies code. You are speculating about how copyright law should handle ML code synthesis.


> No. We both aren't

You say I'm not ... but then you say, explicitly in so many words, that I am:

> You are speculating about how copyright law should handle ML code synthesis.

I don't get it. Am I, or aren't I? Which is it? I mean, not that you get to tell me what I am talking about, but it seems like something we should get cleared up.

edit: Maybe you mean I am, and you aren't?

Beyond that, I skimmed the GitHub link, and my takeaway was that this is a small problem (statistically, in terms of occurrence rate) that they have concrete approaches to fixing before full launch. I never disputed that "recitation" is currently an issue, but honestly that link seems to back up my position more than it does yours (to the extent that yours is coherent, which, as above, I would dispute).


> They are not replicating the code at all.

Now that five days have passed, there have been a number of examples of Copilot doing just that: replicating code. Quake source code that even included the comments, the famous Python poem, etc. There are many examples of code being replicated - not synthesized, but duplicated byte for byte from the original.


Surely that depends on the size of the training set?

I could feed the Linux kernel one function at a time into an ML model, then coerce its output to be exactly the same as the input.

This is obviously copyright infringement.

Whereas in the GitHub case, where they've trained it on millions of projects, maybe it isn't?

Does the training-set size become relevant legally?
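
To make the single-function case concrete, here's a toy next-token model - a lookup table rather than a real neural network, and the "kernel function" is a made-up stand-in - showing the memorization end of the spectrum: trained on one function, greedy decoding can only emit that function verbatim.

  from collections import Counter, defaultdict

  CTX = 8  # context length in characters

  def train(files):
      model = defaultdict(Counter)
      for src in files:
          for i in range(len(src) - CTX):
              model[src[i:i + CTX]][src[i + CTX]] += 1
      return model

  def generate(model, seed, length):
      out = seed
      for _ in range(length):
          dist = model.get(out[-CTX:])
          if not dist:
              break
          out += dist.most_common(1)[0][0]  # greedy decoding
      return out

  one_file = ["static int example_open(int dfd, const char *name) { return 0; }"]
  m = train(one_file)
  print(generate(m, one_file[0][:CTX], 200))  # emits the whole function verbatim

Train the same table on thousands of files and the contexts start to collide, so the output becomes a blend - which is exactly why I'm asking whether training-set size matters legally.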



