Hacker News

GitHub says you don't need to credit GitHub for any of the code suggestions, but since Copilot is trained on public sources of code, does anyone have a clue about potential licensing pitfalls?


From the FAQ at the bottom of the project showcase page[0]:

"GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set. Here is an in-depth study[1] on the model’s behavior. Many of these cases happen when you don’t provide sufficient context (in particular, when editing an empty file), or when there is a common, perhaps even universal, solution to the problem. We are building an origin tracker to help detect the rare instances of code that is repeated from the training set, to help you make good real-time decisions about GitHub Copilot’s suggestions."

[0] https://copilot.github.com/

[1] https://github.co/copilot-research-recitation


(not a lawyer) Copyright issues tend to turn on how transformative a work is, which means the code coming out the other end is probably fine. I don't know about the training side, though. Are there license issues in using copyrighted training data without any form of licensing? ML researchers typically have a pretty free-for-all attitude: 'if I can find data, I can train models on it.'


(not a lawyer) My interpretation of GPLv2 is that, at the very least, the model itself would have to be licensed under the GPL if it was trained on GPL code, because the model is a derivative work. Whether the generated code coming out of the model is also GPL is trickier. I would lean towards yes, but I'm not entirely sure.

I think talking about exact text matches to existing code is a red herring. If you took GPL code and ran it through an obfuscator that changed every byte of it, the resulting code would still be a derivative work and would need to be licensed under the GPL too.

Thank you Microsoft for ushering in a new era of free software.



