> Why was GitHub Copilot trained on data from publicly available sources?
> Training machine learning models on publicly available data is now common practice across the machine learning community. The models gain insight and accuracy from the public collective intelligence. But this is a new space, and we are keen to engage in a discussion with developers on these topics and lead the industry in setting appropriate standards for training AI models.
Personally, I'd prefer this to be like any other software license. If you want to use my IP for training, you need a license. If I use the MIT license or something else that lets you use my code however you want, then have at it. If I don't, then you can't just use it because it's public.
Then you'd see a lot more open models. Like a GPL model whose code and weights must be shared because the bulk of the easily accessible training data says it has to be open, or something like that.
I realize, however, that I'm in the minority of the ML community feeling this way, and that it certainly is standard practice to just use data wherever you can get it.
When I referenced their contention on fair use, I wasn't referring to that, but rather to GitHub CEO Nat Friedman's comment in this thread that "In general: (1) training ML systems on public data is fair use".