My guess would be that the model itself (and the training process) could have different legal requirements compared to the code it generates. The code generated by the model is probably a sufficiently transformative new work that it wouldn't be GPL (it's "fair use").
I suspect there could be issues on the training side, using copyrighted data for training without any form of licensing. Typically ML researchers have a pretty free-for-all attitude towards 'if I can find data, I can train models on it.'
No, the code generated is what copyright law calls a derivative work, and you should go ask Robin Thicke and Pharrell Williams exactly how much slack the courts give for a 'sufficiently transformative new work.'
My bet is that copyright law has not caught up with massive machine learning models that partially encode their training data, and that the cases that will set legal precedent for machine learning models are still to come.
Note also that it's not just a concern for copyright, but also privacy. If the training data is private, but the model can "recite" (reproduce) some of the input given an appropriate query, then it's a matter of finding the right adversarial inputs to reconstruct some training data. There are many papers on this topic.
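The "recite" risk is easy to see with a toy sketch. Here an n-gram completion table stands in for a real language model (real extraction attacks are far more involved); the "private" training string and the attacker's prompt are made-up examples:

```python
# Toy illustration of training-data extraction via memorization.
# The "model" is just an n-gram table built from a private corpus --
# a stand-in for a real language model that has memorized training text.

from collections import defaultdict

def train(corpus, n=3):
    """Build a simple n-gram completion table from the training text."""
    table = defaultdict(list)
    words = corpus.split()
    for i in range(len(words) - n):
        prefix = tuple(words[i:i + n])
        table[prefix].append(words[i + n])
    return table

def recite(table, prompt, length=8, n=3):
    """Greedily complete a prompt; memorized text falls straight out."""
    words = prompt.split()
    for _ in range(length):
        prefix = tuple(words[-n:])
        if prefix not in table:
            break
        words.append(table[prefix][0])  # pick the memorized continuation
    return " ".join(words)

# "Private" training data the model was never supposed to reveal.
secret = "the api key for the billing service is hunter2 rotate it monthly"
model = train(secret)

# An attacker who can guess a plausible prefix recovers the rest verbatim.
print(recite(model, "the api key for"))
# -> the api key for the billing service is hunter2 rotate it monthly
```

A real model doesn't store the text this literally, but the attack surface is the same: probe with likely prefixes, and heavily memorized sequences come back verbatim.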
It is almost certainly the case that current IP law is very unsettled when it comes to machine learning models, and to mechanisms that encode a particular training set into either the model's output or its input transformation. What should probably scare the shit out of people looking to commercialize this sort of ML is that the most readily available precedents for the courts to look at are from the music industry, and some of the outcomes have truly been wacky IMHO. The 'Blurred Lines' case is the one that should keep tech lawyers up at night, because if something like that gets applied to ML models, the entire industry is in for a world of pain.
You're missing the fair use aspects. Check out this article on fair use [0].
> In 1994, the U.S. Supreme Court reviewed a case involving a rap group, 2 Live Crew, in the case Campbell v. Acuff-Rose Music, 510 U.S. 569 (1994)... It focused on one of the four fair use factors, the purpose and character of the use, and emphasized that the most important aspect of the fair use analysis was whether the purpose and character of the use was "transformative."
There are far more current precedents that apply here, and they do not trend in GitHub's favor -- as I noted previously, Williams v. Gaye (9th Cir. 2017) is going to be very interesting in this case. I am sure several people in Microsoft's legal department set parameters on the model training and felt they were standing on solid ground, but I am also sure that there are a few associate professors in various law schools around the country who are salivating at the opportunity to take a run at this and make a name for themselves.