Perhaps someone at Github can chime in, but I suspect that open source code datasets (the kind they are trained on) should require relatively permissive licenses in the first place. Perhaps they filter for MIT licenses in Github projects and StackOverflow answers used to train the models?