Companies already pay a lot of money for datasets to train models on in fields outside of software development. On top of that, they spend a lot of money on labelling and whatnot.
Software is unique in that there is a cultural trend to share source code, which makes it easy to compile into "free" datasets.
I wouldn't say it's an unsolved problem; it's just that there is no incentive to compile or pay for datasets when Microsoft already has petabytes of code to train on. If anything, I expect Microsoft to sell datasets based on GitHub repositories if Copilot-like models survive this lawsuit and are commoditized.