Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Companies already pay a lot of money for datasets to train models on in other spaces outside of software development. On top of that, they spend a lot of money on labelling and what not.

Software is unique in that there is a cultural trend to share source code, so that makes it easy to compile into "free" datasets.

I wouldn't say it's an unsolved problem, it's just that there are no incentives to compile or pay for datasets when Microsoft already has petabyes of code to train on. If anything, I expect Microsoft to sell datasets based on GitHub repositories if Copilot-like models survive this lawsuit and are conmoditized.



Not totally unique in that respect, the situation doesn't seem too dissimilar from the one that led shutterstock to launch their contributor fund.


Commoditized*




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: