heavyset_go on Nov 6, 2022 | on: Microsoft sued for open-source piracy through GitH...

Companies already pay a lot of money for datasets to train models on in other spaces outside of software development. On top of that, they spend a lot of money on labelling and whatnot. Software is unique in that there is a cultural trend to share source code, which makes it easy to compile into "free" datasets. I wouldn't say it's an unsolved problem; it's just that there are no incentives to compile or pay for datasets when Microsoft already has petabytes of code to train on. If anything, I expect Microsoft to sell datasets based on GitHub repositories if Copilot-like models survive this lawsuit and are conmoditized.

cbzbc on Nov 6, 2022

Not totally unique in that respect; the situation doesn't seem too dissimilar from the one that led Shutterstock to launch its contributor fund.

heavyset_go on Nov 6, 2022

Commoditized*
