It would be great if some of these datasets were free and opened up for public use. Otherwise it seems like you end up duplicating a lot of busywork just for multiple companies to farm more money. Maybe some of the European initiatives related to AI will end up including the creation of more open datasets.
Then again, maybe we're still operating from a framework where the dataset is part of your moat. It seems like such a way of thinking will severely limit the sources of innovation to just a few big labs.
Same reason hardware companies give open source contributions away for free: they're trying to commoditize their complement. I think the org best placed to get a strategic advantage from releasing high-quality datasets might be Nvidia.
If you vote for enough people of a certain party, yes; we can all see right now how quickly things can change in this country when those in power want change to happen.
> operating from a framework where the dataset is part of your moat
Very much this. It's the dataset that shapes the model; the model is a product of the dataset rather than the other way around (mind you, synthetic datasets are different...).
Right, and they pay a lot of money for this data. I know someone who does this, and one prompt evaluation can go through multiple rounds of review and end up generating $150+ in payouts, and that's just what the workers receive. But that's not quite what the article is talking about: each of these companies does things a bit differently.
> Maybe some of the European initiatives related to AI will end up including the creation of more open datasets.
The EU has started the process of opening discussions aiming to set the stage for opportunities to arise on facilitating talks looking forward to identify key strategies of initiating cooperation between member states that will enable vast and encompassing meetings generating avenues of reaching top level multi-lateral accords on passing legislation covering the process of processing processes while preparing for the moment when such processes will become processable in the process of processing such processes.
Don’t worry - the labs will train on this expert data and then everyone will just distill their models. Or the resulting model itself can now act as an expert annotator.
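Roughly, that second path looks like the sketch below: a strong "teacher" model annotates raw prompts, and its outputs become the training set for a smaller student. The model name and the fine-tuning step are placeholders I'm assuming for illustration, not anything from the article.

```python
# Minimal sketch of the "model as expert annotator" loop: a teacher model
# labels raw prompts, and those labels become training data for a student.
# The model choice ("gpt2") is just a stand-in for a capable teacher.
from transformers import pipeline

teacher = pipeline("text-generation", model="gpt2")  # placeholder teacher

raw_prompts = [
    "Explain why the sky is blue in one sentence.",
    "Summarize the rules of chess for a beginner.",
]

# The teacher "annotates" by generating reference answers.
synthetic_dataset = []
for prompt in raw_prompts:
    answer = teacher(prompt, max_new_tokens=64)[0]["generated_text"]
    synthetic_dataset.append({"prompt": prompt, "response": answer})

# A student model would then be fine-tuned on synthetic_dataset
# (e.g. with a standard SFT trainer), which is the distillation step.
print(synthetic_dataset[0])
```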
I think it would be difficult to make that work, because Wikipedia has a direct way of converting users into contributors: you see something wrong, you edit the article, it's not wrong anymore.
Whereas if you do the same with machine learning training data, the influence is much more indirect and you may have to add a lot of data to fix one particular case, which is not very motivating.