It would be great if some of these datasets were free and opened up for public use. Otherwise it seems like you end up duplicating a lot of busywork just for multiple companies to farm more money. Maybe some of the European initiatives related to AI will end up including the creation of more open datasets.
Then again, maybe we're still operating from a framework where the dataset is part of your moat. It seems like such a way of thinking will severely limit the sources of innovation to just a few big labs.
Same reason hardware companies give open source contributions away for free: they're trying to commoditize their complement. I think the org best placed to get a strategic advantage from releasing high-quality datasets might be Nvidia.
If you vote for enough people of a certain party, yes; we can all see right now how quickly things can change in this country when those in power want change to happen.
> operating from a framework where the dataset is part of your moat
Very much this. It's the dataset that shapes the model; the model is a product of the dataset rather than the other way around (mind you, synthetic datasets are different...).
Right, and they pay a lot of money for this data. I know someone who does this, and one prompt evaluation can go through multiple rounds of review and end up generating $150+ in payouts, and that's just what the workers receive. But that's not quite what the article is talking about: each of these companies does things a bit differently.
> Maybe some of the European initiatives related to AI will end up including the creation of more open datasets.
The EU has started the process of opening discussions aiming to set the stage for opportunities to arise on facilitating talks looking forward to identify key strategies of initiating cooperation between member states that will enable vast and encompassing meetings generating avenues of reaching top level multi-lateral accords on passing legislation covering the process of processing processes while preparing for the moment when such processes will become processable in the process of processing such processes.
Don’t worry - the labs will train on this expert data and then everyone will just distill their models. Or the resulting model itself can now act as an expert annotator.
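Roughly, that second path looks like the sketch below: a strong "teacher" model annotates raw prompts, and its outputs become the training set for a smaller student. The model name and the fine-tuning step are placeholders I'm assuming for illustration, not anything from the article.

```python
# Minimal sketch of the "model as expert annotator" loop: a teacher model
# labels raw prompts, and those labels become training data for a student.
# The model choice ("gpt2") is just a stand-in for a capable teacher.
from transformers import pipeline

teacher = pipeline("text-generation", model="gpt2")  # placeholder teacher

raw_prompts = [
    "Explain why the sky is blue in one sentence.",
    "Summarize the rules of chess for a beginner.",
]

# The teacher "annotates" by generating reference answers.
synthetic_dataset = []
for prompt in raw_prompts:
    answer = teacher(prompt, max_new_tokens=64)[0]["generated_text"]
    synthetic_dataset.append({"prompt": prompt, "response": answer})

# A student model would then be fine-tuned on synthetic_dataset
# (e.g. with a standard SFT trainer), which is the distillation step.
print(synthetic_dataset[0])
```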
I think it would be difficult to make that work, because Wikipedia has a direct way of converting users into contributors: you see something wrong, you edit the article, it's not wrong anymore.
Whereas if you do the same with machine learning training data, the influence is much more indirect and you may have to add a lot of data to fix one particular case, which is not very motivating.