They allow a major component of the model, the data, to be withheld.

pabs3 · 2025-04-26T02:09:47 1745633387

Not only withheld, but also completely proprietary: neither modifiable nor redistributable.

nofriend · 2025-04-26T02:12:25 1745633545

Nobody owns their data. They just scrape the internet or pirate massive troves of books. Just forcing companies to get a license for all the data they use, let alone an open license, would be a massive impediment to the development of open models.

pabs3 · 2025-04-26T02:23:51 1745634231

It is definitely doable to get openly licensed data; you just have to do it via voluntary participation in crowdsourced data acquisition programs. For example, the RNNoise model was retrained on such crowdsourced data.

tedivm · 2025-04-26T13:12:11 1745673131

IBM did it with their Granite models.

pabs3 · 2025-04-26T15:16:59 1745680619

The data used for training Granite doesn't sound like it would be under FOSS licenses.

https://en.wikipedia.org/wiki/IBM_Granite