
Is it reasonable to expect companies to redistribute 100TB of copyrighted content they used for their LLM, just on the off-chance someone has a few million lying around and wants to reproduce the model from scratch?



Redistribute? No. Itemize and link to? Yes.

With LLMs, the list doesn't even have to be kept up to date, nor the links kept alive (though publishing content hashes would go a long way here). It's not like you can build an identical copy of a model anyway; there's too much randomness at every stage of the process. But as long as the details of cleanup and training are also open, a list of the training material used would suffice - people could fetch parts of it, substitute other parts with equivalents that are open/unlicensed/available, add new sources of their own, and the resulting model should have similar characteristics to the OG one, which we could now call "open source".
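
As a minimal sketch of what such a manifest could look like - SHA-256 content hashes plus size and source URL per shard. The shard names and URLs here are hypothetical, not from any actual release:

    import hashlib, json, os

    def sha256_of(path, chunk=1 << 20):
        # Stream the file so multi-GB shards don't need to fit in memory.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    def manifest_entry(path, source_url):
        # One record per shard: enough for someone to re-fetch it later,
        # or substitute an equivalent open source of the same content.
        return {
            "filename": os.path.basename(path),
            "size_bytes": os.path.getsize(path),
            "source": source_url,
            "sha256": sha256_of(path),
        }

    # Hypothetical shard list; a real release would enumerate the whole corpus.
    shards = [("shards/cc_000.jsonl.zst", "https://example.org/cc_000.jsonl.zst")]
    with open("training_manifest.json", "w") as out:
        json.dump([manifest_entry(p, u) for p, u in shards], out, indent=2)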


Perhaps that's not reasonable to expect, but Meta apparently kind of did it anyway, if not in a way that helps reproduce their LLM: https://arstechnica.com/tech-policy/2025/02/meta-torrented-o...


Actually, they did: the entire 15T tokens that were supposedly used to train the llama-3 base models are up on HF as a dataset: https://huggingface.co/datasets/HuggingFaceFW/fineweb

It's just not literally labelled as such, for obvious reasons.
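
A quick sketch of pulling from it with the datasets library - this assumes the published "sample-10BT" subset and the "text"/"url" columns, so check the dataset card for exact config and field names:

    from datasets import load_dataset

    # Stream instead of downloading the full multi-terabyte dump up front.
    fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                      split="train", streaming=True)

    for i, row in enumerate(fw):
        print(row["url"], row["text"][:80])
        if i == 4:
            break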


The RL-only (no SFT) approaches might remove that issue: the problem sets should be much smaller (and mechanically creatable) than the entire Western corpus.
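
As a toy illustration of "mechanically creatable": verifiable problems can be generated and scored without any scraped corpus at all. Arithmetic here just stands in for whatever domain has an exactly checkable answer:

    import random

    def make_problem(rng):
        # Generate a prompt together with a ground-truth answer we can verify exactly.
        a, b = rng.randint(2, 999), rng.randint(2, 999)
        return {"prompt": f"Compute {a} * {b}. Reply with the number only.",
                "answer": str(a * b)}

    def reward(completion, answer):
        # Binary verifiable reward for an RL loop: no human labels, no SFT data.
        return 1.0 if completion.strip() == answer else 0.0

    rng = random.Random(0)
    problems = [make_problem(rng) for _ in range(100_000)]  # cheap to scale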


Would a reference file with filename, size, source and checksum count towards the OSI definition?


For an open-source model claiming SOTA performance, we could at least check for data leakage (e.g. benchmark contamination) in its training data.
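
A rough sketch of one such check, assuming we have both the benchmark items and the training text in hand; verbatim n-gram overlap (13-grams are a common choice) is the usual crude heuristic:

    def ngrams(text, n=13):
        toks = text.lower().split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def contaminated(benchmark_items, training_docs, n=13):
        # Flag benchmark questions whose n-grams appear verbatim in training text.
        train_grams = set()
        for doc in training_docs:
            train_grams |= ngrams(doc, n)
        return [q for q in benchmark_items if ngrams(q, n) & train_grams]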



