
Is it reasonable to expect companies to redistribute 100TB of copyrighted content they used for their LLM, just on the off-chance someone has a few million lying around and wants to reproduce the model from scratch?



Redistribute? No. Itemize and link to? Yes.

With LLMs, the list doesn't even have to be kept up to date, nor the links kept alive (though publishing content hashes would go a long way here). It's not like you can build an identical copy of a model anyway; there's too much randomness at every stage of the process. But as long as the details of cleanup and training are also open, a list of the training material used would suffice - people could fetch parts of it, substitute other parts with equivalents that are open/unlicensed/available, add new sources of their own, and the resulting model should have similar characteristics to the OG one, which we could now call "open source".
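
As a minimal sketch of what such a manifest could look like - SHA-256 content hashes plus size and source URL per shard. The shard names and URLs here are hypothetical, not from any actual release:

    import hashlib, json, os

    def sha256_of(path, chunk=1 << 20):
        # Stream the file so multi-GB shards don't need to fit in memory.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    def manifest_entry(path, source_url):
        # One record per shard: enough for someone to re-fetch it later,
        # or substitute an equivalent open source of the same content.
        return {
            "filename": os.path.basename(path),
            "size_bytes": os.path.getsize(path),
            "source": source_url,
            "sha256": sha256_of(path),
        }

    # Hypothetical shard list; a real release would enumerate the whole corpus.
    shards = [("shards/cc_000.jsonl.zst", "https://example.org/cc_000.jsonl.zst")]
    with open("training_manifest.json", "w") as out:
        json.dump([manifest_entry(p, u) for p, u in shards], out, indent=2)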


Perhaps that's not reasonable to expect, but Meta apparently kind of did it anyway, if not in a way that helps reproduce their LLM: https://arstechnica.com/tech-policy/2025/02/meta-torrented-o...


Actually, they did: the entire 15T tokens that were supposedly used to train the llama-3 base models are up on HF as a dataset: https://huggingface.co/datasets/HuggingFaceFW/fineweb

It's just not literally labelled as such, for obvious reasons.
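
A quick sketch of pulling from it with the datasets library - this assumes the published "sample-10BT" subset and the "text"/"url" columns, so check the dataset card for exact config and field names:

    from datasets import load_dataset

    # Stream instead of downloading the full multi-terabyte dump up front.
    fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                      split="train", streaming=True)

    for i, row in enumerate(fw):
        print(row["url"], row["text"][:80])
        if i == 4:
            break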


The RL-only (no SFT) approaches might remove that issue: the problem sets should be much smaller (and mechanically creatable) than the entire Western corpus.
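
As a toy illustration of "mechanically creatable": verifiable problems can be generated and scored without any scraped corpus at all. Arithmetic here just stands in for whatever domain has an exactly checkable answer:

    import random

    def make_problem(rng):
        # Generate a prompt together with a ground-truth answer we can verify exactly.
        a, b = rng.randint(2, 999), rng.randint(2, 999)
        return {"prompt": f"Compute {a} * {b}. Reply with the number only.",
                "answer": str(a * b)}

    def reward(completion, answer):
        # Binary verifiable reward for an RL loop: no human labels, no SFT data.
        return 1.0 if completion.strip() == answer else 0.0

    rng = random.Random(0)
    problems = [make_problem(rng) for _ in range(100_000)]  # cheap to scale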


Would a reference file with filename, size, source and checksum count towards the OSI definition?


For an open-source model claiming SOTA performance, we could at least check for data leakage (e.g. benchmark contamination) in its training data.
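
A rough sketch of one such check, assuming we have both the benchmark items and the training text in hand; verbatim n-gram overlap (13-grams are a common choice) is the usual crude heuristic:

    def ngrams(text, n=13):
        toks = text.lower().split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def contaminated(benchmark_items, training_docs, n=13):
        # Flag benchmark questions whose n-grams appear verbatim in training text.
        train_grams = set()
        for doc in training_docs:
            train_grams |= ngrams(doc, n)
        return [q for q in benchmark_items if ngrams(q, n) & train_grams]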



