Is it reasonable to expect companies to redistribute 100TB of copyrighted content they used for their LLM, just on the off-chance someone has a few million lying around and wants to reproduce the model from scratch?
With LLMs, the list doesn't even have to be kept up to date, nor the links alive (though publishing content hashes would go a long way here; see the sketch below). It's not like you can get an identical copy of the model anyway; there's too much randomness at every stage of the process. But as long as the details of cleanup and training are also open, a list of the training material used would suffice: people could fetch parts of it, substitute other parts with equivalents that are open/unlicensed/available, add new sources of their own, and the resulting model should have similar characteristics to the OG one, which we could now call "open source".
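To make the "content hashes" idea concrete, here's a rough sketch of what such a manifest could look like: a hypothetical Python script that records a SHA-256 hash (plus original URL and size) per training file, so anyone rebuilding the dataset can confirm they fetched the same bytes even after the original link goes dead. The directory layout, file names, and URLs are just placeholders, not anyone's actual pipeline.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so huge dumps don't need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(data_dir: Path, source_urls: dict[str, str]) -> list[dict]:
    """One manifest entry per training file: name, original URL, content hash, size."""
    return [
        {
            "file": p.name,
            # The link may die; the hash still identifies the exact content.
            "url": source_urls.get(p.name, ""),
            "sha256": sha256_of_file(p),
            "bytes": p.stat().st_size,
        }
        for p in sorted(data_dir.glob("*"))
        if p.is_file()
    ]


def verify(path: Path, manifest_entry: dict) -> bool:
    """Check that a re-downloaded (or substituted) file matches the published hash."""
    return sha256_of_file(path) == manifest_entry["sha256"]


if __name__ == "__main__":
    # Hypothetical paths and URLs, just to show the shape of the output.
    manifest = build_manifest(
        Path("training_data"),
        {"common_crawl_shard_000.warc.gz": "https://example.org/cc/shard_000"},
    )
    Path("training_manifest.json").write_text(json.dumps(manifest, indent=2))
```

A manifest like this is a few megabytes even for a 100TB corpus, so it's cheap to publish alongside the cleanup and training code, and it lets replicators know exactly which parts they managed to reproduce and which they had to substitute.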