In the history of media law, I’ve seen judges lean into whatever interpretation balances the ecosystem more than what is “literally the law”. The law is meant to serve people, not the other way around. I hope judges will understand that the contribution and the theft can’t just be met with “haha fuck humanity, love, OpenAI”.
I want to train my own LLM on public but copyrighted data. I think this is serving humanity (and fucking OpenAI). I also think it is ethical because there's a big difference between "learning from" and "copying".
Your proposed reading of the law means only big tech will be able to afford the license fees to train on large amounts of data.
How do YOU plan on compensating those whose labor helped you? I bet you don’t. It's the same thing; you are just imagining that being David rather than Goliath makes it OK for you.
It's not always necessary to compensate those whose labor helped you. I haven't compensated many of the open source projects I use, for example, even those that clearly want me to (with nagging pop-ups). If the use of copyrightable material to train a model is legal, and it does not legally require compensation, it might be difficult to argue that the use of such material should be compensated or else. It would depend, IMO, on whether there are norms in place for this kind of thing, and I don't necessarily see wide agreement.
OK, what about the open source and research models? I wouldn’t wager much on OpenAI keeping a lead indefinitely. Certainly not to establish case law on what’s a pretty new technology (at least in its current use).
Yes, laws are about politics and dispute resolution more than reasoning or correctness. Focusing on the pure logic is a trap for the computationally inclined.