> Like every time someone streams a song on Spotify, the artist gets a few pennies
You're thinking of broadcast radio. Streaming is a fraction of a penny!
> So should they be paid a few pennies every time the LLM spits out a response that "used" that training data?
That doesn't sound unreasonable! I understand that current LLMs have no way to report "this token came from this data," but that doesn't mean it's impossible to build. (Ack: this is full-on Dunning-Kruger; I haven't looked into it and have zero knowledge of the field.)
But I think it's probably more reasonable to simply split a % of the service revenue across everyone whose data was used. Or pay an upfront fee for ingesting the data in the first place.
Generally, it's crazy to me that some people think it's reasonable for these companies to pay for the GPUs and CPUs, and the electricity to run them, but not for the data that's the actual core of their service.
So should they be paid a few pennies every time the LLM spits out a response that "used" that training data?
And I am pretty sure it's not even possible to really link the output back to training data anyways.