Whereas, my report showed they were breaking copyright *before* the training process. Meta was sued for what I said they'd be sued for, too.

Like Napster et al., their data sets make copies of hundreds of GB of copyrighted works without the authors' permission. Ex: The Pile, Common Crawl, RefinedWeb, GitHub Pages. Many copyrighted works on the Internet also have strict terms of use. Some have copyright licenses that allow only personal or non-commercial use.

So, as in many prior cases, just posting what isn't yours on Hugging Face is already infringement. Copying it from HF to your training cluster is also infringement. It's already illegal until we get laws like Singapore's that allow this use of copyrighted works. Even those have a weakness in the access requirement, which might require following the terms of use or licenses of the sources.

The only safe routes are public domain works, permissively licensed code, and explicit licenses from copyright holders (or those with sub-license permissions).

So, what do you think about the argument that making copies of copyrighted works violates copyright law? That these data sets are themselves copyright violations?