
Doing basic copyright analysis on model outputs is all that is needed. Check whether the output contains copyrighted material, and block it if it does.

Transformers aren't zettabyte-sized archives with a smart search algorithm, crawling the web and stuffing everything they can into datacenter-sized storage. They are typically a few dozen GB in size, if that. They don't copy data; they move vectors in a high-dimensional space based on data.

Sometimes (note: sometimes) they can recreate copyrighted work, never perfectly, but closely enough to raise alarm and in a way a court would rule a copyright violation. Thankfully, we have a simple fix for this, developed over 30 years of people sharing content on the internet: automatic copyright filters.
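As a rough sketch of what such an output filter could look like: a toy n-gram overlap check against a corpus of protected texts. The corpus, the threshold `n=8`, and the function names here are all hypothetical, and a production filter would need far more robust matching.

```python
# Toy copyright filter: flag a model output if it shares a long enough
# run of consecutive words with any text in a protected corpus.
# All names and the n=8 threshold are illustrative assumptions.

def ngrams(text, n):
    # Break text into the set of all n-word sequences it contains.
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_infringing(output, protected_texts, n=8):
    # True if the output shares any n-word run with a protected text.
    out_grams = ngrams(output, n)
    return any(out_grams & ngrams(t, n) for t in protected_texts)

# Toy protected corpus (a single well-known lyric fragment).
protected = ["imagine there's no heaven it's easy if you try"]

looks_infringing("he wrote imagine there's no heaven it's easy if you try today", protected)
```

Real filters (e.g. the content-matching systems hosting platforms already run) use fuzzier similarity measures, since exact word runs are easy to evade with small edits.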



It's not even close to that simple. Nobody is really questioning whether the model contains the copyrighted information; we know that to be true in enough cases to bankrupt OpenAI. The question is what analogy the courts should use as a basis to determine whether it's infringement.

"It read many works but can't duplicate them exactly" sounds a lot like me, to be honest. I can give you a few memorable lines from a few songs, but can only really come close to reciting my favorites completely. LLMs are similar, except their favorites are the favorites of the training data: a line from a pop song quoted a billion times is likely reproducible; the lyrics to the next track on the album, not so much.

IMO, any infringement that might have happened would be in acquiring the data in the first place, but copyright law cares more about illegal reproduction than illegal acquisition.


You're correct, as long as you include the understanding that "reproduction" also encompasses "sufficiently similar derivative works."

Fair use provides exceptions for some such works, but not all, and it is possible for generative models to produce clearly infringing (on either a copyright or trademark basis) outputs both deliberately (IMO this is the responsibility of the user) and, much less commonly, inadvertently.

This is likely to be a problem even if you (reasonably) assume that the generative models themselves are not infringing derivative works.


No comment on whether output analysis is all that's needed, though it makes sense to me. Just wanted to note that the file-size argument may simply imply that transformers are a form of (either very lossy or very efficient) compression.
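The file-size point can be made concrete with back-of-envelope arithmetic. The corpus and model sizes below are assumptions picked for illustration, not reported figures for any actual model:

```python
# Hypothetical sizes, assumed for illustration only.
training_corpus_bytes = 10 * 1024**4   # ~10 TB of scraped text (assumption)
model_bytes = 50 * 1024**3             # ~50 GB of weights (assumption)

# If you view the weights as "storing" the corpus, the implied ratio
# is far beyond what lossless compressors achieve on text.
ratio = training_corpus_bytes / model_bytes
print(f"implied 'compression' ratio: {ratio:.0f}:1")
```

At ~200:1, such a ratio is only achievable by discarding almost everything, which is the "very lossy" branch of the argument: most of the corpus cannot be recovered, but frequently repeated passages can be.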


You can argue that any form of data is an arbitrarily lossy compression of any other form of data.

I get your point, but nobody is archiving their company's 50 years of R&D data with an LLM to get it down to 10 GB.

They may have traits of data compression, but they are not at all in the class of data compression software.


So scraped copyrighted content isn't needed for training? Guess I missed the AGI suddenly appearing that reasoned everything out all by itself.


Nothing builds a better strawman than a foundation started with "So".



