
I found it interesting that in the arxiv paper you linked they are talking about an attack, ethics and responsible disclosure.

But when it comes to scraping the entirety of the internet to train such models, that's never referred to as an attack.



Scraping the whole web isn't considered an attack because, well, that's just how search engines work. That being said, there are all sorts of norms (e.g. robots.txt) qualifying what kinds of scraping are accepted.
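To make the robots.txt norm concrete, here is a minimal sketch of how a well-behaved crawler might check it, using Python's standard `urllib.robotparser`. The robots.txt content, the bot names (`ExampleAIBot`, `SomeSearchBot`), and the `allowed` helper are all made up for illustration; this shows the convention, not how any particular crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example: it blocks one
# (fictional) AI crawler entirely and keeps /private/ off-limits to everyone.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SomeSearchBot"):
        for path in ("/article", "/private/notes"):
            url = "https://example.com" + path
            print(agent, path, allowed(agent, url))
```

Note that nothing enforces this: the file only states the site's wishes, and it is up to the crawler to honor them.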

As far as I can tell, AI researchers assumed they could just piggyback on top of those norms to get access to large amounts of training data. The problem is that it's difficult to call copying an attack unless you go full MAFIAA[0] brain and argue that monopoly rents on creative works are the only functional backstop to the 1st Amendment. Hell, even if you do, the EU and Japan[1] both have statutory copyright exceptions explicitly legalizing AI training on other people's text. It's not even accepted dogma among copyright holders that this is an attack.

[0] Music And Film Industry Association of America, a fictional industry association purported to be the merger of the MPAA and RIAA announced on April 1st, 2006: http://mafiaa.org/

[1] Yes, the same country whose copyright laws infamously have no Fair Use equivalent. In Japan, it is illegal to review or parody a copyrighted work without a license, but it is legal to train an AI on it.


Alternatively, you can just believe in standard copyright law, which says you need a license to distribute most content. Most file-sharing cases were decided in favor of that.

The AI companies have been bundling and distributing copyrighted works for pretraining. They engage in illegal activity just to make the AIs. That's before considering them generating the training data or derivative works. So there's a lot of risk, which they're just ignoring for money.


I don't want to have copyright law as it currently exists. It is a badly-negotiated bargain. The public gets very little out of it, the artists get very little protection out of it, and the only people who win are intermediaries and fraudsters.

Keep in mind, this is the same copyright that gave us Prenda Law, an extortion scheme that bilked millions of dollars in bullshit settlements. Prenda Law would create shell companies that created porn, post it on BitTorrent, then have the shell companies sue anyone who downloaded it. Prenda Law would even post all their ongoing litigation on their website with the express purpose of making sure everyone Googling for your name saw the porn, just to embarrass you into settling faster.

This scheme was remarkably profitable, and only stopped being profitable because Prenda slipped up and revealed the fraud[0]. Still, the amount of fraud you have to commit is minuscule compared to the settlements you can extract out of people for doing this, and there's been no legal reform to try and cut off these sorts of extortion suits. Prenda isn't even the only entity that tried this; Strike 3 Holdings did the same thing.

[0] If you upload your own content to BitTorrent, the defense could argue that this is an implied license. Prenda's shell companies would lie about having uploaded the content themselves.



