
If you're not walling off your content behind a login whose terms require users to agree not to scrape, then scraping that site is 100% legal. Robots.txt isn't a legal document.



I frequently respect the wishes of other people without any legal obligation to do so, in business, personal, and anonymous interactions.

I do try to avoid people who treat the law as a ceiling on the courtesy they extend to others, as they are consistently quite terrible people.


>Robots.txt isn't a legal document.

ROBOTS.TXT is an implied license, just like LICENSE.MD or LICENSE.TXT in any GitHub repo. There are decades of precedent that the ROBOTS.TXT file communicates what is and isn't allowed when scraping web content, and that you should check that file before scraping the rest of the site.

Willfully violating a written license provided in a predictable format absolutely is a civil legal violation. If your license says "You cannot use this to train AI", and an AI company scrapes it up and trains an AI on it anyway, even though you did your due diligence to communicate your terms, then you have a legal right to seek damages if you can prove that they are violating your license.

You're basically arguing that no reasonable web scraper would know about ROBOTS.TXT. That's bullshit. This method of web robot control has existed since 1994. It would be like violating the license terms of a GitHub project, and claiming that you didn't know that the LICENSE.MD / LICENSE.TXT file was a license you were expected to follow...
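
For what it's worth, checking the file first is only a few lines of work. Here's a rough sketch using Python's standard-library `urllib.robotparser`. The sample rules, the `ExampleAIBot` / `SomeOtherBot` user agents, the `example.com` URLs, and the `is_allowed` helper are all made up for illustration; a real crawler would fetch the site's actual /robots.txt instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SomeOtherBot"):
        for url in ("https://example.com/post/1", "https://example.com/private/x"):
            print(agent, url, is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

Against a live site you would normally use `set_url()` and `read()` on the parser rather than an inline string, and skip any URL for which `can_fetch()` returns False.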


If the industry doesn't self-regulate (i.e., by following conventional rules and basic human courtesy) ... then it will be regulated by laws.

So let me fix what you said for you:

> Robots.txt isn't a legal document, yet.



