As someone who has to deal with a lot of bots, bot networks, and other weird scraper apps people use: the biggest issue is that most of these tools are not very well behaved. This tool is clearly designed to circumvent protections against scraping, mostly rate limits, that might be essential to keep things running.
They follow links that are explicitly marked as do-not-follow, they do not even try to limit their rate, they spoof their user agent strings, and so on. These bots cause real problems and cost real money, and I do not think that kind of misuse is ethical. In fact, using this tool to circumvent protections can turn your scraping into a DDoS attack, which I certainly do not consider ethical.
If your bot behaves itself, though, public information is public imo. Just don't take down websites, respect rate limits, and don't follow 'nofollow' links.
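For what it's worth, "behaving" is only a few lines of code. A rough Python sketch of the idea; the user agent string and the two-second delay here are placeholders I made up, not recommendations:

    import time
    import requests                # pip install requests
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    DELAY = 2.0  # seconds between requests; err on the slow side
    UA = "examplebot/1.0 (+mailto:ops@example.com)"  # honest, contactable UA

    session = requests.Session()
    session.headers["User-Agent"] = UA  # identify yourself, no browser spoofing

    def crawlable_links(url):
        time.sleep(DELAY)  # crude global rate limit
        soup = BeautifulSoup(session.get(url).text, "html.parser")
        for a in soup.find_all("a", href=True):
            if "nofollow" in (a.get("rel") or []):  # skip nofollow as a courtesy
                continue
            yield a["href"]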
To give an idea of the size of the issue: we have websites for customers that get maybe 5 hits per minute from actual users. Then _suddenly_ you go to 500 hits per minute for a couple of hours, because some bot is trying to scrape a calendar and is now looking for events in 1850 or whatever. (Not the greatest software design that these links are still there, tbh, but that is out of my control.)
Or another situation, not entirely related, but interesting, I think:
A few years back, for days on end, 80% of our total traffic came from random IPs across China, and the requests could be traced through HTTP referrers: the 'user' had apparently opened a page in one province, then traveled to the other side of China and clicked a link two hours later.
All of these things are relatively easy to mitigate, but that doesn't make them ethical.
Just so you know, rel="nofollow" was never intended to prevent bots from following those links. Even major crawlers like Bingbot will sometimes follow and index pages linked via "nofollow" links.
The only thing that rel="nofollow" does is tell search engines not to use that link in their PageRank computation.
If you do want to block well-behaved crawlers from crawling parts of your site, the proper way to do that is to use robots.txt rules.
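For example, with rules like the ones below, a compliant crawler will skip the calendar entirely. Here checked with Python's stdlib parser; the paths and rules are made up for illustration:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.parse([                       # contents of a hypothetical robots.txt
        "User-agent: *",
        "Crawl-delay: 10",
        "Disallow: /calendar/",
    ])

    print(rp.can_fetch("SomeBot", "https://example.com/calendar/1850/01/"))  # False
    print(rp.can_fetch("SomeBot", "https://example.com/events/"))            # True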
Exactly. We'd get customers whose sites would drown in bot traffic because the bots pulled the same shit: changing user agents, different IPs, etc. I had to build custom ModSecurity rules to block the patterns these bots would pull. What's funny is that some of the bots had a site where you could supposedly control the crawl rate, but it was just a placebo: they would keep crawling even if you requested them to stop.
The issue with bots hosted on AWS, or any cloud for that matter, is that as a web host you can't just block the IPs, because legitimate traffic comes from them too, in the form of CMS plugins, backups, etc.
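The upside is that this kind of bot is still detectable without blanket IP blocks, because the rotation itself is the signature. A toy sketch of the idea; the /24 grouping and the threshold of five UAs are arbitrary assumptions, not tuned values:

    import ipaddress
    from collections import defaultdict

    UA_THRESHOLD = 5  # many distinct UAs from one /24 smells like rotation

    def suspicious_subnets(records):
        """records: iterable of (client_ip, user_agent) pulled from your
        access log. IPv4 assumed for brevity."""
        uas_by_subnet = defaultdict(set)
        for ip, ua in records:
            net = ipaddress.ip_network(f"{ip}/24", strict=False)
            uas_by_subnet[net].add(ua)
        return [net for net, uas in uas_by_subnet.items()
                if len(uas) >= UA_THRESHOLD]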
"... because some bot is trying to scrape a calendar and is now looking for events in 1850 or whatever."
Been there, done that - at least on the side of fixing it. Anyone who implements a calendar: don't make previous and next links that let someone travel through time forever.
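Concretely, the fix is just clamping the navigation to a window in which events can actually exist. A sketch; the URL scheme and bounds are invented:

    from datetime import date

    MIN_YEAR = 2015                   # earliest year with real events
    MAX_YEAR = date.today().year + 2  # don't paginate into the far future

    def calendar_nav_links(year, month):
        """Emit prev/next links only inside the bounded window, so a
        crawler can never page its way back to 1850."""
        prev_y, prev_m = (year - 1, 12) if month == 1 else (year, month - 1)
        next_y, next_m = (year + 1, 1) if month == 12 else (year, month + 1)
        links = {}
        if prev_y >= MIN_YEAR:
            links["prev"] = f"/calendar/{prev_y}/{prev_m:02d}/"
        if next_y <= MAX_YEAR:
            links["next"] = f"/calendar/{next_y}/{next_m:02d}/"
        return links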
I've always wondered how much bot traffic costs us, but I've never actually tried to figure it out. It's a good portion of our traffic, even when we block a lot of it.
I think it depends on the industry. I've worked at a few ecommerce companies, and they all either bought data collected with scrapers or had a scraper team of their own. They also paid for bot protection on their website to stop other companies from scraping their data. I have no problem with this, as ultimately consumers get more competition and usually lower prices. That is assuming the scraping happens at a reasonable rate, not 500 requests a second.
What does concern me is the other uses that scraping tools have. For example, what's to stop me from writing a bot specifically to fuck with a competitor's analytics and A/B testing?
It doesn't really matter what's ethical or not, or what the site's wishes are - what matters is what the law says [1]. I don't want my neighbours smoking on the balcony below me, as I can smell their smoke, but the law doesn't allow my building to ban it without consensus. Alas...
> I don't want my neighbours smoking on the balcony below me, as I can smell their smoke, but the law doesn't allow my building to ban it without consensus.
That's simply because you're in multi-family housing. It would be a different story if the neighbors smoked and put up a fan to blow the smoke into your yard/window/etc., and that's probably a more apt comparison to bots that literally fork-bomb themselves to crawl your site at 500 requests per second.
Why can't we talk about ethics? There are a lot of things in this world that are legal, yet I choose not to do them because they are not ethical in my opinion.
We are allowed to talk about ethics separate from the law.