The crawler companies just do not give a shit. They run these crawl jobs because they can: the methodology is worthless and the data will be worthless, but they have so much compute relative to developer time that it costs them more to figure out which crawls are worthless and which aren't than it does to just run the crawl, throw away the worthless data at the end, and crawl again. Meanwhile they're running the internet's most widespread DDoS (which is against the CFAA, btw, so if they caused you actual damages, try suing them). I don't personally take issue with web crawling as a concept (how else would search engines work? oh right, they don't work any more anyway), but the implementation is obviously a failure.
---
I've noticed one crawling my Gitea instance for the last few months - fetching every combination of https://server/commithash/filepath. My server isn't overloaded by this. It did fill up the disk by generating every possible snapshot, but I count that as a bug in Gitea, not an attack by the crawler. Still, the situation is very dumb, so for the last few days I've had my reverse proxy serve a variant of the Wikipedia home page on every AI crawler request. The variant has several sections replaced with nonsense, some AI-generated and some not. You can see it here: https://git.immibis.com/gptblock.html
I just checked: they're still crawling, and they've gone 3 layers deep into the page's image tags. Since every URL returns that page if you have the wrong user-agent, the image URLs return it too, but they're relative paths, so I can tell how many layers deep the crawler is going.
Interestingly, if you ask ChatGPT to evaluate this page (GPT interactive page fetches are not blocked) it says it's a fake Wikipedia. You'd think they could use their own technology to evaluate pages.
---
nginx rules for your convenience - be prepared to adjust the filters according to the actual traffic you see in your logs:
location = /gptblock.html {root /var/www/html;}
if ($http_user_agent ~* "https://openai.com/gptbot") {rewrite ^.*$ /gptblock.html last; break;}
if ($http_user_agent ~* "claudebot@anthropic.com") {rewrite ^.*$ /gptblock.html last; break;}
if ($http_user_agent ~* "https://developer.amazon.com/support/amazonbot") {rewrite ^.*$ /gptblock.html last; break;}
if ($http_user_agent ~* "GoogleOther") {rewrite ^.*$ /gptblock.html last; break;}*