The problem is we have things like a Disallow rule in robots.txt, which is respected by most search engine (etc) crawlers, but completely ignored by the AI crawlers.
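For context, this is the check a polite crawler is supposed to make before fetching anything. A minimal sketch using Python's standard urllib.robotparser; the domain, path and bot name are invented for illustration, and the real disallowed location isn't reproduced here:

```
# A polite crawler is expected to check robots.txt before fetching a page.
# Domain, path and bot name below are made up for illustration.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# Search engine crawlers generally honour the answer here; the crawlers
# hammering the disallowed location never ask the question at all.
if rp.can_fetch("SomeBot/1.0", "https://example.com/some-disallowed-path"):
    print("allowed to fetch")
else:
    print("disallowed - a polite crawler stops here")
```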
By chance, this morning I found a legacy site was down, because in the last 8 hours it had taken 2 million hits (~70/s) to a location disallowed in robots.txt. These hits came from over 1.5 million different IP addresses, barely more than one request per IP on average, so the existing rate-limit-by-IP didn't catch it.
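To make the failure mode concrete, here is a toy per-IP counter of the kind the existing limit presumably resembles; the threshold is an assumption, but with traffic spread this thin no plausible per-IP threshold would ever trip:

```
# Toy per-IP counter; LIMIT is an assumed threshold. 2,000,000 hits spread
# over 1,500,000 IPs is ~1.3 requests per IP, so a per-IP rule never fires
# even while the aggregate ~70 req/s takes the site down.
from collections import Counter

LIMIT = 60                # assumed max requests per IP per window
hits_per_ip = Counter()

def allow(ip: str) -> bool:
    hits_per_ip[ip] += 1
    return hits_per_ip[ip] <= LIMIT

# Nearly every IP makes only 1-2 requests, so allow() returns True for
# essentially all of the 2 million hits.
```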
The User-Agent strings are a huge mixture of real-looking web browsers; the IPs look to come from residential, commercial and sometimes cloud ranges, so it's probably all hacked computers.
I can see how Cloudflare might have the data to block this better. They don't just see 1 or 2 requests from an IP; they presumably see a stream of them across many different sites. They could see many different user agents being used from the same IP, and other patterns, and assign it a reputation score.
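As a guess at what that could look like (this is just the idea, not Cloudflare's actual system), a sketch of scoring an IP by how many sites it hits and how many User-Agent strings it presents:

```
# Illustrative only: a single real browser shows one UA to a handful of
# sites; a botnet node rotating UAs across many sites scores high.
from collections import defaultdict

sites_by_ip = defaultdict(set)    # assumed cross-site feed a CDN would have
agents_by_ip = defaultdict(set)

def observe(ip: str, site: str, user_agent: str) -> None:
    sites_by_ip[ip].add(site)
    agents_by_ip[ip].add(user_agent)

def reputation_score(ip: str) -> float:
    # Higher = more suspicious; the weights are arbitrary for the sketch.
    return len(sites_by_ip[ip]) + 5 * max(0, len(agents_by_ip[ip]) - 1)

observe("203.0.113.7", "site-a.example", "Mozilla/5.0 (Windows NT 10.0)")
observe("203.0.113.7", "site-b.example", "Mozilla/5.0 (Macintosh)")
print(reputation_score("203.0.113.7"))   # 2 sites + a rotated UA -> 7
```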
I think we will need to add a proof-of-work thing in front of these pages and probably whitelist some 'good' bots (Wikipedia, Internet Archive etc). It's annoying, since this had been working fine in its current form for over 5 years.
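A rough sketch of what that proof-of-work gate could look like: the server issues a random challenge and only serves the protected page once the client returns a nonce whose hash has enough leading zero bits; whitelisted bots skip it. The difficulty, hash and framing here are all assumptions, not a specific product:

```
# Sketch of a proof-of-work gate. All parameters are illustrative.
import hashlib
import os

DIFFICULTY_BITS = 16      # assumed: ~65k hashes of work per page view

def make_challenge() -> bytes:
    return os.urandom(16)

def verify(challenge: bytes, nonce: int) -> bool:
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

def solve(challenge: bytes) -> int:
    # What the client-side script would do: trivial cost for one human page
    # view, ruinous at 70 requests per second.
    nonce = 0
    while not verify(challenge, nonce):
        nonce += 1
    return nonce

# Whitelisted crawlers (Wikipedia, Internet Archive, etc.) would be matched
# by user agent and published IP ranges and skip the challenge entirely.
challenge = make_challenge()
print(verify(challenge, solve(challenge)))   # True
```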