
The main data for those pages is in a column store, so it can sustain at least several thousand requests per second.

The problem is we have things like

  Disallow: /the-search-page
  Disallow: /some-statistics-pages
in robots.txt, which is respected by most search-engine and similar crawlers, but completely ignored by the AI crawlers.
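
For reference, honouring those rules is trivial for a crawler that wants to; Python's standard urllib.robotparser does it in a few lines. (example.com and the user-agent string here are placeholders, and this assumes the rules above sit under "User-agent: *".)

  # What a well-behaved crawler does before fetching: read robots.txt
  # and honour the Disallow rules.
  from urllib import robotparser

  rp = robotparser.RobotFileParser()
  rp.set_url("https://example.com/robots.txt")
  rp.read()

  # A compliant crawler simply skips anything that is disallowed for it.
  rp.can_fetch("MyCrawler/1.0", "https://example.com/the-search-page")  # -> False
  rp.can_fetch("MyCrawler/1.0", "https://example.com/some-article")     # -> True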

By chance, this morning I found a legacy site was down because, in the last 8 hours, it had taken 2 million hits (about 70/s) to a location disallowed in robots.txt. These hits came from over 1.5 million different IP addresses, so the existing rate-limit-by-IP didn't catch it.
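
To illustrate why the per-IP limit never fires: with ~1.5 million source addresses, each IP only needs to send a request or two. A toy sliding-window limiter (the window and threshold are made-up numbers, not our real config) never gets anywhere near its limit:

  import time
  from collections import defaultdict, deque

  WINDOW_SECONDS = 60
  MAX_REQUESTS_PER_IP = 30   # illustrative threshold

  hits = defaultdict(deque)

  def allow(ip, now=None):
      now = time.time() if now is None else now
      q = hits[ip]
      while q and now - q[0] > WINDOW_SECONDS:
          q.popleft()
      if len(q) >= MAX_REQUESTS_PER_IP:
          return False
      q.append(now)
      return True

  # 2 million requests spread across 1.5 million IPs is ~1-2 requests per IP,
  # so every request stays far below the threshold and nothing is ever blocked.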

The User-Agents are a huge mixture of real-looking web browsers; the IPs appear to come from residential, commercial and sometimes cloud ranges, so it's probably all hacked computers.

I can see how Cloudflare might have the data to block this better. They don't just see 1 or 2 requests from an IP; they presumably see a stream of them across many different sites. They could spot many different user agents being used from the same IP, along with other patterns, and assign it a reputation score.
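
I have no idea how Cloudflare actually computes it, but the kind of signal I mean could be as simple as counting distinct user agents and distinct sites per source IP; something like this (all names and weights are invented):

  from collections import defaultdict

  class IpReputation:
      def __init__(self):
          self.user_agents = defaultdict(set)  # ip -> UA strings seen
          self.sites = defaultdict(set)        # ip -> hostnames hit

      def observe(self, ip, user_agent, host):
          self.user_agents[ip].add(user_agent)
          self.sites[ip].add(host)

      def score(self, ip):
          # One IP rotating through many user agents while hitting many
          # unrelated sites looks a lot like botnet traffic.
          return 2.0 * len(self.user_agents[ip]) + 0.5 * len(self.sites[ip])

  rep = IpReputation()
  rep.observe("203.0.113.7", "Mozilla/5.0 (Windows NT 10.0)", "site-a.example")
  rep.observe("203.0.113.7", "Mozilla/5.0 (Macintosh)", "site-b.example")
  suspicious = rep.score("203.0.113.7") > 5.0  # threshold is illustrative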

I think we will need to put a proof-of-work challenge in front of these pages and probably whitelist some 'good' bots (Wikipedia, Internet Archive, etc.). It's annoying, since this had been working fine in its current form for over 5 years.
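
Roughly what I have in mind is a hashcash-style gate: the server hands out a challenge, the client has to find a nonce whose hash meets a difficulty target before the page is served, and allowlisted bots skip it. A rough sketch; the difficulty, the allowlist entries and all the names are placeholders:

  import hashlib, os

  DIFFICULTY_BITS = 20                                    # ~1M hashes on average
  GOOD_BOT_SUBSTRINGS = ("Wikipedia", "archive.org_bot")  # illustrative

  def make_challenge():
      return os.urandom(16).hex()

  def verify(challenge, nonce):
      digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
      # The leading DIFFICULTY_BITS of the hash must be zero.
      return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

  def solve(challenge):
      # What the client-side JS would do (written in Python for brevity).
      nonce = 0
      while not verify(challenge, str(nonce)):
          nonce += 1
      return str(nonce)

  def should_serve(user_agent, challenge, nonce):
      if any(s in user_agent for s in GOOD_BOT_SUBSTRINGS):
          return True                                     # allowlisted bot skips the PoW
      return nonce is not None and verify(challenge, nonce)

In practice the allowlist would have to go by published IP ranges or reverse DNS rather than the User-Agent string, which is trivially spoofed.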


