Just block all of AWS, Alibaba, GCP and Azure, or throttle them aggressively. If you have clients/customers that need more requests per second, have them provide you with their IPs so you can exempt them.
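For what it's worth, here is a rough Python sketch of the throttle-plus-exemption idea, not a production setup. Everything named here is an assumption for illustration: `CLOUD_NETWORKS` uses documentation-only placeholder prefixes (you'd load real ranges from whatever source you trust), `CUSTOMER_ALLOWLIST` stands in for the IPs your customers gave you, and the rate numbers are arbitrary. It applies a simple token bucket per cloud IP and leaves everyone else alone.

```python
import ipaddress
import time

# Placeholder prefixes from the RFC 5737 documentation blocks, NOT real cloud ranges.
CLOUD_NETWORKS = [ipaddress.ip_network(n) for n in ("192.0.2.0/24", "198.51.100.0/24")]
# Hypothetical customer-supplied address that is exempt from throttling.
CUSTOMER_ALLOWLIST = {ipaddress.ip_address("192.0.2.10")}

CLOUD_RATE = 1.0   # tokens refilled per second (arbitrary)
CLOUD_BURST = 5.0  # maximum bucket size (arbitrary)

_buckets = {}  # address -> (tokens, time of last update)


def is_cloud(addr):
    """True if the address falls inside any configured cloud prefix."""
    return any(addr in net for net in CLOUD_NETWORKS)


def allow_request(ip, now=None):
    """Return True if the request should be served, False if throttled."""
    now = time.monotonic() if now is None else now
    addr = ipaddress.ip_address(ip)
    # Exempt customers and non-cloud traffic are never throttled here.
    if addr in CUSTOMER_ALLOWLIST or not is_cloud(addr):
        return True
    # Token bucket: refill based on elapsed time, then try to spend one token.
    tokens, last = _buckets.get(addr, (CLOUD_BURST, now))
    tokens = min(CLOUD_BURST, tokens + (now - last) * CLOUD_RATE)
    if tokens < 1.0:
        _buckets[addr] = (tokens, now)
        return False
    _buckets[addr] = (tokens - 1.0, now)
    return True
```

In practice you would probably do this at the firewall or reverse proxy rather than in application code; the sketch only shows the decision logic.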
The problem is that these companies are fairly well funded and renting infrastructure isn't an issue.
Exactly. They're renting infrastructure on well-known clouds, not cycling through consumer IPs like yesterday's botnets. Block all web traffic from well-known cloud IPs, and you can keep 99% of the LLM bots away. Alibaba seems to be the most common source of bot traffic on my infrastructure lately, and I also see Huawei Cloud from time to time. Not much AWS, probably because of their high IPv4 pricing.
You can allow API access from cloud IPs, as long as you don't do anything expensive before you've authenticated the client.
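A minimal sketch of that split in Python, assuming a hypothetical `/api/` path prefix and a credential check that has already been done cheaply (for example, validating a token) before any real work starts. The prefixes in `CLOUD_NETWORKS` are documentation placeholders, not real cloud ranges, and the names `decide` and `has_valid_credentials` are made up for illustration.

```python
import ipaddress

# Placeholder prefixes from the RFC 5737 documentation blocks, NOT real cloud ranges.
CLOUD_NETWORKS = [ipaddress.ip_network(n) for n in ("192.0.2.0/24", "198.51.100.0/24")]
API_PREFIX = "/api/"  # hypothetical path prefix for the API


def decide(ip, path, has_valid_credentials):
    """Return 'serve', 'block', or 'reject_unauthenticated' for a request."""
    addr = ipaddress.ip_address(ip)
    from_cloud = any(addr in net for net in CLOUD_NETWORKS)
    if not from_cloud:
        return "serve"
    # Ordinary web pages are not served to cloud IPs at all.
    if not path.startswith(API_PREFIX):
        return "block"
    # API calls from cloud IPs are fine, but only after the cheap auth check
    # has passed, so nothing expensive runs for anonymous clients.
    return "serve" if has_valid_credentials else "reject_unauthenticated"
```

For example, `decide("192.0.2.44", "/index.html", False)` returns `"block"`, while the same address hitting `/api/v1/items` with valid credentials is served.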
“…they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses - mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure - actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.”
So it looks like much of the traffic, particularly from China, is indeed using consumer IPs to disguise itself. That’s why they blocked based on browser type (MS Edge, in this case).
This matches exactly with what I'm seeing on my own sites too and it's from all over the world, not just China.
(I described my bot woes a few weeks ago at https://news.ycombinator.com/item?id=43208623. The "just block bots!" replies were well-intentioned but naive -- I've still found no signal that works reliably well to distinguish bots from real traffic.)
I saw a fair amount of that kind of behavior, too, mostly around the summer of last year. At some point it dropped off sharply. Over the last few months, at least for the servers I keep an eye on, most of the trouble has been from Chinese cloud IPs.
Either the LLM devs got more funding, or maybe the authorities took down the botnet they were using.