> According to Drew, LLM crawlers don't respect robots.txt requirements and include expensive endpoints like git blame, every page of every git log, and every commit in your repository. They do so using random User-Agents from tens of thousands of IP addresses, each one making no more than one HTTP request, trying to blend in with user traffic.
How do they know that these are LLM crawlers and not anything else?
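For reference, the kind of robots.txt a cgit host might publish to keep crawlers off the expensive endpoints named in the quote above (and which, per that quote, gets ignored anyway) looks roughly like this — the repository path is made up, and the URL patterns just follow cgit's usual layout:

    # robots.txt at the web root of the git host -- illustrative only, repo path is hypothetical
    User-agent: *
    # blame, per-commit and paged log views are the expensive ones to render
    Disallow: /example.git/blame/
    Disallow: /example.git/commit/
    Disallow: /example.git/log/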
As someone who is also affected by this: we've seen a manifold increase in requests since this LLM crap started. Many of these IPs come from companies that obviously work with LLM technology, but the problem is that it's 100s of IPs doing 1 request, not 1 IP doing 100s of requests. It's just extremely unlikely that anyone else is responsible for this.
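That "100s of IPs doing 1 request" pattern is easy to surface with one pass over the access log. A minimal sketch, assuming a combined-format log at a hypothetical path (the client IP is the first field):

    # count_requests_per_ip.py -- rough sketch; log path and format are assumptions
    from collections import Counter

    LOG = "/var/log/nginx/access.log"  # hypothetical location, combined log format

    counts = Counter()
    with open(LOG) as f:
        for line in f:
            ip = line.split(" ", 1)[0]  # client IP is the first field in combined format
            counts[ip] += 1

    # distribution of "how many IPs made exactly N requests"
    dist = Counter(counts.values())
    for n, ips in sorted(dist.items()):
        print(f"{ips:6d} IPs made {n} request(s)")

    # a crawler spread across many addresses shows up as a huge bucket at N=1
    print(f"total distinct IPs: {len(counts)}")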
> IPs come from companies that obviously work with LLM technology
Like from their own ASNs, you're saying? Or how are you connecting the IPs to the companies?
> is that it's 100s of IPs doing 1 request
Are all of those IPs within the same ranges or scattered?
Thanks a lot for taking the time to talk about your experience, btw. As someone who hasn't been hit by this, it's interesting to get more details about it before it eventually happens.
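On the "same ranges or scattered" question above: one quick check is to bucket the suspect IPs by network prefix. A rough stdlib-only sketch — the input file of IPs is hypothetical, and mapping the resulting prefixes to owning companies or ASNs would be a separate whois/ASN lookup:

    # group_ips_by_prefix.py -- rough sketch, stdlib only; input file is hypothetical
    import ipaddress
    from collections import Counter

    # one suspect client IP per line, e.g. pulled out of the access log beforehand
    with open("suspect_ips.txt") as f:
        ips = [line.strip() for line in f if line.strip()]

    prefixes = Counter()
    for ip in ips:
        addr = ipaddress.ip_address(ip)
        # coarse buckets: /24 for IPv4, /48 for IPv6
        bits = 24 if addr.version == 4 else 48
        prefixes[ipaddress.ip_network(f"{ip}/{bits}", strict=False)] += 1

    # a few dense prefixes -> likely one operator's ranges;
    # thousands of prefixes with one hit each -> "scattered" traffic blending in
    for net, n in prefixes.most_common(20):
        print(f"{n:6d}  {net}")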
> How do they know that these are LLM crawlers and not anything else?
I can tell you what it looks like in the case of a git web interface like cgit: you get a burst of one or two isolated requests each from a large number of IPs, for very obscure (but different) URLs, like the contents of a file at a specific commit id. And the user agent suggests it's coming from an iPhone or Android.
It's a situation where it's difficult to tell from any individual request at request-handling time, but easy to see when you look at the total request volume.
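That "easy in aggregate" observation can be turned into a crude heuristic: count how many distinct IPs requested commit-pinned URLs while making only one or two requests in total. A rough sketch along those lines — the log path, URL regex and threshold are all assumptions, not anyone's actual setup:

    # aggregate_signal.py -- rough sketch of the "easy in aggregate" signal
    import re
    from collections import Counter

    LOG = "/var/log/nginx/access.log"  # hypothetical path, combined log format
    # cgit-style obscure URLs: tree/blame/commit views pinned to a specific commit id
    OBSCURE = re.compile(r'"GET [^"]*/(?:tree|blame|commit)/[^"]*id=[0-9a-f]{7,40}')

    per_ip_total = Counter()    # all requests per IP
    per_ip_obscure = Counter()  # commit-pinned requests per IP

    with open(LOG) as f:
        for line in f:
            ip = line.split(" ", 1)[0]
            per_ip_total[ip] += 1
            if OBSCURE.search(line):
                per_ip_obscure[ip] += 1

    # IPs that fetched at least one commit-pinned URL but made almost no other requests
    suspects = [ip for ip in per_ip_obscure if per_ip_total[ip] <= 2]
    print(f"{len(suspects)} low-volume IPs hit commit-pinned URLs "
          f"({sum(per_ip_obscure.values())} such requests overall)")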