
> According to Drew, LLM crawlers don't respect robots.txt requirements and include expensive endpoints like git blame, every page of every git log, and every commit in your repository. They do so using random User-Agents from tens of thousands of IP addresses, each one making no more than one HTTP request, trying to blend in with user traffic.

How do they know that these are LLM crawlers and not something else?




As someone who is also affected by this: we've seen a manifold increase in requests since this LLM crap started. Many of these IPs come from companies that obviously work with LLM technology, but the problem is that it's hundreds of IPs doing one request each, not one IP doing hundreds of requests. It's just extremely unlikely that anyone else is responsible for this.


> IPs come from companies that obviously work with LLM technology

Like from their own ASNs you're saying? Or how are you connecting the IPs with the company?

> is that it's 100s of IPs doing 1 request

Are all of those IPs within the same ranges or scattered?

Thanks a lot for taking the time to talk about your experience btw, as someone who hasn't been hit by this it's interesting to have more details about it before it eventually happens.


> Like from their own ASNs you're saying? Or how are you connecting the IPs with the company?

Those are the ones that make it obvious, yes. It's not only those, but it's enough to connect the dots.

> Are all of those IPs within the same ranges or scattered?

The IP ranges are all over the place. Alibaba seems to have tons of small ASNs, for instance.
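
If it helps anyone else dig through their own logs, here's a rough sketch of that kind of IP-to-ASN attribution (assumes the Python ipwhois package; the IP in the list is just a placeholder, not one of the actual offenders):

  # Map source IPs to ASN / org so the clusters become visible.
  # pip install ipwhois
  from collections import Counter
  from ipwhois import IPWhois

  def asn_of(ip):
      # RDAP lookup returns the AS number plus a description string
      # identifying the operator of the range.
      info = IPWhois(ip).lookup_rdap(depth=1)
      return "AS{} {}".format(info["asn"], info["asn_description"])

  ips = ["203.0.113.5"]  # replace with IPs pulled from your access log
  print(Counter(asn_of(ip) for ip in ips).most_common(20))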


> How do they know that these are LLM crawlers and not anything else?

I can tell you what it looks like in the case of a git web interface like cgit: you get a burst of one or two isolated requests from a large number of IPs, each for a very obscure (but different) URL, like the contents of a file at a specific commit id. And the user agents suggest the requests are coming from iPhones or Android devices.
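
For the curious, this is roughly what spotting that pattern in aggregate looks like; a minimal sketch assuming an nginx/Apache combined-format access log and cgit-ish URL paths (both are assumptions, adjust to your own setup):

  # Flag the "many IPs, one or two deep-link requests each" pattern.
  import re
  from collections import defaultdict

  LOGLINE = re.compile(r'(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD) (\S+) [^"]*" \d+ \S+ "[^"]*" "([^"]*)"')
  DEEP = re.compile(r'/(commit|blame|log|tree)/|[?&]id=[0-9a-f]{7,40}')  # "obscure" git URLs
  MOBILE = re.compile(r'iPhone|Android')

  hits = defaultdict(list)  # ip -> [(path, user_agent), ...]
  with open("access.log") as f:
      for line in f:
          m = LOGLINE.match(line)
          if m:
              ip, path, ua = m.groups()
              hits[ip].append((path, ua))

  # IPs that made only one or two requests, all to deep git URLs,
  # while claiming to be mobile browsers -- the burst described above.
  suspects = [ip for ip, reqs in hits.items()
              if len(reqs) <= 2
              and all(DEEP.search(p) and MOBILE.search(ua) for p, ua in reqs)]
  print(len(suspects), "IPs matching the pattern")

Per request it looks like nothing; only the aggregate count gives it away.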


That was my reaction. It seems like the article is saying two mutually exclusive things:

- We cannot block them because we can’t differentiate legitimate traffic from illegitimate traffic…

- …but we can conclusively identify this traffic as coming from AI crawlers.


It's a situation where it's difficult to tell for individual requests at request handling time, but easy to see when you look at the total request volume.


It's the behavior of the traffic in hindsight that's obvious. It's difficult to identify in the moment. This is by design.

Getting caught isn't a big deal. Getting caught in the act is. As long as they get their data, it doesn't matter if they're caught afterwards.


In my case, no small fraction of the traffic was from OpenAI and Anthropic. There were also other user agents that literally said "AI".



