
Can we not just have a whitelist of allowed crawlers and ban the rest by default? Places like DuckDuckGo and Google can publish a list of IP addresses that their crawlers will come from. Then simply don't include major LLM providers like OpenAI.
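
Roughly, the allowlist half of that could look like the sketch below, assuming you have already collected the IP ranges the search engines publish into a local file of CIDR blocks (the file name and the idea of a single merged file are placeholders):

    import ipaddress

    # CIDR blocks collected from the crawler ranges that search engines
    # publish; "allowed_crawlers.txt" is a hypothetical local file.
    ALLOWED_CRAWLER_RANGES = [
        ipaddress.ip_network(line.strip())
        for line in open("allowed_crawlers.txt")
        if line.strip() and not line.startswith("#")
    ]

    def is_allowed_crawler(client_ip: str) -> bool:
        """True if the client IP falls inside any published crawler range."""
        ip = ipaddress.ip_address(client_ip)
        return any(ip in net for net in ALLOWED_CRAWLER_RANGES)

    # Anything that behaves like a crawler but fails this check gets
    # blocked; everything else is treated as a regular visitor.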


How do you distinguish crawlers from regular visitors using a whitelist? As stated in the article, the crawlers show up with seemingly unique IP addresses and seemingly real user agents. It's a cat and mouse game.

Only if you operate at the scale of Cloudflare and the like can you see which IP addresses are hitting a large number of servers in a short time span (rough sketch below).

(I am pretty sure that if blocking gets more successful, the next step will be handing out N free LLM requests per month in exchange for user machines doing the scraping.)

I fear the only solutions in the end are CDNs, making visits expensive with challenges, or requiring users to log in.
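
As for the cross-site detection mentioned above, here is a rough sketch, assuming you can merge access logs from many servers into (timestamp, client_ip, host) tuples sorted by time; the threshold and window are made-up numbers:

    from collections import defaultdict
    from datetime import timedelta

    FANOUT_THRESHOLD = 50           # distinct hosts; made-up number
    WINDOW = timedelta(minutes=10)  # sliding window; made-up number

    def suspicious_ips(records):
        """records: (timestamp, client_ip, host) tuples sorted by time,
        with timestamp as a datetime, merged from many servers' logs.
        Returns client IPs that hit an unusually large number of
        distinct hosts within WINDOW."""
        recent = defaultdict(list)  # ip -> list of (timestamp, host)
        flagged = set()
        for ts, ip, host in records:
            hits = recent[ip]
            hits.append((ts, host))
            # drop entries that fell out of the sliding window
            while hits and ts - hits[0][0] > WINDOW:
                hits.pop(0)
            if len({h for _, h in hits}) >= FANOUT_THRESHOLD:
                flagged.add(ip)
        return flagged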


How are the crawlers identifying themselves? If it's user-agent strings, then they can be faked. If it's cryptographically secured, then you create a situation where newcomers can't get into the market.


Google publishes the IP addresses that Googlebot uses. If someone claims to be Googlebot but is not coming from one of those addresses, it's a fake.
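
Concretely, Google documents verifying this with a reverse DNS lookup on the claimed IP (the host must be under googlebot.com or google.com), followed by a forward lookup that must resolve back to the same IP; they also publish the ranges as JSON if you prefer to match CIDRs. A sketch of the DNS check:

    import socket

    def is_real_googlebot(claimed_ip: str) -> bool:
        """Verify a claimed Googlebot IP via reverse + forward DNS."""
        try:
            hostname, _, _ = socket.gethostbyaddr(claimed_ip)
        except socket.herror:
            return False
        # The reverse lookup must land in Google's crawler domains.
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # The forward lookup must resolve back to the same IP, so a bot
        # can't just fake the PTR record on an IP range it controls.
        try:
            _, _, addresses = socket.gethostbyname_ex(hostname)
        except socket.gaierror:
            return False
        return claimed_ip in addresses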


And in that case both approaches end up in a situation where new entrants can't get in.


I don't see how that helps the case where the UA looks like a normal browser and the source IP looks residential.


What if they claim to be Google Chrome running on Windows 11, from a residential IP address? Is that a human or an AI bot?



I am pretty sure a number of crawlers are running inside users' mobile apps so they can get access to residential IP pools.


This is scary!


The problem is that many crawlers pretend to be humans. So to ban the rest of the crawlers by default, you'd have to ban humans too.


This sort of positive security model with behavioural analysis is the future. We need to get it built into Apache, Nginx, Caddy, etc. The trick is telling crawlers apart from regular users, but it can be done.
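
None of those servers ship this out of the box today as far as I know, so the following is only a rough illustration of the kind of per-IP behavioural scoring such a module could do; the thresholds and the choice of signals are made up:

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60        # made-up values, for illustration only
    MAX_REQUESTS = 120
    MAX_DISTINCT_PATHS = 80

    class BehaviourScorer:
        """Flags clients whose access pattern looks more like a crawler
        than a person: high request rate, high path diversity, and no
        static assets (browsers normally fetch CSS/JS/images)."""

        def __init__(self):
            self.history = defaultdict(deque)  # ip -> deque of (ts, path)

        def looks_like_crawler(self, ip: str, path: str) -> bool:
            now = time.time()
            hits = self.history[ip]
            hits.append((now, path))
            # keep only requests inside the sliding window
            while hits and now - hits[0][0] > WINDOW_SECONDS:
                hits.popleft()
            paths = {p for _, p in hits}
            fetches_assets = any(
                p.endswith((".css", ".js", ".png", ".jpg")) for p in paths
            )
            return (
                len(hits) > MAX_REQUESTS
                or (len(paths) > MAX_DISTINCT_PATHS and not fetches_assets)
            )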


Or an open list of IPs identified as belonging to AI companies, updated regularly, that firewalls can easily pull from? (Same idea as open-source AV.)
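
As far as I know no such shared list exists yet, but consuming one would be simple; here is a sketch that turns a hypothetical published list of CIDRs into an nginx deny snippet (the URL and file paths are made up):

    import ipaddress
    import urllib.request

    # Hypothetical shared blocklist of AI-crawler CIDRs, one per line.
    LIST_URL = "https://example.org/ai-crawler-ranges.txt"

    def fetch_ranges(url: str = LIST_URL):
        with urllib.request.urlopen(url) as resp:
            for line in resp.read().decode().splitlines():
                line = line.split("#")[0].strip()
                if line:
                    # validates each entry; raises on garbage lines
                    yield ipaddress.ip_network(line)

    def write_nginx_denylist(path: str = "/etc/nginx/ai-crawlers.conf"):
        """Write an include-able snippet with one `deny <cidr>;` line
        per range; reload nginx afterwards to pick it up."""
        with open(path, "w") as f:
            for net in fetch_ranges():
                f.write(f"deny {net};\n")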


> Or an open list of IPs identified as belonging to AI companies, updated regularly, that firewalls can easily pull from? (Same idea as open-source AV.)

I'm not so sure about this proposal; the majority of bots are going to be coming from residential IPs the minute you do this.[1]

[1] The AI SaaS will simply run a background worker on the client to do its search indexing.


You can have a whitelist of allowed users and ban everyone else by default, which I think is where this will eventually take us.



