Can we not just have a whitelist of allowed crawlers and ban the rest by default? Places like DuckDuckGo and Google could publish the lists of IP addresses their crawlers come from. Then simply don't include major LLM providers like OpenAI.
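Rough sketch of the idea, assuming the engines publish their ranges as machine-readable CIDR lists (Google does publish a googlebot.json along these lines; the URL and JSON keys below are from memory, and other engines would need their own feeds added):

```python
# Default-deny sketch: only treat traffic as a "good" crawler if the source IP
# falls inside a published range for a crawler we explicitly want.
import ipaddress
import json
import urllib.request

# Googlebot ranges are published as JSON; URL and key names per Google's docs,
# to the best of my memory. Other engines' feeds would go in this list too.
FEEDS = ["https://developers.google.com/search/apis/ipranges/googlebot.json"]

def load_allowed_networks():
    nets = []
    for url in FEEDS:
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        for prefix in data.get("prefixes", []):
            cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
            if cidr:
                nets.append(ipaddress.ip_network(cidr))
    return nets

ALLOWED_NETWORKS = load_allowed_networks()

def is_allowed_crawler(client_ip: str) -> bool:
    """True if client_ip is inside a published range of a whitelisted crawler."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)
```

Anything not on a feed gets the default-deny treatment, which is of course also the newcomer problem raised downthread.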
How do you distinguish crawlers from regular visitors using a whitelist? As stated in the article, the crawlers show up with seemingly unique IP addresses and seemingly real user agents. It's a cat and mouse game.
Only if you operate at the scale of Cloudflare and the like can you see which IP addresses are hitting a large number of servers in a short time span.
(I am pretty sure the next step will be handing out N free LLM requests per month in exchange for user machines doing the scraping, if blocking gets more successful.)
I fear the only solutions in the end are CDNs, making visits expensive with challenges, or requiring users to log in.
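The "making visits expensive" part can be as simple as a hashcash-style proof of work; a minimal sketch (the difficulty and token format are made up for illustration, this is not any particular CDN's scheme):

```python
# Hashcash-style proof-of-work sketch: server issues a nonce, the client must
# find a counter whose hash has DIFFICULTY_BITS leading zero bits before the
# page is served. Negligible for one human visit, costly at crawl volume.
import hashlib
import secrets

DIFFICULTY_BITS = 20  # illustrative; tune until abuse becomes uneconomical

def issue_challenge() -> str:
    return secrets.token_hex(16)

def verify(challenge: str, counter: int) -> bool:
    digest = hashlib.sha256(f"{challenge}:{counter}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

def solve(challenge: str) -> int:
    # This loop is what the visitor's browser (or the crawler) has to burn CPU on.
    counter = 0
    while not verify(challenge, counter):
        counter += 1
    return counter
```

The commercial challenge services are essentially industrialized versions of this plus fingerprinting.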
How are the crawlers identifying themselves? If it's user agent strings then they can be faked. If it's cryptographically secured then you create a situation where newcomers can't get into the market.
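For what it's worth, the big search engines answer this today with IP-level verification rather than cryptography: their docs describe a reverse-DNS-then-forward-confirm check, so a faked Googlebot/Bingbot user agent can be caught. A sketch (the hostname suffixes are the ones I remember from their docs):

```python
# Verify a claimed search-engine crawler: the IP's PTR record must end in the
# engine's domain, and that hostname must resolve back to the same IP.
# A bot that only fakes the user-agent string fails this check.
import socket

VERIFIED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_verified_crawler(client_ip: str) -> bool:
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(client_ip)  # reverse lookup
    except OSError:
        return False
    if not hostname.endswith(VERIFIED_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward confirm
    except OSError:
        return False
    return client_ip in forward_ips
```

That doesn't solve the newcomer problem, of course: anyone not on the suffix list is just a regular visitor, which is exactly your point.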
This sort of positive security model with behavioural analysis is the future. We need to get it built into Apache, Nginx, Caddy, etc. The trick is telling crawlers apart from users, but it can be done.
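Even a crude per-IP heuristic would catch a lot: request rate over a sliding window, plus whether the client ever fetches assets the way a real browser does. Something like this (thresholds invented for illustration) is what would have to live in an Apache/Nginx/Caddy module rather than in application code:

```python
# Crude behavioural heuristic: sliding window of request timestamps per IP,
# plus the ratio of asset fetches to page fetches. Real browsers pull
# CSS/JS/images; bulk crawlers mostly do not. Thresholds are illustrative.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120   # invented threshold
MIN_ASSET_RATIO = 0.1           # invented threshold
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".svg", ".woff2")

request_times = defaultdict(deque)  # ip -> recent request timestamps
asset_hits = defaultdict(int)       # ip -> asset requests seen
page_hits = defaultdict(int)        # ip -> page requests seen

def looks_like_crawler(ip: str, path: str) -> bool:
    """Record one request and return True if this IP now looks automated."""
    now = time.monotonic()
    window = request_times[ip]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()

    if path.endswith(ASSET_SUFFIXES):
        asset_hits[ip] += 1
    else:
        page_hits[ip] += 1

    too_fast = len(window) > MAX_REQUESTS_PER_WINDOW
    never_loads_assets = page_hits[ip] > 20 and (
        asset_hits[ip] / page_hits[ip] < MIN_ASSET_RATIO
    )
    return too_fast or never_loads_assets
```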
Or an open, regularly updated list of IPs identified as belonging to AI companies, which firewalls can easily pull from? (Same idea as open-source AV.)
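Consuming it could be a cron job that pulls the list and regenerates a firewall set; a sketch (the feed URL is hypothetical, since such a shared list would have to exist first, and re-applying it needs a table flush I've left out):

```python
# Sketch: fetch a community-maintained list of AI-crawler IPs/CIDRs and load
# it into an nftables set. The feed URL is hypothetical. Re-applying over an
# existing table needs a flush/delete first, omitted here for brevity.
import subprocess
import urllib.request

BLOCKLIST_URL = "https://example.org/ai-crawler-ips.txt"  # hypothetical feed

def fetch_blocklist() -> list[str]:
    with urllib.request.urlopen(BLOCKLIST_URL) as resp:
        lines = resp.read().decode().splitlines()
    return [l.strip() for l in lines if l.strip() and not l.startswith("#")]

def apply_blocklist(cidrs: list[str]) -> None:
    ruleset = "\n".join([
        "table inet ai_block {",
        "  set crawlers {",
        "    type ipv4_addr",
        "    flags interval",
        f"    elements = {{ {', '.join(cidrs)} }}",
        "  }",
        "  chain input {",
        "    type filter hook input priority 0; policy accept;",
        "    ip saddr @crawlers drop",
        "  }",
        "}",
    ])
    subprocess.run(["nft", "-f", "-"], input=ruleset.encode(), check=True)

if __name__ == "__main__":
    apply_blocklist(fetch_blocklist())
```

The hard part is curating the list, not consuming it.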
> Or an open, regularly updated list of IPs identified as belonging to AI companies, which firewalls can easily pull from? (Same idea as open-source AV.)
I'm not so sure about this proposal; the majority of bots will be coming from residential IPs the minute you do this.[1]
[1] The AI SaaS providers will simply run a background worker on the client to do their search indexing.