
The big takeaway here is that Google's dominance over the web (and that of advertising in general) is going away.

This is because the only way to stop the bots is with a captcha, which also stops search indexers from crawling your site. The result is that search engines stop indexing sites and hence stop providing value.

There will probably be a lag as the knowledge in current LLMs dries up, because no one will be able to scrape the web in an automated fashion anymore.

It'll all burn down.



I actually envision Lyapunov stability, like wolf and rabbit populations. In this scenario, we're the rabbits. Human content will increase when AI populations decrease, thus providing more food for the AI, which will then increase. This drowns out human expression, and the humans grow quieter. That provides less fodder for the AI, and they decrease. This means less noise, and the humans grow louder. The cycle repeats ad nauseam.
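For reference, the dynamics being described are essentially the Lotka-Volterra predator-prey equations (a textbook form, with human content as prey $x$ and AI output as predator $y$; the variable names are just illustration, not anything from the thread):

$$\frac{dx}{dt} = \alpha x - \beta xy, \qquad \frac{dy}{dt} = \delta xy - \gamma y$$

Their closed orbits around the equilibrium are exactly the boom-and-bust cycling described above.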


Until broken by the Butlerian Jihad: "Thou shalt not make a machine in the likeness of the mind of man."


I've thought along similar lines for art: what ecological niches are there where AI can't participate, where training data is harder or uneconomical to pull, and where humans can flourish?


Anything we humans deem private in nature from other humans.



If the logistic driving parameter is large enough, it can also lead to complete chaos.
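Presumably this refers to the logistic map,

$$x_{n+1} = r\,x_n(1 - x_n),$$

which settles to a fixed point for small $r$, goes through period-doubling oscillations as $r$ grows, and becomes chaotic for $r \gtrsim 3.57$.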


IMO this was one of the real motives for Web Environment Integrity. Allow Google to index but nobody else.

We're kind of stuck between a rock and a hard place here. Which do you prefer, entrenched incumbents or affordable/open hosting?


I’m supremely confident that attestation will arrive in one form or another in the near future.

Anonymous browsing and potentially-malicious bots look identical. This was sort of OK up until now.


Agreed, it seems inevitable. Unfortunately I think it will also result in further centralization & consolidation into a handful of "trusted" megacorps.

If you thought browser fingerprinting for ad tracking was creepy, just wait until they're using your actual fingerprint.


does indeed sound like we're headed right back to AOL. At least this time it'll be faster? Certainly won't be as charming.


Google is already scraping your site and presenting answers directly in search results. If I cared about traffic (hence selling ad space), why would I want my site indexed by Google at all anymore? Lots of advertising-supported sites are going to go dark because only bots will visit them.


It will entrench established search engines even more if sites have to move to auth-based crawling, so that the only crawlers allowed are the ones you invite. Most people will do this for Google, Bing, and maybe one or two others if there is a simple tool for it.


> The big takeaway here is that Google's dominance over the web (and that of advertising in general) is going away.

AI companies with the best anti-captcha mechanics will win, and they will inject ads into LLM output in more sophisticated ways.


This could not be further from the truth. The ad business is not going anywhere; it will grow even bigger.

OpenAI is still going through the initial cycle of enshittification, and Google is too big right now. Once they establish dominance, you will have to sit through five unskippable ads between prompts, even on a paid plan.

I solved this problem for myself. Most of my web projects use client-side processing, and I moved to GitHub Pages, so clients can use my projects with no downtime. The pages use SQLite as the data source: the browser first downloads the SQLite database, then uses it to display data on the client side.

Example 'search' project: https://rumca-js.github.io/search
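For anyone curious, here is a minimal sketch of that pattern, assuming sql.js as the in-browser SQLite engine and a hypothetical data.sqlite file with an entries(title, url) table published next to the page (not the actual schema of the linked project):

```ts
// Query a pre-built SQLite file entirely in the browser: fetch the
// database published with the static site, open it with sql.js (WASM
// SQLite), and run SQL locally with no server round-trip.
import initSqlJs from "sql.js";

export async function search(term: string) {
  const SQL = await initSqlJs({
    // Location of the sql.js WASM binary (adjust to wherever it is hosted).
    locateFile: (file: string) => `https://sql.js.org/dist/${file}`,
  });

  // data.sqlite and the entries(title, url) table are illustrative names.
  const buf = await fetch("data.sqlite").then((r) => r.arrayBuffer());
  const db = new SQL.Database(new Uint8Array(buf));

  const results = db.exec(
    "SELECT title, url FROM entries WHERE title LIKE $q LIMIT 20",
    { $q: `%${term}%` }
  );
  db.close();
  return results;
}
```

The appeal is that the host only ever serves static files, so there is nothing for scrapers to overload and nothing to keep online beyond the CDN.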


The stated problem was about indexing, accessing content and advertising in that context.

> I solved this problem for myself. Most of my web projects use client-side processing, and I moved to GitHub Pages, so clients can use my projects with no downtime. The pages use SQLite as the data source: the browser first downloads the SQLite database, then uses it to display data on the client side.

> Example 'search' project: https://rumca-js.github.io/search

That is not really a solution. Since typical indexing still works for the masses, your approach is currently unusual. But in the end, bots will be capable of reading web page content wherever a human is capable of reading it, and we are back at the original problem of trying to tell bots apart from humans. It's the only way.


What about the next generation of AI that will be able to sign up autonomously? Even if we implemented auth walls everywhere right now, what's stopping the companies from hiring some very cheap labor to create accounts on websites and use them to scrape the content?

Is it going to become another race, like adblocker -> adblocker detector -> detector bypass, and so on?


Can we not just have a whitelist of allowed crawlers and ban the rest by default? Then places like DuckDuckGo and Google can provide a list of IP addresses that their crawlers will come from, and we simply don't include major LLM providers like OpenAI.


How do you distinguish crawlers from regular visitors using a whitelist? As stated in the article, the crawlers show up with seemingly unique IP addresses and seemingly real user agents. It's a cat and mouse game.

Only if you operate at the scale of Cloudflare, etc., can you see which IP addresses are hitting a large number of servers in a short time span.

(I am pretty sure the next step will be handing out N free LLM requests per month in exchange for user machines doing the scraping, if blocking gets more successful.)

I fear the only solutions in the end are CDNs, making visits expensive using challenges, or requiring users to log in.
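To make "making visits expensive using challenges" concrete, here is a sketch of a hashcash-style proof-of-work check; this is my own illustration, not any particular CDN's scheme, and the difficulty constant is an assumption:

```ts
// Hashcash-style proof of work: the server hands out a random challenge,
// and the client must find a nonce whose SHA-256(challenge + nonce) has a
// given number of leading zero bits before the request is let through.
import { createHash, randomBytes } from "node:crypto";

const DIFFICULTY_BITS = 20; // ~1M hash attempts on average (tune to taste)

export function makeChallenge(): string {
  return randomBytes(16).toString("hex");
}

function leadingZeroBits(buf: Buffer): number {
  let bits = 0;
  for (const byte of buf) {
    if (byte === 0) { bits += 8; continue; }
    bits += Math.clz32(byte) - 24; // clz32 counts a 32-bit value; bytes use the low 8 bits
    break;
  }
  return bits;
}

export function verifySolution(challenge: string, nonce: string): boolean {
  const digest = createHash("sha256").update(challenge + nonce).digest();
  return leadingZeroBits(digest) >= DIFFICULTY_BITS;
}

// Client side: brute-force a nonce. Cheap for one page view, expensive
// when multiplied across millions of crawled URLs.
export function solve(challenge: string): string {
  for (let n = 0; ; n++) {
    if (verifySolution(challenge, String(n))) return String(n);
  }
}
```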


How are the crawlers identifying themselves? If it's user agent strings then they can be faked. If it's cryptographically secured then you create a situation where newcomers can't get into the market.


Google publishes the IP addresses that Googlebot uses. If something claims to be Googlebot but is not coming from one of those addresses, it's a fake.
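A sketch of that check, assuming Google's published googlebot.json list keeps its current URL and { prefixes: [{ ipv4Prefix }] } shape (worth re-checking against their docs); IPv6 prefixes are skipped for brevity:

```ts
// Check whether a visitor claiming to be Googlebot really comes from one
// of Google's published crawler IP ranges.
const GOOGLEBOT_RANGES_URL =
  "https://developers.google.com/static/search/apis/ipranges/googlebot.json";

function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => ((acc << 8) + Number(octet)) >>> 0, 0);
}

function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsStr] = cidr.split("/");
  const bits = Number(bitsStr);
  if (bits === 0) return true;
  const mask = (~0 << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

export async function isGooglebotIp(ip: string): Promise<boolean> {
  const { prefixes } = await fetch(GOOGLEBOT_RANGES_URL).then((r) => r.json());
  return prefixes.some((p: { ipv4Prefix?: string }) =>
    p.ipv4Prefix ? inCidr(ip, p.ipv4Prefix) : false
  );
}
```

You would only run this for requests whose user agent claims to be Googlebot; everything else falls through to whatever default policy you pick.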


And in that case both systems end up in a situation where new entrants can't enter the market.


I don't see how that helps the case where the UA looks like a normal browser and the source IP looks residential.


How about if they claim to be Google Chrome running on Windows 11, from a residential IP address? Is that a human or an AI bot?



I am pretty sure a number of crawlers are running inside the mobile apps of phone users so they can get residential IP pools.


This is scary!


The problem is many crawlers pretend to be humans. So to ban the rest of the crawlers by default, you'll have to ban humans.


This sort of positive security model with behavioural analysis is the future. We need to get it built into Apache, Nginx, Caddy, etc. The trick is telling crawlers apart from regular users. It can be done, though.


Or an open list of IPs identified as belonging to AI companies, updated regularly, that firewalls can easily pull from? (Same idea as open-source AV.)


> Or an open list of IPs identified as belonging to AI companies, updated regularly, that firewalls can easily pull from? (Same idea as open-source AV.)

I don't really know about this proposal; the majority of bots are going to be coming from residential IPs the minute you do this.[1]

[1] The AI SaaS will simply run a background worker on the client to do their search indexing.


You can have a whitelist for allowed users and ban everyone else by default, which I think is where this will eventually take us.


AI is good at solving captchas. But even if everyone added a captcha, search engines would continue indexing, because it is easy to add authentication that lets search engines bypass the captcha: Google would just need to publish a public key.
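A sketch of what "publish a public key" could look like in practice: the crawler signs each request and the site verifies the signature against the published key. The header names, the signed-string format, and the Ed25519 choice are all my own illustrative assumptions; no such standard exists today.

```ts
// Verify a crawler request signed with the search engine's private key.
// The site only needs the engine's published public key (PEM) to check it.
import { createPublicKey, verify } from "node:crypto";

export function isAuthenticCrawler(
  publicKeyPem: string, // the key the search engine would publish
  req: { method: string; path: string; headers: Record<string, string> }
): boolean {
  // Illustrative header names, not an existing convention.
  const signature = req.headers["x-crawler-signature"];
  const date = req.headers["x-crawler-date"];
  if (!signature || !date) return false;

  // The crawler is assumed to have signed "METHOD path date".
  const signedString = `${req.method} ${req.path} ${date}`;

  return verify(
    null, // Ed25519 uses no separate digest algorithm
    Buffer.from(signedString),
    createPublicKey(publicKeyPem),
    Buffer.from(signature, "base64")
  );
}
```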


This is fine, as Google's utility as a search engine has turned into a hot pile of garbage, at least for my use cases. Where a decade ago I could put in a few keywords and get relevant results, I now have to guide it with several "quoted phrases" and -exclusions to get the result I'm looking for on the second or third result page. It has crumbled under its own weight, and seems to suggest irrelevant trash to me first and foremost because it's the website of some big player or content farm. Either their algorithm is tuned for mass manipulation or they lost the arms race with SEO cretins (or both).

Granted, I'm not looking forward to some LLM condensing all the garbage and handing me a Definitive Answer (TM) based on the information it deems relevant for inclusion.



