Google has made it very difficult to completely block their AI crawling, because the standard Googlebot search crawler also feeds data into AI Overviews and other AI features within Google Search. Google says there is a workaround, but it also blocks your site from fully indexing in Google Search. This is all covered in the article, though.
> What does an AI crawler do different from a search engine (indexing) crawler?
Many people don't want the extra bot traffic that AI brings to their site, especially when AI chat and Google's AI Overviews send so little traffic in return, and the traffic they do send pretty much always converts horrendously (something I've personally seen across multiple industries).
It doesn't seem like the extra traffic is the issue. People don't want Google's AI reading and summarizing their data and thus preventing clickthroughs. Why would I click on your site if Google already did the work of giving me the answer?
Both are an issue. People don't want AI overviews cannibalizing their website traffic. People also don't want AI bots spamming their website with outrageous numbers of requests every day.
In the specific case of Google would there be any additional traffic that isn't just the normal googlebot? I can't imagine they would bother crawling twice for every site on the internet.
Google-Extended is the user agent token associated with AI crawling, but Googlebot also crawls to produce AI Overviews in addition to indexing your website in Google Search.
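If it helps to see what that split looks like in practice, here's a minimal robots.txt sketch: it opts out of Google-Extended (the token Google documents for controlling use of your content in its AI models) while leaving Googlebot alone. As noted above, this does not keep your pages out of AI Overviews, because those are generated from the same index Googlebot builds.

```
# Opt out of Google-Extended but keep normal search indexing.
User-agent: Google-Extended
Disallow: /

# Googlebot still allowed (and still feeds AI Overviews).
User-agent: Googlebot
Allow: /
```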
While the number of crawlers and their overlapping responsibilities makes it hard to know which ones you can safely block, I should also say that the pure AI companies' bots behave 1000x worse than Google's crawlers when it comes to flooding your site with scraping requests.
This is a problem which needs regulatory action, not one which should be solved by a quasi-monopoly forcing compliance onto everyone except another quasi-monopoly that can use its market power to avoid it.
Require:
- respecting robots.txt and similar (see the sketch after this list)
- purpose binding/separation (of the crawler agent, but also of the retrieved data), similar to what GDPR does
- public documentation of each agent's purpose and stable agent identities
- no obfuscation of who is crawling what
- actual enforcement
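For what "respecting robots.txt" means on the crawler side, here's a minimal sketch using Python's standard urllib.robotparser; the agent name and URLs are made up for illustration:

```python
from urllib import robotparser

AGENT = "ExampleAIBot"  # hypothetical, stable agent identity
ROBOTS_URL = "https://example.com/robots.txt"

# Fetch and parse the site's robots.txt once.
rp = robotparser.RobotFileParser()
rp.set_url(ROBOTS_URL)
rp.read()

def allowed(url: str) -> bool:
    # Check whether this agent may fetch the given URL.
    return rp.can_fetch(AGENT, url)

if __name__ == "__main__":
    for url in ("https://example.com/", "https://example.com/private/report"):
        print(url, "->", "fetch" if allowed(url) else "skip")
```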
And sure, making something illegal doesn't prevent anyone from being technically able to do it,
but now at least large companies like Google have to decide whether they want to commit a crime, and the more they obfuscate what they are doing, the more proof there is that it was done in bad faith, i.e. the more judges can push punitive damages.
Combine that with internet gateways like CF trying to provide technical enforcement and you might have a good solution.
But one quasi-monopoly trying to force another to "comply" with its money-making scheme (even if it's in the interest of the end user) smells a lot like a winnable case against CF wrt. unfair market practices, abuse of monopoly power, etc.
I find it wild that you focus on CF being a monopoly here when they are providing tools that help publishers not have all of their content stolen and repurposed. AI companies have been notorious over the last few years for not respecting any directives and spamming sites with requests to scrape all of their data.
There is also nothing stopping other CDN/DNS providers from spinning up a marketplace similar to what CF is building now.
I thought we were broadly opposed to regulatory action for a number of reasons, including anti-socialism ideology, dislike of "red tape", and belief that free markets can solve problems.