Crawling the internet is a natural monopoly. Nobody wants an endless stream of bots crawling their site, so googlebot wins because they’re the dominant search engine.
It makes sense to break that out so everyone has access to the same dataset at FRAND pricing.
My heart just wants Google to burn to the ground, but my brain says this is the more reasonable approach.
This is similar to the natural monopoly of root DNS servers (managed as a public good). There is no reason more money couldn't go into either Common Crawl, or something like it. The Internet Archive can persist the data for ~$2/GB in perpetuity (although storing it elsewhere is also fine imho) as the storage system of last resort. How you provide access to this data is, I argue, similar to how access to science datasets is provided by custodian institutions (examples would be NOAA, CERN, etc).
Build foundations on public goods, very broadly speaking (think OSI model, but for entire systems). This helps society avoid the grasp of Big Tech and their endless desire to build moats for value capture.
The problem with this is in the vein of `Requires immediate total cooperation from everybody at once` if it's going to replace googlebot. Everyone who only allows googlebot would need to change and allow ccbot instead.
It's already the case that googlebot is the common denominator bot that's allowed everywhere, ccbot not so much.
Wouldn’t a decent solution, if some action happened where Google was divesting the crawler stuff, be to just do like browser user agents have always done (in that case multiple times to comical degrees)? Something like ‘Googlebot/3.1 (successor, CommonCrawl 1.0)’
Lots of good replies to your comment already. I'd also offer up Cloudflare providing the option to crawl customer origins, with them shipping the compressed archives off to Common Crawl for storage. This gives site admins and owners control over the crawling, and reduces unnecessary load, since someone like Cloudflare can manage the crawler worker queue and network shipping internally.
Wait, is the suggestion here just about crawling and storing the data? That's a very different thing than "Google's search index"... And yeah, I would agree that it is undifferentiated.
Hosting costs are so minimal today that I don't think crawling is a natural monopoly. How much would it really cost a site to be crawled by 100 search engines?
A potentially shocking amount depending on the desired freshness if the bot isn’t custom tailored per site. I worked at a job posting site and Googlebot would nearly take down our search infrastructure because it crawled jobs via searching rather than the index.
Bots are typically tuned to work with generic sites rather than to crawl any particular site efficiently.
No, in our case they were indexing job posts by sending search requests. Ie instead of pulling down the JSON files of jobs, they would search for them by sending stuff like “New York City, New York software engineer” to our search. Generally not cached because the searches weren’t something humans would search for (they’d use the location drop down).
I didn’t work on search, but yeah, something like Elasticsearch. Googlebot was a majority of our search traffic at times.
> Vault offers a low-cost pricing model based on a one-time price per-gigabyte/terabyte for data deposited in the system, with no additional annual storage fees or data egress costs.
What's the read throughput to get the data back out, and does it scale to what you'd need to have N search indexes building on top of this shared crawl?
Of all the bad ideas I've heard for where to slice Google to break it up, this... is actually the best idea.
The indexer, without direct Google influence, is primarily incentivized to play nice with site administrators. This gives it reasons to take both network integrity and privacy concerns more seriously (though Google has generally been good about these things, I think the damage is done on privacy: the brand name is toxic regardless of the actual behavior).
A caching proxy costs you almost nothing and will serve thousands of requests per second on ancient hardware. Actually there's never been a better time in the history of the Internet to have competing search engines since there's never been so much abundance of performance, bandwidth, and software available at historic low prices or for free.
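To make that concrete, here's a toy version of the idea in Python (the upstream address, port, and TTL are placeholders; in practice you'd reach for nginx, Varnish, or a CDN rather than rolling your own):

```python
# Toy caching reverse proxy: repeat requests are served from memory,
# so crawler traffic rarely reaches the origin. Illustration only.
import http.server
import time
import urllib.request

UPSTREAM = "http://127.0.0.1:8000"  # placeholder origin server
TTL = 60                            # seconds to keep serving a cached copy
cache = {}                          # path -> (fetched_at, content_type, body)

class CachingProxy(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        entry = cache.get(self.path)
        if entry is None or time.time() - entry[0] > TTL:
            # Cache miss or stale entry: fetch once from the origin.
            with urllib.request.urlopen(UPSTREAM + self.path) as resp:
                entry = (time.time(),
                         resp.headers.get("Content-Type", "text/html"),
                         resp.read())
            cache[self.path] = entry
        _, ctype, body = entry
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    http.server.ThreadingHTTPServer(("", 8080), CachingProxy).serve_forever()
```

The point isn't this particular toy, it's that once responses are cacheable, a hundred crawlers hitting the same pages mostly hit the cache, not the origin.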
There are so many other bots/scrapers out there that literally return zero that I don’t blame site owners for blocking all bots except googlebot.
Would it be nice if they also allowed altruist-bot or common-crawler-bot? Maybe, but that’s their call and a lot of them have made it on a rational basis.
> that I don’t blame site owners for blocking all bots except googlebot.
I doubt this is happening outside of a few small hobbyist websites where crawler traffic looks significant relative to human traffic. Even among those, it’s so common to move to static hosting with essentially zero cost and/or sign up for free tiers of CDNs that it’s just not worth it outside of edge cases like trying to host public-facing Gitlab instances with large projects.
Even then, the ROI on setting up proper caching and rate limiting far outweighs the ROI on trying to play whack-a-mole with non-Google bots.
Even if someone did go to all the lengths to try to block the majority of bots, I have a really hard time believing they wouldn’t take the extra 10 minutes to look up the other major crawlers and put those on the allow list, too.
This whole argument about sites going to great lengths to block search indexers but then stopping just short of allowing a couple more of the well-known ones feels like mental gymnastics for a situation that doesn’t occur.
> sites going to great lengths to block search indexers
That's not it. They're going to great lengths to block all bot traffic because of abusive and generally incompetent actors chewing through their resources. I'll cite that Anubis has made the front page of HN several times within the past couple of months. It is far from the first or only solution in that space, merely one of many alternatives to the solutions provided by centralized services such as Cloudflare.
Regarding allowlisting the other major crawlers: I've never seen any significant amount of traffic coming from anything but Google or Bing. There's the occasional click from one of the resellers (ecosia, brave search, duckduckgo etc), but that's about it. Yahoo? haven't seen them in ages, except in Japan. Baidu or Yandex? might be relevant if you're in their primary markets, but I've never seen them. Huawei's Petal Search? Apple Search? Nothing. Ahrefs & friends? No need to crawl _my_ website, even if I wanted to use them for competitor analysis.
So practically, there's very little value in allowing those. I usually don't bother blocking them, but if my content wasn't easy to cache, I probably would.
In the past month there were dozens of posts about using proof of work and other methods to defeat crawlers. I don't think most websites tolerate heavy crawling in the era of Vercel/AWS's serverless "per request" and bandwidth billing.
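For anyone who hasn't looked at how those proof-of-work gates function, here's a toy version of the idea in Python (the difficulty value and challenge format are made up for illustration; real tools like Anubis run the hashing in the visitor's browser and have many more moving parts):

```python
# Toy proof-of-work gate: the server issues a challenge, the client must find
# a nonce whose SHA-256 hash has enough leading zero bits before being served.
# Cheap for one human pageview, expensive for a crawler making millions.
import hashlib
import secrets

DIFFICULTY_BITS = 20  # arbitrary; tune so a real client solves it in ~a second

def meets_difficulty(digest: bytes, bits: int) -> bool:
    value = int.from_bytes(digest, "big")
    return value >> (len(digest) * 8 - bits) == 0

def solve(challenge: str) -> int:
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if meets_difficulty(digest, DIFFICULTY_BITS):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return meets_difficulty(digest, DIFFICULTY_BITS)

challenge = secrets.token_hex(16)   # server-issued, single-use in practice
nonce = solve(challenge)            # work the client has to do
print(verify(challenge, nonce))     # the server's check is a single hash
```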
You don't get to tell site owners what to do. The actual facts on the ground are that they're trying to block your bot. It would be nice if they didn't block your bot, but the other, completely unnatural and advertising-driven, monopoly of hosting providers with insane per-request costs makes that impossible until they switch away.
You wouldn't have to make them micropayments, you can pay out once some threshold is reached.
Of course, it would incentivize the sites to make you want to crawl them more, but that might be a good thing. There would be pressure on you to focus on quality over quantity, which would probably be a good thing for your product.
Google search is a monopoly not because of crawling. It's because of all the data it has about website stats and user behavior. Google's original idea of ranking based on links doesn't work because it's too easily gamed. You have to know which websites are good based on user preferences, and that's where you need data. It's impossible to build anything similar to Google without access to large amounts of user data.
Sounds like you're implying that they are using Google Analytics to feed their ranking, but that's much easier to game than links are. User-signals on SERP clicks? There's a niche industry supplying those to SEOs (I've seen it a few times, I haven't seen it have any reliable impact).
> so googlebot wins because they’re the dominant search engine.
I think it's also important to highlight that sites explicitly choose which bots to allow in their robots.txt files, prioritizing Google, which reinforces its position as the de facto monopoly even when other bots are technically able to crawl them.
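To make the allowlisting concrete, here's a quick check from a crawler's perspective using Python's standard-library robots.txt parser (example.com and the user-agent tokens are placeholders):

```python
# Check which crawlers a site's robots.txt actually admits; many sites
# allow Googlebot and disallow everything else wholesale.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")  # placeholder site
rp.read()

for agent in ("Googlebot", "bingbot", "CCBot", "SomeNewSearchBot"):
    allowed = rp.can_fetch(agent, "https://example.com/some/page")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```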
> Crawling the internet is a natural monopoly. Nobody wants an endless stream of bots crawling their site,
Companies want traffic from any source they can get. They welcome every search engine crawler that comes along because every little exposure translates to incremental chances at revenue or growing audience.
I doubt many people are doing things to allow Googlebot but also ban other search crawlers.
> My heart just wants Google to burn to the ground
I think there’s a lot of that in this thread and it’s opening the door to some mental gymnastics like the above claim about Google being the only crawler allowed to index the internet.
> I doubt many people are doing things to allow Googlebot but also ban other search crawlers.
Sadly this is just not the case.[1][2] Google knows this too so they explicitly crawl from a specific IP range that they publish.[3]
I also know this, because I had a website that blocked any bots outside of that IP range. We had honeypot links (hidden to humans via CSS) that insta-banned any user or bot that clicked/fetched them. User-Agent from curl, wget, or any HTTP lib = insta-ban. Crawling links sequentially across multiple IPs = all banned. Any signal we found that indicated you were not a human using a web browser = ban.
We were listed on Google and never had traffic issues.
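For the curious, that kind of IP-range check is roughly this in Python, assuming you've already downloaded Google's published googlebot.json range list (the file name, schema, and sample address here are illustrative; confirm the real format against Google's documentation):

```python
# Verify that a client claiming to be Googlebot comes from a published range.
# The JSON layout mirrors Google's googlebot.json; confirm the real schema.
import ipaddress
import json

def load_ranges(path):
    with open(path) as f:
        data = json.load(f)
    nets = []
    for prefix in data.get("prefixes", []):
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if cidr:
            nets.append(ipaddress.ip_network(cidr))
    return nets

def is_real_googlebot(client_ip, nets):
    ip = ipaddress.ip_address(client_ip)
    return any(ip in net for net in nets)

nets = load_ranges("googlebot.json")           # previously downloaded range list
print(is_real_googlebot("66.249.66.1", nets))  # example address for illustration
```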
Are sites really that averse to having a few more crawlers than they already do? It would seem that it’s only a monopoly insofar as it’s really expensive to do and almost nobody else thinks they can recoup the cost.
We are routinely fighting off hundreds of bots at any given moment. Thousands and thousands per day, easily. US, China, Brazil, from hundreds of different IPs, dozens of different (and falsified!) user agents, all ignoring robots.txt and pushing over services that are needed by human beings trying to get work done.
EDIT: Just checked our anubis stats for the last 24h
CHALLENGE: 829,586
DENY: 621,462
ALLOW: 96,810
This is with a pretty aggressive "DENY" rule for a lot of the AI related bots and on 2 pretty small sites at $JOB. We have hundreds, if not thousands of different sites that aren't protected by Anubis (yet).
Anubis and efforts like it are a godsend for companies that don't want to pay off Cloudflare or some other "security" company peddling a WAF.
One is, suppose there are a thousand search engine bots. Then what you want is some standard facility to say "please give me a list of every resource on this site that has changed since <timestamp>" so they can each get a diff from the last time they crawled your site. Uploading each resource on the site to each of a thousand bots once is going to be irrelevant to a site serving millions of users (because it's a trivial percentage) and to a site with a small amount of content (because it's a small absolute number), which together constitute the vast majority of all sites.
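Sitemaps with <lastmod> already approximate that facility. A crawler-side sketch in Python of treating one as a changed-since feed (the sitemap URL and cutoff date are placeholders, and real sitemaps may be nested or gzipped):

```python
# Crawler-side sketch: read a sitemap's <lastmod> entries as a
# "what changed since my last visit" feed instead of re-fetching everything.
from datetime import datetime, timezone
from urllib.request import urlopen
from xml.etree import ElementTree

SITEMAP_URL = "https://example.com/sitemap.xml"          # placeholder
LAST_CRAWL = datetime(2024, 1, 1, tzinfo=timezone.utc)   # previous crawl time
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urlopen(SITEMAP_URL) as resp:
    tree = ElementTree.parse(resp)

changed = []
for url in tree.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    lastmod = url.findtext("sm:lastmod", namespaces=NS)
    if not (loc and lastmod):
        continue
    # lastmod is a W3C datetime, e.g. "2024-05-01" or "2024-05-01T12:00:00+00:00"
    ts = datetime.fromisoformat(lastmod)
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    if ts > LAST_CRAWL:
        changed.append(loc)

print(f"{len(changed)} URLs changed since {LAST_CRAWL.date()}")
```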
The other is, there are aggressive bots that will try to scrape your entire site five times a day even if nothing has changed and ignore robots.txt. But then you set traps like disallowing something in robots.txt and then ban anything that tries to access it, which doesn't affect legitimate search engine crawlers because they respect robots.txt.
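A minimal sketch of that trap as a Python WSGI app (the /trap path, the in-memory ban set, and the response text are all made up for illustration):

```python
# Honeypot sketch: robots.txt disallows /trap, and any client that fetches it
# anyway gets its IP banned for later requests. Illustration only; a real
# deployment would persist bans, expire them, and sit behind a proxy.
from wsgiref.simple_server import make_server

ROBOTS_TXT = b"User-agent: *\nDisallow: /trap\n"
banned_ips = set()

def app(environ, start_response):
    ip = environ.get("REMOTE_ADDR", "")
    path = environ.get("PATH_INFO", "/")

    if ip in banned_ips:
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"banned\n"]
    if path == "/robots.txt":
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [ROBOTS_TXT]
    if path.startswith("/trap"):
        # Only a client ignoring robots.txt (or following the hidden link) lands here.
        banned_ips.add(ip)
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"banned\n"]

    start_response("200 OK", [("Content-Type", "text/html")])
    # Hidden trap link: invisible to humans, tempting to naive scrapers.
    return [b'<p>hello</p><a href="/trap/1" style="display:none">.</a>\n']

if __name__ == "__main__":
    make_server("", 8080, app).serve_forever()
```

Legitimate crawlers never see the trap because they honor the Disallow line; only scrapers ignoring robots.txt (or blindly following the hidden link) burn their IPs.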
> then you set traps like disallowing something in robots.txt and then ban anything that tries to access it
That doesn't work at all when the scraper rapidly rotates IPs from different ASNs because you can't differentiate the legitimate from the abusive traffic on a per-request basis. All you can be certain of is that a significant portion of your traffic is abusive.
That results in aggressive filtering schemes which in turn means permitted bots must be whitelisted on a case by case basis.
> That doesn't work at all when the scraper rapidly rotates IPs from different ASNs because you can't differentiate the legitimate from the abusive traffic on a per-request basis.
Well sure you can. If it's requesting something which is allowed in robots.txt, it's a legitimate request. It's only if it's requesting something that isn't that you have to start trying to decide whether to filter it or not.
What does it matter if they use multiple IP addresses to request only things you would have allowed them to request from a single one?
> If it's requesting something which is allowed in robots.txt, it's a legitimate request.
An abusive scraper is pushing over your boxes. It is intentionally circumventing rate limits and (more generally) accurate attribution of the traffic source. In this example you have deemed such behavior to be abusive and would like to put a stop to it.
Any given request looks pretty much normal. The vast majority are coming from residential IPs (in this example your site serves mostly residential customers to begin with).
So what if 0.001% of requests hit a disallowed resource and you ban those IPs? That's approximately 0.001% of the traffic that you're currently experiencing. It does not solve your problem at all - the excessive traffic that is disrespecting rate limits and gumming up your service for other well-behaved users.
Why would it be only 0.001% of requests? You can fill your actual pages with links to pages disallowed in robots.txt which are hidden from a human user but visible to a bot scraping the site. Adversarial bots ignoring robots.txt would be following those links everywhere. It could just as easily be 50% of requests and each time it happens, they lose that IP address.
I mean, sure, but if there were 3 search engines instead of one, would you disallow two of them? The spam problem is one thing, but I don't think having ten search engines rather than two is going to destroy websites.
The claim that search is a natural monopoly because of the impact on websites of having a few more search competitors scanning them seems silly. I don’t think it’s a natural monopoly at all.
A "few" more would be fine - but the sheer scale of the malicious AI training bot crawling that's happening now is enough to cause real availability problems (and expense) for numerous sites.
One web forum I regularly read went through a patch a few months ago where it was unavailable for about 90% of the time due to being hammered by crawlers. It's only up again now because the owner managed to find a way to block them that hasn't yet been circumvented.
So it's easy to see why people would allow googlebot and little else.