This is similar to the natural monopoly of root DNS servers (managed as a public good). There is no reason more money couldn't go into either Common Crawl, or something like it. The Internet Archive can persist the data for ~$2/GB in perpetuity (although storing it elsewhere is also fine imho) as the storage system of last resort. How you provide access to this data is, I argue, similar to how access to science datasets is provided by custodian institutions (NOAA, CERN, etc.).
Build foundations on public goods, very broadly speaking (think OSI model, but for entire systems). This helps society avoid the grasp of Big Tech and their endless desire to build moats for value capture.
The problem with this is in the vein of `Requires immediate total cooperation from everybody at once` if it's going to replace googlebot. Everyone who only allows googlebot would need to change and allow ccbot instead.
It's already the case that googlebot is the common denominator bot that's allowed everywhere, ccbot not so much.
Wouldn’t a decent solution, if Google were ever made to divest the crawler, be to do what browser user agents have always done (in that case multiple times over, to comical degrees)? Something like ‘Googlebot/3.1 (successor, CommonCrawl 1.0)’.
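Here's a minimal sketch of why that kind of successor token could keep working, at least with Python's stdlib robots.txt parser: `urllib.robotparser` matches the robots.txt group token as a substring of the product token before the first `/`, so a policy written for Googlebot would still apply to the combined UA. Other parsers and WAF allowlists may behave differently, and the robots.txt below is just a typical "Googlebot only" policy made up for illustration.

```python
from urllib import robotparser

# A made-up but typical "Googlebot only" policy: allow Googlebot, block everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/jobs/123"

# Today's reality: the dedicated Common Crawl bot is shut out by this policy.
print(rp.can_fetch("CCBot/2.0", url))  # False

# A successor that keeps the Googlebot product token still matches the
# "User-agent: Googlebot" group, because the stdlib parser does a substring
# match on the token before the first "/".
print(rp.can_fetch("Googlebot/3.1 (successor, CommonCrawl 1.0)", url))  # True
```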
Lots of good replies to your comment already. I'd also offer up Cloudflare offering the option to crawl customer origins, with them shipping the compressed archives off to Common Crawl for storage. This gives site admins and owners control over the crawling, and reduces unnecessary load as someone like Cloudflare can manage the crawler worker queue and network shipping internally.
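Common Crawl already publishes its corpus as gzipped WARC files, so the "compressed archives" half of this is a well-trodden format. A rough sketch of what the capture side could look like using the warcio library; the URLs, headers, and file names are illustrative, and nothing here is an actual Cloudflare or Common Crawl API:

```python
# pip install warcio
from io import BytesIO

from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders


def append_capture(warc_path: str, url: str, body: bytes) -> None:
    """Append one captured origin response to a gzipped WARC file,
    the same container format Common Crawl distributes its crawls in."""
    with open(warc_path, "ab") as out:
        writer = WARCWriter(out, gzip=True)
        http_headers = StatusAndHeaders(
            "200 OK",
            [("Content-Type", "text/html; charset=utf-8"),
             ("Content-Length", str(len(body)))],
            protocol="HTTP/1.1",
        )
        record = writer.create_warc_record(
            url, "response", payload=BytesIO(body), http_headers=http_headers
        )
        writer.write_record(record)


# Hypothetical usage: a CDN-side worker replays an origin response it already
# served and appends it to a batch file that is later shipped off to the archive.
append_capture("origin-batch-0001.warc.gz", "https://example.com/", b"<html>hi</html>")
```

The appeal of this arrangement is that site owners opt in at the CDN layer and the crawl load is amortized over responses the edge was going to serve anyway.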
Wait, is the suggestion here just about crawling and storing the data? That's a very different thing than "Google's search index"... And yeah, I would agree that it is undifferentiated.
Hosting costs are so minimal today that I don't think crawling is a natural monopoly. How much would it really cost a site to be crawled by 100 search engines?
A potentially shocking amount, depending on the desired freshness, if the bot isn’t custom-tailored per site. I worked at a job posting site, and Googlebot would nearly take down our search infrastructure because it crawled jobs via searching rather than via the index.
Bots are typically tuned to work with generic sites rather than to crawl any particular site efficiently.
No, in our case they were indexing job posts by sending search requests. I.e., instead of pulling down the JSON files of jobs, they would search for them by sending queries like “New York City, New York software engineer” to our search. Those were generally not cached because they weren’t searches humans would make (humans use the location drop-down).
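To make the caching point concrete, here's a toy sketch (the numbers and the `lru_cache` stand-in for the search cluster are made up for illustration): human queries come from a small drop-down vocabulary, so they repeat and cache well, while bot-synthesized free-text queries are nearly all unique and every one of them lands on the backend.

```python
import random
from functools import lru_cache

# Toy stand-in for a query cache sitting in front of the search cluster.
backend_hits = 0

@lru_cache(maxsize=100_000)
def search(query: str) -> str:
    global backend_hits
    backend_hits += 1          # every cache miss is a real query to the cluster
    return f"results for {query!r}"

LOCATIONS = [f"city-{i}" for i in range(50)]   # what the location drop-down offers
TITLES = ["software engineer", "nurse", "driver"]

# Humans pick from the drop-down, so their queries repeat and cache well.
for _ in range(10_000):
    search(f"{random.choice(LOCATIONS)} {random.choice(TITLES)}")
print(f"human traffic: {backend_hits} backend queries out of 10,000 requests")

# A bot synthesizing free-text location strings almost never repeats itself,
# so nearly every request misses the cache and hits the search cluster.
backend_hits = 0
search.cache_clear()
for i in range(10_000):
    search(f"city-{i} metropolitan area {random.choice(TITLES)} jobs")
print(f"bot traffic:   {backend_hits} backend queries out of 10,000 requests")
```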
I didn’t work on search, but yeah, something like Elasticsearch. Googlebot was a majority of our search traffic at times.
> Vault offers a low-cost pricing model based on a one-time price per-gigabyte/terabyte for data deposited in the system, with no additional annual storage fees or data egress costs.
What's the read throughput to get the data back out, and does it scale to what you'd need to have N search indexes building on top of this shared crawl?