
https://commoncrawl.org/

This is similar to the natural monopoly of the root DNS servers (managed as a public good). There is no reason more money couldn't go into Common Crawl, or something like it. The Internet Archive can persist the data for ~$2/GB in perpetuity as the storage system of last resort (although storing it elsewhere is also fine imho). How you provide access to this data is, I argue, similar to how custodian institutions (NOAA, CERN, etc.) provide access to science datasets.

Build foundations on public goods, very broadly speaking (think OSI model, but for entire systems). This helps society avoid the grasp of Big Tech and their endless desire to build moats for value capture.



The problem with this is in the vein of `Requires immediate total cooperation from everybody at once` if it's going to replace googlebot: everyone who currently only allows googlebot would need to change their robots.txt and allow ccbot instead.

It's already the case that googlebot is the common denominator bot that's allowed everywhere, ccbot not so much.


Wouldn't a decent solution, if Google ever ended up divesting the crawler, be to do what browser user agents have always done (in that case multiple times, to comical degrees)? Something like 'Googlebot/3.1 (successor, CommonCrawl 1.0)'
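
A rough way to see why the "successor token" trick matters (a sketch with an assumed robots.txt and made-up user-agent strings, using Python's stdlib parser): robots.txt groups are matched on the product token before the "/", so a crawler that keeps "Googlebot" in its name inherits Googlebot's allowances, while a plain CCBot falls through to the catch-all rule.

    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt of a site that only lets Googlebot in.
    robots_txt = """\
    User-agent: Googlebot
    Allow: /

    User-agent: *
    Disallow: /
    """

    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())

    for ua in ("Googlebot/3.1 (successor, CommonCrawl 1.0)", "CCBot/2.0"):
        print(ua, "->", rp.can_fetch(ua, "https://example.com/page"))
    # Googlebot/3.1 (successor, CommonCrawl 1.0) -> True
    # CCBot/2.0 -> False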


Lots of good replies to your comment already. I'd also offer up Cloudflare giving customers the option to have their origins crawled, with Cloudflare shipping the compressed archives off to Common Crawl for storage. This gives site admins and owners control over the crawling, and it reduces unnecessary load, since someone like Cloudflare can manage the crawler worker queue and network shipping internally.

(Cloudflare customer, no other affiliation)
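
As a rough sketch of what that could look like (not Cloudflare's actual pipeline; the URLs, filenames, and crawl list below are placeholders): an edge provider that already fronts a site fetches pages from the origin, writes them as gzipped WARC records (the format Common Crawl publishes), and ships the finished segments off for archival storage.

    import requests
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    # Hypothetical list of origin URLs the provider already serves.
    ORIGIN_URLS = ["https://example.com/", "https://example.com/jobs"]

    with open("crawl-segment-00001.warc.gz", "wb") as out:
        writer = WARCWriter(out, gzip=True)
        for url in ORIGIN_URLS:
            resp = requests.get(url, headers={"Accept-Encoding": "identity"}, stream=True)
            http_headers = StatusAndHeaders(
                f"{resp.status_code} {resp.reason}",
                resp.raw.headers.items(),
                protocol="HTTP/1.1",
            )
            record = writer.create_warc_record(url, "response",
                                               payload=resp.raw,
                                               http_headers=http_headers)
            writer.write_record(record)
    # The resulting .warc.gz segment would then be uploaded to Common Crawl's
    # storage rather than every search engine re-crawling the origin separately.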


That suggests that if google switched over to ccbot, the rest would follow.


I mean, if it's created as part of setting the global rules for the internet, you could just make it opt-out.


Wait, is the suggestion here just about crawling and storing the data? That's a very different thing than "Google's search index"... And yeah, I would agree that it is undifferentiated.


If the archived crawls are accessible, anyone can build and serve an index, or train model weights (GPT).
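
A toy sketch of that point (the page text here is a made-up stand-in for text extracted from archived WARC files): once the crawl exists, building a searchable inverted index is the easy, commodity part.

    from collections import defaultdict

    # Stand-in for URL -> extracted text pulled out of archived crawl data.
    pages = {
        "https://example.com/a": "common crawl stores web data as a public good",
        "https://example.com/b": "search engines build an index over crawled data",
    }

    index = defaultdict(set)
    for url, text in pages.items():
        for token in text.lower().split():
            index[token].add(url)

    def search(query):
        """Return URLs containing every query term."""
        results = set(pages)
        for term in query.lower().split():
            results &= index.get(term, set())
        return sorted(results)

    print(search("crawled data"))  # ['https://example.com/b']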


Hosting costs are so minimal today that I don't think crawling is a natural monopoly. How much would it really cost a site to be crawled by 100 search engines?
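
Back-of-envelope, with made-up but plausible numbers (every figure below is an assumption, and real CDN pricing varies widely):

    pages         = 50_000   # pages on the site
    page_size_kb  = 100      # average transfer size per page
    crawlers      = 100      # independent search engines
    crawls_per_mo = 2        # full re-crawls per crawler per month
    egress_per_gb = 0.05     # $/GB CDN egress

    gb_per_month = pages * page_size_kb * crawlers * crawls_per_mo / 1_000_000
    print(f"{gb_per_month:.0f} GB/month -> ~${gb_per_month * egress_per_gb:.0f}/month")
    # prints: 1000 GB/month -> ~$50/month (assuming responses mostly come from CDN cache)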


A potentially shocking amount, depending on the desired freshness, if the bot isn't custom-tailored per site. I worked at a job-posting site, and Googlebot would nearly take down our search infrastructure because it crawled jobs via searching rather than via the index.

Bots are typically tuned to work across generic sites rather than to crawl any one site efficiently.


Where is the cost coming from? Wouldn't a crawler mostly just be accessing cached static assets served by a CDN?

And what do you mean by your search infrastructure? Are you talking about elasticsearch or some equivalent?


No, in our case they were indexing job posts by sending search requests. I.e., instead of pulling down the JSON files of jobs, they would search for them by sending queries like “New York City, New York software engineer” to our search. Generally not cached, because the searches weren't something humans would type (humans would use the location drop-down).

I didn’t work on search, but yeah, something like Elasticsearch. Googlebot was a majority of our search traffic at times.
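
A common mitigation, sketched with hypothetical paths and Python's stdlib parser: keep crawlers on the cheap, cacheable listing/JSON endpoints and off the expensive search endpoint with robots.txt path rules (assuming the bot honors them).

    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt steering bots away from the search endpoint.
    robots_txt = """\
    User-agent: *
    Disallow: /search
    Allow: /jobs/
    """

    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())

    print(rp.can_fetch("Googlebot", "https://example.com/search?q=new+york+software+engineer"))
    # False -- the expensive, uncacheable query path is off limits
    print(rp.can_fetch("Googlebot", "https://example.com/jobs/12345.json"))
    # True  -- the static job documents stay crawlable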


One problem: it leaves one place to censor.

I agree that each front end should do it, but you can bet it will be a core service.


> The Internet Archive can persist the data for ~$2/GB in perpetuity

No they can't, but do you have a source?


https://help.archive.org/help/archive-org-information/ and firsthand conversations with their engineering team

> We estimate that permanent storage costs us approximately $2.00US per gigabyte.

https://webservices.archive.org/pages/vault/

> Vault offers a low-cost pricing model based on a one-time price per-gigabyte/terabyte for data deposited in the system, with no additional annual storage fees or data egress costs.

https://blog.dshr.org/2017/08/economic-model-of-long-term-st...
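
The arithmetic behind that kind of one-time price is an endowment model, roughly in the spirit of the dshr.org post above. A sketch with assumed rates (none of these are the Internet Archive's actual numbers): sum the present value of each future year's storage cost, assuming media costs keep falling and the endowment earns a real return.

    cost_per_gb_year = 0.20   # $/GB/year all-in storage cost today (assumption)
    kryder_rate      = 0.10   # annual decline in storage cost (assumption)
    real_return      = 0.04   # real return earned on the endowment (assumption)
    years            = 100    # "perpetuity", approximated

    endowment = sum(
        cost_per_gb_year * (1 - kryder_rate) ** t / (1 + real_return) ** t
        for t in range(years)
    )
    print(f"required endowment ~= ${endowment:.2f}/GB")  # about $1.49/GB with these inputs

With slightly less optimistic inputs the figure lands in the ~$2/GB range, consistent with the quote above.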


What's the read throughput to get the data back out, and does it scale to what you'd need for N search indexes building on top of this shared crawl?


They could charge data-processing costs for reads.



