
https://commoncrawl.org/

This is similar to the natural monopoly of the root DNS servers (managed as a public good). There is no reason more money couldn't go into Common Crawl, or something like it. The Internet Archive can persist the data for ~$2/GB in perpetuity as the storage system of last resort (although storing it elsewhere is also fine imho). How you provide access to this data is, I argue, similar to how custodian institutions (NOAA, CERN, etc.) provide access to science datasets.

Build foundations on public goods, very broadly speaking (think OSI model, but for entire systems). This helps society avoid the grasp of Big Tech and their endless desire to build moats for value capture.



The problem with this is in the vein of `Requires immediate total cooperation from everybody at once` if it's going to replace googlebot: everyone who currently only allows googlebot would need to change their robots.txt and allow ccbot instead.

It's already the case that googlebot is the common denominator bot that's allowed everywhere, ccbot not so much.


Wouldn't a decent solution, if Google ever ended up divesting the crawler, be to do what browser user agents have always done (in that case multiple times, to comical degrees)? Something like 'Googlebot/3.1 (successor, CommonCrawl 1.0)'
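
A rough way to see why the "successor token" trick matters (a sketch with an assumed robots.txt and made-up user-agent strings, using Python's stdlib parser): robots.txt groups are matched on the product token before the "/", so a crawler that keeps "Googlebot" in its name inherits Googlebot's allowances, while a plain CCBot falls through to the catch-all rule.

    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt of a site that only lets Googlebot in.
    robots_txt = """\
    User-agent: Googlebot
    Allow: /

    User-agent: *
    Disallow: /
    """

    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())

    for ua in ("Googlebot/3.1 (successor, CommonCrawl 1.0)", "CCBot/2.0"):
        print(ua, "->", rp.can_fetch(ua, "https://example.com/page"))
    # Googlebot/3.1 (successor, CommonCrawl 1.0) -> True
    # CCBot/2.0 -> False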


Lots of good replies to your comment already. I'd also offer up Cloudflare giving customers the option to have their origins crawled, with Cloudflare shipping the compressed archives off to Common Crawl for storage. This gives site admins and owners control over the crawling, and it reduces unnecessary load, since someone like Cloudflare can manage the crawler worker queue and network shipping internally.

(Cloudflare customer, no other affiliation)
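
As a rough sketch of what that could look like (not Cloudflare's actual pipeline; the URLs, filenames, and crawl list below are placeholders): an edge provider that already fronts a site fetches pages from the origin, writes them as gzipped WARC records (the format Common Crawl publishes), and ships the finished segments off for archival storage.

    import requests
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    # Hypothetical list of origin URLs the provider already serves.
    ORIGIN_URLS = ["https://example.com/", "https://example.com/jobs"]

    with open("crawl-segment-00001.warc.gz", "wb") as out:
        writer = WARCWriter(out, gzip=True)
        for url in ORIGIN_URLS:
            resp = requests.get(url, headers={"Accept-Encoding": "identity"}, stream=True)
            http_headers = StatusAndHeaders(
                f"{resp.status_code} {resp.reason}",
                resp.raw.headers.items(),
                protocol="HTTP/1.1",
            )
            record = writer.create_warc_record(url, "response",
                                               payload=resp.raw,
                                               http_headers=http_headers)
            writer.write_record(record)
    # The resulting .warc.gz segment would then be uploaded to Common Crawl's
    # storage rather than every search engine re-crawling the origin separately.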


That suggests that if google switched over to ccbot, the rest would follow.


I mean, if it's created as part of setting the global rules for the internet, you could just make it opt-out.


Wait, is the suggestion here just about crawling and storing the data? That's a very different thing than "Google's search index"... And yeah, I would agree that it is undifferentiated.


If the archived crawls are accessible, anyone can build and serve an index, or train model weights (GPT).
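
A toy sketch of that point (the page text here is a made-up stand-in for text extracted from archived WARC files): once the crawl exists, building a searchable inverted index is the easy, commodity part.

    from collections import defaultdict

    # Stand-in for URL -> extracted text pulled out of archived crawl data.
    pages = {
        "https://example.com/a": "common crawl stores web data as a public good",
        "https://example.com/b": "search engines build an index over crawled data",
    }

    index = defaultdict(set)
    for url, text in pages.items():
        for token in text.lower().split():
            index[token].add(url)

    def search(query):
        """Return URLs containing every query term."""
        results = set(pages)
        for term in query.lower().split():
            results &= index.get(term, set())
        return sorted(results)

    print(search("crawled data"))  # ['https://example.com/b']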


Hosting costs are so minimal today that I don't think crawling is a natural monopoly. How much would it really cost a site to be crawled by 100 search engines?
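
Back-of-envelope, with made-up but plausible numbers (every figure below is an assumption, and real CDN pricing varies widely):

    pages         = 50_000   # pages on the site
    page_size_kb  = 100      # average transfer size per page
    crawlers      = 100      # independent search engines
    crawls_per_mo = 2        # full re-crawls per crawler per month
    egress_per_gb = 0.05     # $/GB CDN egress

    gb_per_month = pages * page_size_kb * crawlers * crawls_per_mo / 1_000_000
    print(f"{gb_per_month:.0f} GB/month -> ~${gb_per_month * egress_per_gb:.0f}/month")
    # prints: 1000 GB/month -> ~$50/month (assuming responses mostly come from CDN cache)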


A potentially shocking amount, depending on the desired freshness, if the bot isn't custom-tailored per site. I worked at a job-posting site, and Googlebot would nearly take down our search infrastructure because it crawled jobs via searching rather than via the index.

Bots are typically tuned to work across generic sites rather than to crawl any one site efficiently.


Where is the cost coming from? Wouldn't a crawler mostly just be accessing cached static assets served by a CDN?

And what do you mean by your search infrastructure? Are you talking about elasticsearch or some equivalent?


No, in our case they were indexing job posts by sending search requests. I.e., instead of pulling down the JSON files of jobs, they would search for them by sending queries like “New York City, New York software engineer” to our search. Generally not cached, because the searches weren't something humans would type (humans would use the location drop-down).

I didn’t work on search, but yeah, something like Elasticsearch. Googlebot was a majority of our search traffic at times.
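
A common mitigation, sketched with hypothetical paths and Python's stdlib parser: keep crawlers on the cheap, cacheable listing/JSON endpoints and off the expensive search endpoint with robots.txt path rules (assuming the bot honors them).

    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt steering bots away from the search endpoint.
    robots_txt = """\
    User-agent: *
    Disallow: /search
    Allow: /jobs/
    """

    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())

    print(rp.can_fetch("Googlebot", "https://example.com/search?q=new+york+software+engineer"))
    # False -- the expensive, uncacheable query path is off limits
    print(rp.can_fetch("Googlebot", "https://example.com/jobs/12345.json"))
    # True  -- the static job documents stay crawlable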


One problem: it leaves one place to censor.

I agree that each front end should do it, but you can bet it will be a core service.


> The Internet Archive can persist the data for ~$2/GB in perpetuity

No they can't, but do you have a source?


https://help.archive.org/help/archive-org-information/ and firsthand conversations with their engineering team

> We estimate that permanent storage costs us approximately $2.00US per gigabyte.

https://webservices.archive.org/pages/vault/

> Vault offers a low-cost pricing model based on a one-time price per-gigabyte/terabyte for data deposited in the system, with no additional annual storage fees or data egress costs.

https://blog.dshr.org/2017/08/economic-model-of-long-term-st...
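
The arithmetic behind that kind of one-time price is an endowment model, roughly in the spirit of the dshr.org post above. A sketch with assumed rates (none of these are the Internet Archive's actual numbers): sum the present value of each future year's storage cost, assuming media costs keep falling and the endowment earns a real return.

    cost_per_gb_year = 0.20   # $/GB/year all-in storage cost today (assumption)
    kryder_rate      = 0.10   # annual decline in storage cost (assumption)
    real_return      = 0.04   # real return earned on the endowment (assumption)
    years            = 100    # "perpetuity", approximated

    endowment = sum(
        cost_per_gb_year * (1 - kryder_rate) ** t / (1 + real_return) ** t
        for t in range(years)
    )
    print(f"required endowment ~= ${endowment:.2f}/GB")  # about $1.49/GB with these inputs

With slightly less optimistic inputs the figure lands in the ~$2/GB range, consistent with the quote above.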


What's the read throughput to get the data back out, and does it scale to what you'd need for N search indexes building on top of this shared crawl?


They could charge data-processing costs for reads.



