
Why is https://commoncrawl.org/ not enough?


Isn't that more a data set for ML and other research purposes than a highly up-to-date search index (for example, with news from a few minutes ago)?


Roughly speaking, yep - Common Crawl provides a sizable chunk of web data (420 TiB uncompressed, over 3 billion unique URLs, as of May 2022; historic statistics here[1]) and is updated on a monthly basis. Not near-real-time, true, but relatively fresh.

A question to ask could be: how often do users care about information from a few minutes ago, compared to information that has been available for a longer duration of time?

[1] - https://commoncrawl.github.io/cc-crawl-statistics/
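For anyone curious what "using Common Crawl" looks like in practice, here's a rough sketch of querying its public CDX index for captures of a URL. The crawl label (CC-MAIN-2022-21) is just an illustrative pick for one of the monthly crawls; swap in whichever crawl you're interested in.

```python
# Minimal sketch: look up index records for a URL in one monthly Common Crawl.
# The crawl label below is an illustrative assumption, not a recommendation.
import json
import requests

CDX_API = "https://index.commoncrawl.org/CC-MAIN-2022-21-index"

def lookup_captures(url: str, limit: int = 5):
    """Return up to `limit` index records for `url` from one monthly crawl."""
    resp = requests.get(
        CDX_API,
        params={"url": url, "output": "json", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    # The index server returns one JSON object per line.
    return [json.loads(line) for line in resp.text.splitlines()]

if __name__ == "__main__":
    for record in lookup_captures("commoncrawl.org"):
        print(record["timestamp"], record["url"])
```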


Isn't that more a question of adding to the mix frequent scraping (roughly sketched below) of

- a few thousand news-sites (like nyt.com, bbc.co.uk),

- a few thousand very popular blogs (based on which influencers people search for),

- a handful of social media sites (e.g. Twitter),

- a few hundred databases in areas like weather, airlines, sports (like ATP for people who look for Wimbledon results today)?
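Something like the sketch below is what I have in mind: keep the bulk index on its monthly cadence and layer a small priority queue of fast-moving sources on top, each with its own re-fetch interval. The source list and intervals here are made-up assumptions for illustration only.

```python
# Minimal sketch of layering frequent re-scraping of a few fast-moving sources
# on top of a slower bulk crawl. Sources and intervals are illustrative.
from dataclasses import dataclass, field
import heapq
import time

@dataclass(order=True)
class ScheduledFetch:
    next_fetch: float                       # ordering uses only the due time
    url: str = field(compare=False)
    interval_s: int = field(compare=False)

# Hypothetical tiers: news sites every few minutes, popular blogs hourly;
# everything else is left to the monthly bulk crawl.
SOURCES = [
    ("https://www.nytimes.com/", 300),
    ("https://www.bbc.co.uk/news", 300),
    ("https://example-popular-blog.com/", 3600),
]

def run_scheduler(fetch, max_fetches=6, now=time.time):
    """Fetch each source at its own interval, most-overdue first."""
    queue = [ScheduledFetch(now(), url, interval) for url, interval in SOURCES]
    heapq.heapify(queue)
    for _ in range(max_fetches):
        job = heapq.heappop(queue)
        wait = job.next_fetch - now()
        if wait > 0:
            time.sleep(wait)
        fetch(job.url)                      # hand the page to the indexing pipeline
        job.next_fetch = now() + job.interval_s
        heapq.heappush(queue, job)

if __name__ == "__main__":
    # One pass over the source list; a real crawler would loop indefinitely.
    run_scheduler(lambda url: print("fetching", url), max_fetches=len(SOURCES))
```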


I mean, any time someone wants information on current or recent events is your use case right there. If you excluded news entirely, you could maybe disregard recent websites, but I imagine news is statistically a pretty large portion of search.



