Isn't that one more of a data set for ML and other research purposes than a highly up-to-date search index (for example, with news from a few minutes ago)?
Roughly speaking, yep - Common Crawl provides a sizable chunk of web data (420 TiB uncompressed, over 3 billion unique URLs, as of May 2022; historic statistics here[1]), and is updated on a monthly basis. Not near-real-time, true, but relatively fresh.
A question to ask could be: how often do users care about information from a few minutes ago, compared to information that has been available for a longer duration of time?
I mean, any time someone wants information on current or recent events is your use case right there. If you exclude news entirely, you could maybe disregard recent websites, but I imagine news is statistically a pretty large portion of search.