
Yes, if we take the filtered and deduplicated HTML pages of CommonCrawl. I recently made a video on this topic: https://www.youtube.com/watch?v=8yH3rY1fZEA


Fun presentation, thanks! A 72 min ingestion time for ~81 TB of data works out to ~1.1 TB/min, or ~19 GB/s. Was this distributed or single-node? How many shards? I see 50 jobs were used for parallel ingestion, and I wonder how ~19 GB/s was achieved, since ingestion rates were far below that figure the last time I played around with ClickHouse performance. Granted, that was some years ago.


Distributed across 20 replicas.
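
A rough sketch of what that split might look like (purely illustrative: the replica host names, table name, file paths, and JSONEachRow format below are my assumptions; only the 50-job / 20-replica figures come from this thread):

    # Hypothetical sketch: 50 ingestion jobs spread round-robin over 20 replicas.
    # Host names, table name, and file layout are assumptions, not from the talk.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    REPLICAS = [f"ch-replica-{i:02d}.example.internal" for i in range(20)]
    N_JOBS = 50

    def ingest(job_id: int) -> int:
        # Round-robin assignment of jobs to replicas.
        host = REPLICAS[job_id % len(REPLICAS)]
        # Each job streams its own pre-filtered CommonCrawl slice into the table.
        cmd = (
            f"zstd -dc /data/cc_slice_{job_id:03d}.jsonl.zst | "
            f"clickhouse-client --host {host} "
            f"--query 'INSERT INTO commoncrawl.html FORMAT JSONEachRow'"
        )
        return subprocess.run(cmd, shell=True, check=False).returncode

    with ThreadPoolExecutor(max_workers=N_JOBS) as pool:
        codes = list(pool.map(ingest, range(N_JOBS)))

    print(f"{codes.count(0)}/{N_JOBS} jobs finished cleanly")

With each replica taking 2-3 concurrent insert streams, ~19 GB/s aggregate only needs roughly 1 GB/s per node, which is a lot more plausible than pushing that rate through a single server.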



