What's the reason for using bz2 here? Wouldn't it be faster to do a one-off conversion to zstd? As far as I know it beats bzip2 on every metric at higher compression levels.
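Something like this is all I have in mind, for what it's worth (a rough sketch using the bzip2 and zstd crates; the paths and compression level are placeholders):

```rust
use std::fs::File;
use std::io::BufReader;

use bzip2::read::MultiBzDecoder;

fn main() -> std::io::Result<()> {
    // Hypothetical input/output paths for one archive shard.
    let input = BufReader::new(File::open("shard.warc.bz2")?);
    let output = File::create("shard.warc.zst")?;

    // Stream-decode the bzip2 data (MultiBzDecoder also handles multi-stream
    // files such as pbzip2 output) and re-encode it as zstd on the fly, so
    // nothing needs to be held in memory.
    let decoder = MultiBzDecoder::new(input);

    // A high level like 19 spends CPU once, in exchange for much faster
    // decompression every time the data is read afterwards.
    zstd::stream::copy_encode(decoder, output, 19)?;
    Ok(())
}
```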
bzip2 (particularly parallel implementations thereof) is already relatively competitive for compression. Decompression time is where it lags behind, because LZ77-based algorithms like zstd can be incredibly fast at decompression.
Not that hard: it took me 4 weekends to build a private search engine with Common Crawl, Wikipedia, and HN as a link authority source. It takes about a week to crunch the data on an old Lenovo workstation with 256 GB of RAM and some storage.
Pretty sure Tier 4 should be faster than that; I wonder if the CPU was fully utilized in this benchmark. I did some performance work with Axum a while back and was bitten by Nagle's algorithm: setting TCP_NODELAY pushed the benchmark from 90,000 req/s to 700,000 req/s in a VM on my laptop.
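For anyone hitting the same thing, this is roughly what the fix looked like on axum 0.6 / hyper 0.14, where the server builder exposes tcp_nodelay (a sketch; the route and port are placeholders, and on axum 0.7 you'd instead call set_nodelay on each socket accepted from a tokio TcpListener):

```rust
use axum::{routing::get, Router};
use std::net::SocketAddr;

#[tokio::main]
async fn main() {
    // Placeholder route, just enough to benchmark against.
    let app = Router::new().route("/", get(|| async { "hello" }));
    let addr = SocketAddr::from(([0, 0, 0, 0], 3000));

    // Disable Nagle's algorithm on accepted connections so small responses
    // go out immediately instead of waiting to be coalesced.
    axum::Server::bind(&addr)
        .tcp_nodelay(true)
        .serve(app.into_make_service())
        .await
        .unwrap();
}
```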
Yeah, let's trash the 50-year-old industry standard and obfuscate the interface to one of the most performance-sensitive parts of an application. Hell, let's build multiple obfuscators for each language, each with its own astonishing behavior that turns pathological once you actually get usage on the app.
I'm not sure about enterprise use cases, but for my project I was able to process 60 TB of Common Crawl data with Rust at line rate on a budget server without any issues. The code was easy enough to write, and it ran great on the second try with resources to spare. I would pick Rust for big data projects again.
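The core of it is just streaming decompression, nothing clever. Roughly this shape, assuming the flate2 crate and a locally downloaded WARC segment (the path is a placeholder):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

use flate2::read::MultiGzDecoder;

fn main() -> std::io::Result<()> {
    // Hypothetical local path to one downloaded Common Crawl segment.
    let file = File::open("CC-MAIN-example.warc.gz")?;
    let mut reader = BufReader::new(MultiGzDecoder::new(file));

    // Scan line by line at the byte level (payloads aren't always UTF-8)
    // and count WARC record headers, never buffering the whole file.
    let mut buf = Vec::new();
    let mut records = 0usize;
    while reader.read_until(b'\n', &mut buf)? != 0 {
        if buf.starts_with(b"WARC/1.") {
            records += 1;
        }
        buf.clear();
    }
    println!("records: {records}");
    Ok(())
}
```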