
I use this crate to process 100s of TB of Common Crawl data, so I appreciate the speedups.


What's the reason for using bz2 here? Wouldn't it be faster to do a one-off conversion to zstd? As far as I know, it beats bzip2 in every metric at higher compression levels.


Common Crawl delivers the data as bz2. Indeed I store intermediate data in zstd with ZFS.


That assumes you're processing the data more than once.


Is this data available as torrents?


Yeah, came here to say a 14% speedup in compression is pretty good!


bzip2 (particularly its parallel implementations) is already relatively competitive for compression. Decompression time is where it lags behind, because LZ77-based algorithms can be incredibly fast at decompression.
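If anyone wants to check the decompression gap on their own machine, here is a rough sketch using only the Python standard library. It uses zlib (DEFLATE, which is LZ77-based) as a stand-in for the LZ77 family, since zstd bindings aren't assumed to be installed. The synthetic input and the `time_decompress` helper are my own assumptions for illustration, not anything from the crate or from Common Crawl, and the numbers will vary with data and hardware.

```python
import bz2
import time
import zlib


def time_decompress(decompress, blob, repeats=5):
    """Return (best wall-clock seconds, decompressed bytes) over `repeats` runs."""
    best = float("inf")
    out = b""
    for _ in range(repeats):
        start = time.perf_counter()
        out = decompress(blob)
        best = min(best, time.perf_counter() - start)
    return best, out


# Synthetic, repetitive markup standing in for crawl data (an assumption;
# real data will compress and decompress differently).
data = b"".join(b"<p>row %d of some crawled page</p>\n" % i for i in range(50_000))

bz_blob = bz2.compress(data, compresslevel=9)
z_blob = zlib.compress(data, level=9)

bz_secs, bz_out = time_decompress(bz2.decompress, bz_blob)
z_secs, z_out = time_decompress(zlib.decompress, z_blob)

print(f"bz2:  {len(bz_blob):>9} bytes, decompress {bz_secs * 1e3:.1f} ms")
print(f"zlib: {len(z_blob):>9} bytes, decompress {z_secs * 1e3:.1f} ms")
```

This only times single-threaded, in-memory decompression of one blob, so treat it as a quick sanity check rather than a benchmark.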


It's blazingly fast



