rwaksmunski's comments | Hacker News

I use this crate to process 100s of TB of Common Crawl data; I appreciate the speedups.


What's the reason for using bz2 here? Wouldn't it be faster to do a one-off conversion to zstd? It beats bzip2 in every metric at higher compression levels, as far as I know.


Common Crawl delivers the data as bz2. Indeed I store intermediate data in zstd with ZFS.
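
For anyone curious what that kind of one-off re-compression might look like in Rust, here's a rough sketch assuming the bzip2 and zstd crates; the file names and compression level are illustrative, not my actual pipeline:

    // Sketch: one-off .bz2 -> .zst re-compression.
    use std::fs::File;
    use std::io::BufReader;

    fn main() -> std::io::Result<()> {
        let input = BufReader::new(File::open("segment.warc.bz2")?);
        let output = File::create("segment.warc.zst")?;

        // MultiBzDecoder handles multi-stream .bz2 files; plain BzDecoder
        // would stop after the first stream.
        let decoder = bzip2::read::MultiBzDecoder::new(input);

        // A high level (e.g. 19) trades one-off compression time for smaller,
        // fast-to-decompress output.
        zstd::stream::copy_encode(decoder, output, 19)?;
        Ok(())
    }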


That assumes you're processing the data more than once.


Is this data available as torrents?


Yeah, came here to say a 14% speedup in compression is pretty good!


bzip2 (particularly parallel implementations thereof) is already relatively competitive for compression. Decompression time is where it lags behind, because LZ77-based algorithms can be incredibly fast at decompression.


It's blazingly fast


We've always been at war with Eurasia.


Two 30s unskippable video ads during POST


How can you get anything done without keyboard backpressure?


Be sure to read the Appendix; Rust's state-machine async implementation is indeed very efficient.
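
For readers who haven't seen it spelled out: the compiler lowers an async fn into an enum-like state machine with a variant per await point. A hand-rolled toy equivalent (mine, not from the article) looks roughly like this:

    // Toy illustration: roughly what a trivial async block desugars to --
    // an enum implementing Future, advanced by poll().
    use std::future::Future;
    use std::pin::Pin;
    use std::task::{Context, Poll};

    enum AddTwo {
        Start,
        Done,
    }

    impl Future for AddTwo {
        type Output = u32;
        fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u32> {
            // No self-references here, so AddTwo is Unpin and &mut access is fine.
            let this = self.get_mut();
            match *this {
                AddTwo::Start => {
                    *this = AddTwo::Done;
                    Poll::Ready(1 + 1)
                }
                AddTwo::Done => panic!("polled after completion"),
            }
        }
    }

    #[tokio::main]
    async fn main() {
        assert_eq!(AddTwo::Start.await, 2);
    }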


Not that hard; it took me 4 weekends to build a private search engine with Common Crawl, Wikipedia, and HN as a link authority source. It takes about a week to crunch the data on an old Lenovo workstation with 256 GB of RAM and some storage.


Pretty sure Tier 4 should be faster than that. I wonder if the CPU was fully utilized in this benchmark. I did some performance work with Axum a while back and was bitten by Nagle's algorithm. Setting TCP_NODELAY pushed the benchmark from 90,000 req/s to 700,000 req/s in a VM on my laptop.
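
If anyone wants to try the same fix, here's a minimal sketch using the axum 0.6 / hyper 0.14 server builder (the knob moved in later versions, so treat this as illustrative rather than the current API; route and address are placeholders):

    // Sketch: disabling Nagle's algorithm on an axum 0.6 / hyper 0.14 server.
    use axum::{routing::get, Router};
    use std::net::SocketAddr;

    #[tokio::main]
    async fn main() {
        let app = Router::new().route("/", get(|| async { "hello" }));
        let addr = SocketAddr::from(([127, 0, 0, 1], 3000));

        axum::Server::bind(&addr)
            // Send small responses immediately instead of letting Nagle's
            // algorithm hold them back while waiting for ACKs.
            .tcp_nodelay(true)
            .serve(app.into_make_service())
            .await
            .unwrap();
    }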


Yeah, let's trash the 50-year-old industry standard and obfuscate the interface to one of the most performance-sensitive parts of an application. Hell, let's build multiple obfuscators for each language, each with its own astonishing behavior that turns pathological once you actually get usage on the app.


You know what's also a 50-year-old industry standard? Assembler. Yet nobody writes it any more.


I'm not sure about enterprise use cases, but for my project I was able to process 60 TB of Common Crawl data with Rust at line rate on a budget server without any issues. The code was easy enough to write, and it ran great on the second try with resources to spare. I would pick Rust for big data projects again.


Just let it fade away with dignity.

