
If the format is splittable you can generally get similar benefits, and Parquet files carry metadata that points a given reader at a specific chunk of the file that can be read independently. With Parquet, the writer decides when to finish writing a block (row group), so manually creating files smaller than that can increase parallelism. But you can only go so far: I'm fairly sure I've seen Spark combine very small files into a single-threaded read task.
