
I guess it depends on what we mean by "throwing hardware at it."

GPUs aren't magic. You still need to come up with a parallelizable algorithm.

The TL;DR is that the fastest solutions are basically map/reduce with a bunch of microoptimizations for parsing each line.

But before you do that, you need to divide up the work. You can't just hand each core a `file_size_bytes / core_count`-byte chunk of the file, because those chunk boundaries won't align with line breaks. So you need to be a bit clever about that part.
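One common trick is to seek to the rough boundary and then skip forward to the next newline. A rough Python sketch (the function name and shape are mine, not from any particular solution):

```python
import os

def chunk_offsets(path, n_chunks):
    """Split a file into ~equal byte ranges whose boundaries land on newlines."""
    size = os.path.getsize(path)
    approx = size // n_chunks
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(i * approx)   # jump to the rough boundary...
            f.readline()         # ...then skip to just past the next '\n'
            offsets.append(f.tell())
    offsets.append(size)
    # worker i gets bytes [offsets[i], offsets[i+1]) -- always whole lines
    return list(zip(offsets, offsets[1:]))
```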

Once you've done that, you have a nice map/reduce that should scale linearly up to at least 20 or 30 cores. So in that sense, you can "throw hardware at it."
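To be concrete about the map/reduce shape, here's a pure-Python sketch of the idea, assuming the challenge's `station;measurement` line format and ignoring all the parsing micro-optimizations:

```python
from multiprocessing import Pool

def map_chunk(args):
    """Map: parse one byte range into {station: [min, max, sum, count]}."""
    path, start, end = args
    stats = {}
    with open(path, "rb") as f:
        f.seek(start)
        for line in f.read(end - start).splitlines():
            station, temp = line.split(b";")
            t = float(temp)
            s = stats.get(station)
            if s is None:
                stats[station] = [t, t, t, 1]
            else:
                if t < s[0]: s[0] = t
                if t > s[1]: s[1] = t
                s[2] += t
                s[3] += 1
    return stats

def reduce_stats(parts):
    """Reduce: merge the per-chunk dicts into one."""
    total = {}
    for part in parts:
        for station, (mn, mx, sm, ct) in part.items():
            s = total.setdefault(station, [mn, mx, 0.0, 0])
            if mn < s[0]: s[0] = mn
            if mx > s[1]: s[1] = mx
            s[2] += sm
            s[3] += ct
    return total

# usage, with chunk_offsets() from the sketch above:
# with Pool() as pool:
#     parts = pool.map(map_chunk, [("measurements.txt", a, b)
#                                  for a, b in chunk_offsets("measurements.txt", 32)])
# results = reduce_stats(parts)
```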

Whether or not any of that is a good fit for GPU acceleration, I don't know.

You should try the challenge. It's trickier than you think but surprisingly fun.




Indeed!

You may enjoy this talk where I do just that... end-to-end on GPUs, and < 100loc Python: https://www.youtube.com/watch?v=8ZMzsTbfImU

Your intuition about mapping to kernels is good. Basically all SQL, Polars, DuckDB, Pandas, etc. operators map pretty directly to optimized GPU operators nowadays, including GPU-accelerated CSV/Parquet parsing. This was theoretically true starting maybe 10 years ago, and implemented in practice in the last 3-5 years. These systems allow escape hatches via Numba JIT etc. for custom kernels, but it's generally better to stay in the pure SQL/pandas/etc. subsets, which are already mapped to more carefully optimized kernels.
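For a feel of what "staying in the pandas subset" looks like on a GPU, here's a minimal sketch using RAPIDS cuDF (the file name, column names, and dtypes are my assumptions, not from the talk):

```python
import cudf  # RAPIDS; pandas-like API backed by GPU kernels

# GPU-accelerated CSV parsing straight into device memory
df = cudf.read_csv(
    "measurements.txt",
    sep=";",
    names=["station", "temp"],
    dtype={"station": "str", "temp": "float32"},
)

# A plain pandas-style groupby/agg maps directly onto optimized GPU operators
out = df.groupby("station")["temp"].agg(["min", "mean", "max"])
print(out.sort_index())
```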

To get a feel for the timings, I like to think in two classes: constant overheads and bandwidth limits

Constant overhead:

- JIT'ing. By using pure SQL/pandas/etc, you can avoid most CUDA JIT costs

- GPU context creation etc.: similar; after startup, the initial memory pool is allocated once and then reused

- Instruction passing: the pandas API is 'eager', so `df1 + df2` may involve a lot of back-and-forth of instructions between CPU and GPU even when the data doesn't move. Dask & Polars introduce lazy semantics that allow fusion, but GPU implementations haven't leveraged that yet AFAICT.
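To make the eager-vs-lazy point concrete, a small CPU-side contrast (whether a GPU backend actually fuses the lazy plan is exactly the open question above):

```python
import pandas as pd
import polars as pl

# Eager (pandas-style): each expression is evaluated as it's hit, so a
# GPU-backed drop-in can pay an instruction round-trip per operator.
pdf = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
eager_result = (pdf["a"] * 2 + pdf["b"]).sum()

# Lazy (Polars): the whole query is built as a plan first, giving the engine
# a chance to fuse the multiply/add/sum before anything runs.
lazy_result = (
    pl.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
    .lazy()
    .select((pl.col("a") * 2 + pl.col("b")).sum())
    .collect()
)
```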

Bandwidth limits:

- SSD is the biggest killer. Even "expensive" SSDs are still < 10 GB/s, so you need to chain a bunch of them to get anywhere near 100 GB/s ingest

- CPU pathways throttle things down again (latency+bandwidth): GDS/GDN lets you skip them

- PCIe cards are surprisingly fast nowadays. With PCIe 5+, the bottleneck quickly gets pushed back to storage, and for most workloads it's probably easier to buy more PCIe+GPU pairs than to make any individual one go faster

- Once things hit the GPU, things are fast :)
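A back-of-envelope way to see where the wall sits across those stages (the numbers are illustrative guesses, not measurements; the ~13 GB file size assumes the billion-row challenge input):

```python
# Sustained throughput is capped by the slowest stage in the pipeline.
stages_gb_per_s = {
    "nvme_ssd": 7,     # a fast consumer NVMe drive
    "pcie5_x16": 63,   # roughly the PCIe 5.0 x16 ceiling
    "gpu_parse": 100,  # order-of-magnitude GPU CSV-parsing throughput
}
file_gb = 13  # ~1 billion "station;temp" rows

bottleneck = min(stages_gb_per_s, key=stages_gb_per_s.get)
best_case_s = file_gb / stages_gb_per_s[bottleneck]
print(f"bottleneck: {bottleneck}, best-case wall time ~{best_case_s:.1f}s")
```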

4s is a LOT of time relative to what even commodity GPU hardware can do, so benchmarks where software fails to saturate it are fascinating to diagnose


Wow! Super informative, thanks!!

I also apologize. As you can probably tell, I lumped you in with all the folks who were being super glib about easy hardware gains!



