I guess it depends on what we mean by "throwing hardware at it."
GPUs aren't magic. You still need to come up with a parallelizable algorithm.
The TL;DR is that the fastest solutions are basically map/reduce with a bunch of microoptimizations for parsing each line.
But before you do that, you need to divide up the work. You can't just hand each core a `file_size_bytes/core_count`-byte chunk of the file, because those chunks won't align with the line breaks. So you need to be clever about that part somehow.
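Here's a minimal sketch of that splitting step (in Python; the helper name and chunk count are my own, and it assumes ordinary newline-terminated text): pick rough byte offsets, then nudge each one forward to the next line break.

```python
import os

def chunk_offsets(path: str, n_chunks: int) -> list[tuple[int, int]]:
    """Split a file into (start, end) byte ranges aligned to line breaks."""
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(i * size // n_chunks)  # rough split point, probably mid-line
            f.readline()                  # skip ahead to the next '\n'
            offsets.append(f.tell())      # boundary now sits on a line break
    offsets.append(size)
    # Drop degenerate ranges that can appear if lines are longer than a chunk
    return [(a, b) for a, b in zip(offsets, offsets[1:]) if b > a]
```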
Once you've done that, you have a nice map/reduce that should scale linearly up to at least 20 or 30 cores. So in that sense, you can "throw hardware at it."
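And a rough sketch of the map/reduce itself, reusing the `chunk_offsets` helper from the sketch above (assume both live in the same script). This assumes the usual `station;temperature` line format and per-station min/mean/max aggregation; the file name and function names are illustrative.

```python
import os
from multiprocessing import Pool

def map_chunk(args):
    """Map step: fold one byte range into {station: (min, max, sum, count)}."""
    path, start, end = args
    stats = {}
    with open(path, "rb") as f:
        f.seek(start)
        for line in f.read(end - start).splitlines():
            station, _, value = line.partition(b";")
            t = float(value)
            mn, mx, s, c = stats.get(station, (t, t, 0.0, 0))
            stats[station] = (min(mn, t), max(mx, t), s + t, c + 1)
    return stats

def reduce_stats(partials):
    """Reduce step: merge the per-chunk dictionaries."""
    merged = {}
    for part in partials:
        for station, (mn, mx, s, c) in part.items():
            if station in merged:
                m0, x0, s0, c0 = merged[station]
                merged[station] = (min(m0, mn), max(x0, mx), s0 + s, c0 + c)
            else:
                merged[station] = (mn, mx, s, c)
    return merged

if __name__ == "__main__":
    path = "measurements.txt"  # hypothetical input file
    ranges = chunk_offsets(path, os.cpu_count() or 1)  # helper from the sketch above
    with Pool() as pool:
        partials = pool.map(map_chunk, [(path, a, b) for a, b in ranges])
    for station, (mn, mx, s, c) in sorted(reduce_stats(partials).items()):
        print(f"{station.decode()}: {mn:.1f}/{s / c:.1f}/{mx:.1f}")
```

The per-chunk dictionaries are tiny compared to the input, so the reduce step is essentially free, which is why this kind of thing scales close to linearly with core count.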
Whether or not any of that is a good fit for GPU acceleration, I don't know.
You should try the challenge. It's trickier than you think but surprisingly fun.
Your intuition about mapping to kernels is good. Basically all SQL, Polars, DuckDB, Pandas, etc. operators are pretty directly mappable to optimized GPU operators nowadays, including GPU-accelerated CSV/Parquet parsing. This was theoretically true starting maybe 10 years ago, and implemented in practice about 3-5 years ago. These systems allow escape hatches via Numba JIT etc. for custom kernels, but it's better to stay within the pure SQL/pandas/etc. subsets, which are already mapped to more carefully optimized kernels.
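For instance, something like this stays entirely inside the pandas-style subset, so both the CSV parse and the groupby run on prebuilt GPU operators (a sketch assuming RAPIDS cuDF is installed and a `station;temperature` layout; the file and column names are mine):

```python
import cudf  # RAPIDS cuDF: a pandas-like API backed by GPU kernels

# GPU-accelerated CSV parse, then a groupby that maps straight onto
# prebuilt GPU aggregation operators -- no custom kernels involved.
df = cudf.read_csv(
    "measurements.txt",            # hypothetical input file
    sep=";",
    header=None,
    names=["station", "temp"],
    dtype={"station": "str", "temp": "float32"},
)
out = df.groupby("station")["temp"].agg(["min", "mean", "max"])
print(out.sort_index().head())
```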
To get a feel for the timings, I like to think in terms of two classes: constant overheads and bandwidth limits.
Constant overhead:
- JIT'ing. By using pure SQL/pandas/etc, you can avoid most CUDA JIT costs
- GPU context creation etc.: similar; once startup is done and the initial memory pool is allocated, it gets reused
- Instruction passing: The pandas API is 'eager', so an expression like `df1 + df2` may involve a lot of back-and-forth of instructions between the CPU and GPU even if the data doesn't move. Dask & Polars introduce lazy semantics that allow fusion (see the sketch after this list), but GPU implementations haven't leveraged that yet AFAICT.
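To illustrate the eager-vs-lazy distinction, here's a CPU-side Polars sketch (file and column names are hypothetical): nothing executes until `collect()`, so the optimizer can fuse the scan and the aggregation into one pass instead of round-tripping instruction by instruction.

```python
import polars as pl

# Lazy: this only builds a query plan; the optimizer can fuse the scan
# and the aggregation into one pass when .collect() is finally called.
lazy = (
    pl.scan_csv("measurements.txt", separator=";", has_header=False,
                new_columns=["station", "temp"])
      .group_by("station")
      .agg(
          pl.col("temp").min().alias("min"),
          pl.col("temp").mean().alias("mean"),
          pl.col("temp").max().alias("max"),
      )
)
print(lazy.collect())
```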
Bandwidth limits:
- SSD is the biggest killer. Even "expensive" SSDs are still < 10 GB/s, so you need to chain a bunch of them to get 100 GB/s ingest
- CPU pathways throttle things down again (latency + bandwidth): GDS/GDN let you skip them
- PCIe cards are surprisingly fast nowadays. With PCIe 5+, the bottleneck quickly gets pushed back to the storage, and for most workloads it's probably easier to buy more PCIe+GPU pairs than to make any individual one faster
- Once data hits the GPU, things are fast :)
4s is a LOT of time wrt what even commodity GPU hardware can do, so benchmarks showing software failing to saturate it are fascinating to diagnose.
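For a rough sense of those numbers (back-of-envelope only; I'm assuming a ~13 GB input, which is about what a billion short text rows come to, and round-number bandwidth figures):

```python
# Back-of-envelope, assuming a ~13 GB input (roughly 1e9 short text lines)
# and round-number bandwidth figures.
file_gb = 13
print(file_gb / 10)    # ~1.3 s just to read it off a 10 GB/s SSD
print(file_gb / 32)    # ~0.4 s to cross PCIe 4.0 x16 (~32 GB/s)
print(file_gb / 64)    # ~0.2 s on PCIe 5.0 x16 (~64 GB/s)
print(file_gb / 1000)  # ~0.01 s to stream through ~1 TB/s of GPU memory bandwidth
```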