I guess it depends on what we mean by "throwing hardware at it."
GPUs aren't magic. You still need to come up with a parallelizable algorithm.
The TL;DR is that the fastest solutions are basically map/reduce with a bunch of microoptimizations for parsing each line.
But before you do that, you need to divide up the work. You can't just hand each core a `file_size_bytes/core_count`-byte chunk of the file, because those chunks won't align with the line breaks. So you need to be clever about that part somehow.
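Here's a minimal sketch of that splitting step (in Python; the helper name and chunk count are my own, and it assumes ordinary newline-terminated text): pick rough byte offsets, then nudge each one forward to the next line break.

```python
import os

def chunk_offsets(path: str, n_chunks: int) -> list[tuple[int, int]]:
    """Split a file into (start, end) byte ranges aligned to line breaks."""
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(i * size // n_chunks)  # rough split point, probably mid-line
            f.readline()                  # skip ahead to the next '\n'
            offsets.append(f.tell())      # boundary now sits on a line break
    offsets.append(size)
    # Drop degenerate ranges that can appear if lines are longer than a chunk
    return [(a, b) for a, b in zip(offsets, offsets[1:]) if b > a]
```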
Once you've done that, you have a nice map/reduce that should scale linearly up to at least 20 or 30 cores. So in that sense, you can "throw hardware at it."
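And a rough sketch of the map/reduce itself, reusing the `chunk_offsets` helper from the sketch above (assume both live in the same script). This assumes the usual `station;temperature` line format and per-station min/mean/max aggregation; the file name and function names are illustrative.

```python
import os
from multiprocessing import Pool

def map_chunk(args):
    """Map step: fold one byte range into {station: (min, max, sum, count)}."""
    path, start, end = args
    stats = {}
    with open(path, "rb") as f:
        f.seek(start)
        for line in f.read(end - start).splitlines():
            station, _, value = line.partition(b";")
            t = float(value)
            mn, mx, s, c = stats.get(station, (t, t, 0.0, 0))
            stats[station] = (min(mn, t), max(mx, t), s + t, c + 1)
    return stats

def reduce_stats(partials):
    """Reduce step: merge the per-chunk dictionaries."""
    merged = {}
    for part in partials:
        for station, (mn, mx, s, c) in part.items():
            if station in merged:
                m0, x0, s0, c0 = merged[station]
                merged[station] = (min(m0, mn), max(x0, mx), s0 + s, c0 + c)
            else:
                merged[station] = (mn, mx, s, c)
    return merged

if __name__ == "__main__":
    path = "measurements.txt"  # hypothetical input file
    ranges = chunk_offsets(path, os.cpu_count() or 1)  # helper from the sketch above
    with Pool() as pool:
        partials = pool.map(map_chunk, [(path, a, b) for a, b in ranges])
    for station, (mn, mx, s, c) in sorted(reduce_stats(partials).items()):
        print(f"{station.decode()}: {mn:.1f}/{s / c:.1f}/{mx:.1f}")
```

The per-chunk dictionaries are tiny compared to the input, so the reduce step is essentially free, which is why this kind of thing scales close to linearly with core count.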
Whether or not any of that is a good fit for GPU acceleration, I don't know.
You should try the challenge. It's trickier than you think but surprisingly fun.
Your intuition about mapping to kernels is good. Basically all SQL, Polars, DuckDB, Pandas, etc. operators are pretty directly mappable to optimized GPU operators nowadays, including GPU-accelerated CSV/Parquet parsing. This was theoretically true starting maybe 10 years ago, and implemented in practice about 3-5 years ago. These systems allow escape hatches via Numba JIT etc. for custom kernels, but it's better to stay within the pure SQL/pandas/etc. subsets, which are already mapped to more carefully optimized kernels.
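For instance, something like this stays entirely inside the pandas-style subset, so both the CSV parse and the groupby run on prebuilt GPU operators (a sketch assuming RAPIDS cuDF is installed and a `station;temperature` layout; the file and column names are mine):

```python
import cudf  # RAPIDS cuDF: a pandas-like API backed by GPU kernels

# GPU-accelerated CSV parse, then a groupby that maps straight onto
# prebuilt GPU aggregation operators -- no custom kernels involved.
df = cudf.read_csv(
    "measurements.txt",            # hypothetical input file
    sep=";",
    header=None,
    names=["station", "temp"],
    dtype={"station": "str", "temp": "float32"},
)
out = df.groupby("station")["temp"].agg(["min", "mean", "max"])
print(out.sort_index().head())
```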
To get a feel for the timings, I like to think in terms of two classes: constant overheads and bandwidth limits.
Constant overhead:
- JIT'ing. By using pure SQL/pandas/etc, you can avoid most CUDA JIT costs
- GPU context creation etc.: similar; once startup is done and the initial memory pool is allocated, it gets reused
- Instruction passing: The pandas API is 'eager', so an expression like `df1 + df2` may involve a lot of back-and-forth of instructions between the CPU and GPU even if the data doesn't move. Dask & Polars introduce lazy semantics that allow fusion (see the sketch after this list), but GPU implementations haven't leveraged that yet AFAICT.
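To illustrate the eager-vs-lazy distinction, here's a CPU-side Polars sketch (file and column names are hypothetical): nothing executes until `collect()`, so the optimizer can fuse the scan and the aggregation into one pass instead of round-tripping instruction by instruction.

```python
import polars as pl

# Lazy: this only builds a query plan; the optimizer can fuse the scan
# and the aggregation into one pass when .collect() is finally called.
lazy = (
    pl.scan_csv("measurements.txt", separator=";", has_header=False,
                new_columns=["station", "temp"])
      .group_by("station")
      .agg(
          pl.col("temp").min().alias("min"),
          pl.col("temp").mean().alias("mean"),
          pl.col("temp").max().alias("max"),
      )
)
print(lazy.collect())
```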
Bandwidth limits:
- SSD is the biggest killer. Even "expensive" SSDs are still < 10 GB/s, so you need to chain a bunch of them to get 100 GB/s ingest
- CPU pathways throttle things down again (latency + bandwidth): GDS/GDN let you skip them
- PCIe cards are surprisingly fast nowadays. With PCIe 5+, the bottleneck quickly gets pushed back to the storage, and for most workloads it's probably easier to buy more PCIe+GPU pairs than to make any individual one faster
- Once data hits the GPU, things are fast :)
4s is a LOT of time wrt what even commodity GPU hardware can do, so benchmarks showing software failing to saturate it are fascinating to diagnose.
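For a rough sense of those numbers (back-of-envelope only; I'm assuming a ~13 GB input, which is about what a billion short text rows come to, and round-number bandwidth figures):

```python
# Back-of-envelope, assuming a ~13 GB input (roughly 1e9 short text lines)
# and round-number bandwidth figures.
file_gb = 13
print(file_gb / 10)    # ~1.3 s just to read it off a 10 GB/s SSD
print(file_gb / 32)    # ~0.4 s to cross PCIe 4.0 x16 (~32 GB/s)
print(file_gb / 64)    # ~0.2 s on PCIe 5.0 x16 (~64 GB/s)
print(file_gb / 1000)  # ~0.01 s to stream through ~1 TB/s of GPU memory bandwidth
```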