
It does not; the decompression is memory-to-memory, one tensor at a time, so it’s worse. They claim less than 200 GB/s on an A100, and their benchmarks suggest it’s somewhere between 1.5x and 4x slower at batch size 1, depending on GPU and model. This overhead of course mostly disappears with a large enough batch size, since the per-step decompression cost is amortized across the batch.

Other lossless codecs can hit 600 GB/s on the same hardware, so there should be some room for improvement. But the A100’s raw memory bandwidth is 1.6 TB/s.
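
For a rough feel of the numbers above, here is a back-of-the-envelope sketch. The three bandwidth figures are the ones quoted in this comment. Everything in `slowdown` is an assumption for illustration only: the function name, the toy cost model (a fixed decompression cost added to a step that is either weight-read bound or compute bound), and the default millisecond values are hypothetical and not taken from any benchmark.

```python
# Figures quoted in the comment, in GB/s.
A100_MEM_BW_GBPS = 1600.0     # raw memory bandwidth (1.6 TB/s)
CLAIMED_DECOMP_GBPS = 200.0   # upper bound claimed for the decompressor
OTHER_CODEC_GBPS = 600.0      # what other lossless codecs reportedly reach


def fraction_of_peak(throughput_gbps: float,
                     peak_gbps: float = A100_MEM_BW_GBPS) -> float:
    """Return throughput as a fraction of raw memory bandwidth."""
    return throughput_gbps / peak_gbps


def slowdown(batch_size: int,
             decomp_ms: float = 20.0,             # hypothetical fixed cost per step
             weight_read_ms: float = 10.0,        # hypothetical memory-bound floor
             compute_ms_per_sample: float = 0.5,  # hypothetical per-sample compute
             ) -> float:
    """Toy model: step time with decompression divided by step time without it.

    The baseline step is limited by whichever is larger: reading the weights
    once, or doing the compute for the whole batch. Decompression is modeled
    as a fixed extra cost that does not grow with batch size.
    """
    baseline_ms = max(weight_read_ms, batch_size * compute_ms_per_sample)
    return (baseline_ms + decomp_ms) / baseline_ms


if __name__ == "__main__":
    print("claimed decompressor / peak:", fraction_of_peak(CLAIMED_DECOMP_GBPS))
    print("other codecs / peak:", fraction_of_peak(OTHER_CODEC_GBPS))
    for b in (1, 8, 64, 512):
        print(f"batch {b}: {slowdown(b):.2f}x")
```

Under these made-up defaults the relative overhead stays flat while the step is memory bound and then shrinks as batch compute starts to dominate, which is the amortization effect described above. The absolute values mean nothing; only the shape of the curve is the point.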





