NVMe hungers, keeping it fed is hard work. Doing some serial read, decompress, checksum, write loop will leave if starved (QD<1) whenever you're doing anything but the last step.
Disk IO isn't async unless you use io_uring (well ok, writeback caches can be). So threads are almost a must to keep NVMe busy.
Conversely, waiting for blocking IO (e.g. directory enumeration) will keep your CPU starved. Here too the answer is more threads.