In the dd(1) case, we're talking about "having any pipe involved at all" vs. "no pipe, just copying internal to the command." The Linux kernel pipe buffer is only 64KB by default, while my hand-optimized `bs` usually lands at ~2MB. There's a big performance gap introduced by serially copying tiny (non-IO-queue-saturating) chunks at a time; it can literally be the difference between minutes and hours to complete a copy, especially when there's high IO latency on one end, e.g. on IaaS network disks.
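To put rough numbers on that, here's a back-of-envelope sketch. It's a deliberately crude model, not a benchmark: it assumes every chunk pays one full round trip strictly in series, and it ignores readahead, writeback caching, and any overlap between the two ends of a pipe. The function name and the disk size, latency, and bandwidth figures are made up for illustration.

```python
# Crude model: a copy done in fixed-size chunks, one at a time, where every
# chunk pays a fixed per-request latency plus its share of transfer time.
# All figures below are illustrative assumptions, not measurements.

def serial_copy_seconds(total_bytes, chunk_bytes, latency_s, bandwidth_bps):
    """Estimated wall time for a strictly serial chunked copy."""
    chunks = -(-total_bytes // chunk_bytes)  # ceiling division
    return chunks * latency_s + total_bytes / bandwidth_bps

TOTAL = 100 * 2**30        # 100 GiB to copy (assumed)
LATENCY = 0.005            # 5 ms per request on the slow end (assumed)
BANDWIDTH = 200 * 2**20    # 200 MiB/s sustained throughput (assumed)

for label, chunk in (("64 KiB (pipe-sized)", 64 * 2**10),
                     ("2 MiB (tuned bs)", 2 * 2**20)):
    secs = serial_copy_seconds(TOTAL, chunk, LATENCY, BANDWIDTH)
    print(f"{label}: {secs / 60:.1f} min")
```

The transfer-time term is the same for both chunk sizes; only the per-request latency term scales with the number of chunks, which is why the gap widens as latency on one end goes up.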