
> slow and expensive du commands

You'd be surprised how cheap these du(1) runs can be when you're executing the same du(1) command over and over. Think of it like running the same SQL query repeatedly: the first time you do it, the DBMS takes its time doing IO to pull the relevant disk pages into the disk cache; but every run after that, the query operates entirely over "hot" data. Hot filesystem metadata pages, in this case. (Plus, for the file(s) that were just written by your command, the query is hot because those pages are still in memory from being recently dirtied.)

I regularly unpack tarballs containing 10 million+ files, and a periodic du(1) over these takes only a few milliseconds of wall-clock time to complete.

(The other bottleneck with du(1), for deep file hierarchies, is printing all the subdirectory sizes, which is why the `-d 0` flag: it prints only the total.)
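As a sketch of both points (the paths and file count here are made up for the demo, not taken from my actual workload), timing the same `du -d 0` twice over a freshly created tree shows the cold-vs-hot difference:

```shell
# Build a throwaway tree of small files to walk.
demo=$(mktemp -d)
i=0
while [ "$i" -lt 1000 ]; do
    echo data > "$demo/file$i"
    i=$((i + 1))
done

# With -d 0, du prints a single line (the total for the top directory)
# instead of one line per subdirectory.
time du -d 0 "$demo"   # first run: metadata may need to be read from disk
time du -d 0 "$demo"   # second run: metadata is hot in the page cache
```

On a tree that was just unpacked, even the "first" run is effectively hot, because the unpack itself left the freshly dirtied metadata pages in memory.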

You might be worried about something else thrashing the disk cache, but in my experience I've never needed to run an ETL-like job on a system that's also running some other completely orthogonal IO-heavy prod workload. Usually such jobs are for restoring data onto new systems, migrating data between systems, etc. If there is any prod workload running on the box, it's one that's touching all the same data you're touching, so it keeps the disk cache coherent for you.


