As a person who runs a lot of ETL-like commands at work, I never find myself usi...

invalidator · on Oct 19, 2022

Try using "pv -d <pid>". It will monitor open files on the process and report progress on them.

1) this gets it out of the pipeline. 2) the program gets to have the named arguments. 3) pv's out put is on a separate terminal. 4) your job never needs to know.

Downside: it only sees the currently open files, so it doesn't work well for batch jobs. Still, it's handy to see which file it's on, and how fast the progress is.

Also, for rsync: "--info=progress2 --no-i-r" will show you the progress for a whole job.

gpderetta · on Oct 19, 2022

IIRC pv uses splice internally and simply tells the kernel to mive pipe buffers from one pipe to the other, so it is very unlikely to be a bottleneck.

derefr · on Oct 19, 2022

In the dd(1) case, we're talking about "having any pipe involved at all" vs "no pipe, just copying internal to the command." The Linux kernel pipe buffer size is only 64KB, while my hand-optimized `bs` usually lands at ~2MB. There's a big performance gap introduced by serially copying tiny (non-IO-queue-saturating) chunks at a time — it can literally be a difference of minutes vs. hours to complete a copy. Especially when there's high IO latency on one end, e.g. on IaaS network disks.

leni536 · on Oct 19, 2022

For rsync to get reliable global progress there is --no-i-r --info=progress2 . --no-i-r adds a bit of upfront work, but it's well worth it IMO.

derefr · on Oct 19, 2022

Thanks for that! (I felt like I had to be missing something, with how useless rsync progress usually was.)

prmoustache · on Oct 19, 2022

Sometimes you prefer predictability and information over sheer speed. If do a very large transfer that could take hours, I'd rather trade a bit of speed to know the progress and make sure nothing is stuck than launching in the blind and then repeat slow and expensive du commands to know where I am in the transfer or have to strace the process.

derefr · on Oct 19, 2022

> slow and expensive du commands

You'd be surprised how cheap these du(1) can be when you're running the same du(1) command over and over. Think of it like running the same SQL query over and over — the first time you do it, the DBMS takes its time doing IO to pull the relevant disk pages into the disk cache; but the Nth≥2 time, the query is entirely over "hot" data. Hot filesystem metadata pages, in this case. (Plus, for the file(s) that were just written by your command, the query is hot because those pages are still in memory from being recently dirty.)

I regularly unpack tarballs containing 10 million+ files; and periodic du(1) over these takes only a few milliseconds of wall-clock time to complete.

(The other bottleneck with du(1), for deep file hierarchies, is printing all the subdirectory sizes. Which is why the `-d 0` — to only print the total.)

You might be worried about something else thrashing the disk cache, but in my experience I've never needed to run an ETL-like job on a system that's also running some other completely orthogonal IO-heavy prod workload. Usually such jobs are for restoring data onto new systems, migrating data between systems, etc.; where if there is any prod workload running on the box, it's one that's touching all the same data you're touching, and so keeping disk-cache coherency.

MayeulC · on Oct 19, 2022

I usually fix 3. by redirecting the intermediate program to stderr before piping to pv.

My main use-case is netcat (nc).

As an aside, I prefer the BSD version, which I find is superior (IPv6 support, SOCKS, etc). "GNU Netcat" isn't even part of the GNU project, AFAIK. I also discovered Ncat while writing this, from the Nmap project; I'll give it a try.

derefr · on Oct 19, 2022

I don't quite understand what you mean — by default, most Unix-pipeline-y tools that produce on stdout, if they log at all, already write their logs to stderr (that being why stderr exists); and pv(1) already also writes to stderr (as if it wrote its progress to stdout, you wouldn't be able to use it in a pipe!)

But pv(1) is just blindly attempting to emit "\r[progress bar ASCII-art]\n" (plus a few regular lines) to stderr every second; and interleaving that into your PTY buffer along with actual lines of stderr output from your producer command, will just result in mush — a barrage of new progress bars on new lines, overwriting any lines emitted directly before them.

Having two things both writing to stderr, where one's trying to do something TUI-ish, and the other is attempting to write regular text lines, is the problem statement of 3, not the solution to it.

A solution, AFAICT, would look more like: enabling pv(1) to (somehow) capture the stderr of the entire command-line, and manage it, along with drawing the progress bar. Probably by splitting pv(1) into two programs — one that goes inside the command-line, watches progress, and emits progress logs as specially-tagged little messages (think: the UUID-like heredoc tags used in MIME-email binary-embeds) without any ANSI escape codes; and another, which wraps your whole command line, parsing out the messages emitted by the inner pv(1) to render a progress bar on the top/bottom of the PTY buffer, while streaming the regular lines across the rest of the PTY buffer. (Probably all on the PTY secondary buffer, like less(1) or a text editor.)

Another, probably simpler, solution would be to have a flag that tells pv(1) to log progress "events" (as JSON or whatever) to a named-FIFO filepath it would create (and then delete when the pipeline is over) — or to a loopback-interface TCP port it would listen on — and otherwise be silent; and then to have another command you can run asynchronously to your command-line, to open that named FIFO/connect to that port, and consume the events from it, rendering them as a progress bar; which would also quit when the FIFO gets deleted / when the socket is closed by the remote. Then you could run that command, instead of watch(2), in another tmux(2) pane, or wherever you like.

MayeulC · on Oct 21, 2022

My bad, I meant redirecting the stderr output to /dev/null. I lost my initial message and typed a short version too quickly :(

For a contrieved (but the first I can think of)

    dd if=/dev/abc status=progress 3>/dev/null | pv img

As for why I lost the message... I went over my HN time with the Leechblock extension (I know there's noprocrast here), anit doesn't cache POSTs...

gpderetta · on Oct 19, 2022

You could redirect each pipeline stage stderr to a fifo and tail it from another terminal. A bit annoying to do it by hand though.