Linux Text Manipulation

bikingbismuth · on March 28, 2024

I enjoy reading things like this. It’s posts like this that have helped me build my command line text processing skills over the years.

If you are early in your career, I suggest you work on these types of skills. It is surprising how often I have found myself on a random box that I needed to parse application logs “by hand”. This happens to me even in fancy, K8-rich environments.

cassianoleal · on March 28, 2024

I agree.

Just FYI, Kubernetes is abbreviated to k8s, not k8.

> K8s as an abbreviation results from counting the eight letters between the "K" and the "s".

https://kubernetes.io/docs/concepts/overview/

tentacleuno · on March 28, 2024

This is very similar to "a11y"[0] and "i18n"[1]. The abbreviation of words using this technique has become surprisingly common in the software industry.

[0]: https://www.wordnik.com/words/a11y [1]: https://www.wordnik.com/words/i18n

nrabulinski · on March 28, 2024

It’s called a numeronym https://en.m.wikipedia.org/wiki/Numeronym

tentacleuno · on March 28, 2024

Thank you! I truly do learn something new every day on here.

bregma · on March 28, 2024

Sometimes an n7m?

pseingatl · on March 28, 2024

Better watch out with Arabic speakers, 7 is used for a sound in Arabic we don't have in English.

foresto · on March 29, 2024

I strongly dislike this practice, because it impedes understanding.

"a11y" is especially bad, since it runs counter to the very thing it is supposed to represent.

I hope the fad passes quickly.

samatman · on March 29, 2024

I'd say the vision impaired are going to understand what "ay-one-one-why" means about as fast as the rest of us. I'm not a fan of the cutesy letter-number jargon either. If you're typing about it in Slack, sure, okay, but it shouldn't escape confinement.

But it's equal-opportunity annoying I reckon: no one knows what the hell `a11y` is about when they first see/hear it, but not in a way that's more onerous for screen reader and braille users than for anyone else.

foresto · on March 29, 2024

I was using the term in the general sense: capable of being understood. Not the narrow sense that refers specifically to vision.

samatman · on March 29, 2024

Sure, that's reasonable. Kind of circles back to "a11y" being technical language, which refers to a term of art, "accessibility", which is not identical to the word "accessibility" itself. This is at least part of why it gets used, although the main reason is really that a11y is easy to write and fast to read, while accessibility is neither.

baq · on March 28, 2024

i18n was a mystery for the longest time. a11y was just dumb for me until I learned what the numbers meant… last year, after what, two decades?

Btw txn is similar, but they didn’t bother with numbers, they just replaced the middle with an x.

ASalazarMX · on March 30, 2024

It's been decades and I still don't know what the original word for "l33t" is, but I reckon it must be quite large.

nikau · on March 31, 2024

Only the elite know the answer to that

FergusArgyll · on March 28, 2024

a16z: Andreessen Horowitz

JohnMakin · on March 28, 2024

Some code challenge sites offer all their challenges in bash - I highly recommend working through these if you want to get better at this type of stuff. Some problems are surprisingly simple, others torturously difficult.

s3arch · on March 29, 2024

Can you suggest a few?

JohnMakin · on March 30, 2024

It's been a couple of years since I did any competitive shell problems, so I cannot recall specific ones. However, this HackerRank page is how I do interview prep for anything shell-related:

https://www.hackerrank.com/domains/shell

reidjs · on March 28, 2024

Is there something like leetcode for string manipulation exercises like this?

keybored · on March 28, 2024

> If you are early in your career, I suggest you work on these types of skills. It is surprising how often I have found myself on a random box that I needed to parse application logs “by hand”. This happens to me even in fancy, K8-rich environments.

It’s surprising how many times you have to ad hoc parse due to the tools being so poor. It’s endemic.

eternityforest · on March 29, 2024

Regex can help you with fairly complicated source code edits too, like changing the order of parameters across a multi-language project where there's no automated tool that can just do it.
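As a rough illustration of that kind of edit, here is a minimal sketch that swaps the two arguments of a call. The function name `draw` and the helper `swap_draw_args` are hypothetical, and the pattern assumes simple arguments with no nested parentheses or embedded commas, so anything fancier would need a real parser.

```shell
# Hypothetical example: swap the two arguments of every draw(a, b) call.
# Assumes plain arguments: no nested parentheses, no commas inside them.
swap_draw_args() {
  sed -E 's/draw\(([^,()]+), *([^,()]+)\)/draw(\2, \1)/g'
}

# Reads from stdin and writes to stdout, so nothing is modified in place.
printf '%s\n' 'draw(width, height);' | swap_draw_args
```

Running it as a filter first and reviewing the result with `diff` is a cheap way to check the pattern before applying it to real files.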

thomasahle · on March 28, 2024

I've been saving a lot of time in the terminal recently with shell-gpt (https://github.com/tbckr/sgpt):

    $ sgpt -s "The command 'sp current' outputs
    > Album        Tea For The Tillerman (Remastered 2020)
    > AlbumArtist  Yusuf / Cat Stevens
    > Artist       Yusuf / Cat Stevens
    > Title        Wild World
    > I want just 'Wild World by Yusuf/Cat Stevens'"

    sp current | awk -F'  +' '/Title/{title=$2} /Artist/{artist=$2} END{print title " by " artist}'
    [E]xecute, [D]escribe, [A]bort: A
    E

    Wild World by Yusuf / Cat Stevens

Looking back at it, the awk command it uses is actually pretty clean.

arp242 · on March 28, 2024

It should use ^Title and ^Artist, otherwise something with "Artist" or "Title" will give wonky results.
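A minimal sketch of the anchored version, using the sample output quoted earlier in the thread as stand-in input. The helper name `now_playing` and the second track are hypothetical, added only to show the failure mode the anchors avoid.

```shell
# Sample input copied from the thread; `sp current` itself is not needed here.
sample='Album        Tea For The Tillerman (Remastered 2020)
AlbumArtist  Yusuf / Cat Stevens
Artist       Yusuf / Cat Stevens
Title        Wild World'

# Anchored patterns: only lines that START with the key can match, so
# "AlbumArtist" or a title containing the word "Artist" is ignored.
now_playing() {
  awk -F'  +' '/^Title/{title=$2} /^Artist/{artist=$2} END{print title " by " artist}'
}

printf '%s\n' "$sample" | now_playing

# Hypothetical track whose title contains "Artist". An unanchored /Artist/
# would also match the Title line and overwrite the artist.
printf 'Artist       Some Band\nTitle        The Artist\n' | now_playing
```

An exact comparison such as `$1 == "Title"` would be stricter still, at the cost of being a little longer to type.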

More importantly, you can get the same results by just one or two dbus-send commands instead of using that "sp" script + something to clean it up.

The sp-metadata function uses tons of processes to clean the output[1], and sp-current launches a few more. If you're doing this in a loop for your WM status display this sort of stuff really adds up. Even on modern systems, launching processes isn't free and is relatively slow, and launching >20 of them every few seconds is going to use non-trivial amounts of CPU and will needlessly drain your battery.

I don't really know what that dbus command outputs, but I bet you might go a long way with "dbus-send ... | grep -o ..." or something.

So in general I'd say this is a classic case of "you're not even using the right solution, and no amount of GPT is going to help".

[1]: https://gist.github.com/streetturtle/fa6258f3ff7b17747ee3#fi...

thomasahle · on March 30, 2024

The "^Title and ^Artist" is a good point.

The "sp current" issue seems less fair to bring up, as the entire article was using that throughout as well.

eternityforest · on March 29, 2024

That's really awesome!

Shell scripting is the perfect use case for ChatGPT. Simple enough that AI can handle it, annoying enough that I don't really want to do it myself, and something I do rarely enough that I don't really know any of the commands deeply.

mhuffman · on March 28, 2024

Very cool! It reminds me of Microsoft Prose[1]. I guess they were way ahead of their time on that one.

[1]: https://www.microsoft.com/en-us/research/group/prose/

samatman · on March 29, 2024

That's a nice tool, I'll have to try it.

Out of curiosity I asked bog-standard web ChatGPT (4) if it could do the job in awk, took it five times to get it right. Whatever prompt shell-gpt is using, it works.

thomastjeffery · on March 28, 2024

That's terrifying. How do you guarantee this (or its user) doesn't end up running `sed -i`?

tyingq · on March 28, 2024

I'd prefer a general approach that used the first column as a key, and the rest as the value...into a dict/hash. Then if you need the Album title or something else later, it's easy to alter.

I'm sure awk could do that, but with Perl:

  sp current | perl -nE '/(\S+)\s+(.*)/ and $d{$1}=$2;END{say "$d{Title} by $d{Artist}"}'

Zhyl · on March 28, 2024

The omission of Perl from this post was pretty striking - not even mentioned in the final thoughts (instead thought to do the whole thing in awk??).

Perl has fallen from grace as a general programming language and even as a systems administration language, but it's still absolutely the best and most ubiquitous tool for text manipulation.

undershirt · on March 28, 2024

Can you write a Perl program to replace the content of anything found inside double-quotes with the result of piping it to the command `fmt`?

shawn_w · on March 28, 2024

Easily, though you don't need to drag external programs like fmt into it - perl comes with a standard module for word wrapping.

Text::Balanced (for extracting the quoted text) and Text::Wrap would be the core of such a program.

https://perldoc.perl.org/Text::Balanced

https://perldoc.perl.org/Text::Wrap

undershirt · on March 28, 2024

Could you write it for me as a one-liner? I’d like to learn more Perl and this would help.

shawn_w · on March 28, 2024

Oh, I wouldn't try to squeeze it down into a one liner. Maybe for a simple definition of the quoted text that doesn't need to account for edge cases like escaped quotes inside...

undershirt · on March 29, 2024

Just a normal program then? No edge cases needed

evh · on March 28, 2024

    awk '{ k = $1; sub("^[^ ]* *", "", $0); d[k] = $0; } END { print d["Title"], "by", d["Artist"]; }'

Personally I'm a bit of a sed fan

    sed -nE '/^Artist/{N;s/^Artist +(.*)\nTitle +(.*)/\2 by \1/p;}'

tetris11 · on March 28, 2024

I am curious as to why not just use:

    playerctl metadata artist
    playerctl metadata title

as provided by MPRIS

https://wiki.archlinux.org/title/MPRIS

> MPRIS (Media Player Remote Interfacing Specification) is a standard D-Bus interface which aims to provide a common programmatic API for controlling media players.

> It provides a mechanism for discovery, querying and basic playback control of compliant media players, as well as a track list interface which is used to add context to the active media item.

hifromwork · on March 28, 2024

Because it's an article about linux text manipulation, not about solving this specific problem.

dmonitor · on March 29, 2024

There are a lot of better ways to do this. For starters, the sp script is bash. He could just edit the script. There's also a function for returning the metadata in machine-parseable syntax:

  function sp-metadata {
    # Prints the currently playing track in a parseable format.
  
    dbus-send                                                                   \
    --print-reply                                  `# We need the reply.`       \
    --dest=$SP_DEST                                                             \
    $SP_PATH                                                                    \
    org.freedesktop.DBus.Properties.Get                                         \
    string:"$SP_MEMB" string:'Metadata'                                         \
    | grep -Ev "^method"                           `# Ignore the first line.`   \
    | grep -Eo '("(.*)")|(\b[0-9][a-zA-Z0-9.]*\b)' `# Filter interesting fields.`\
    | sed -E '2~2 a|'                              `# Mark odd fields.`         \
    | tr -d '\n'                                   `# Remove all newlines.`     \
    | sed -E 's/\|/\n/g'                           `# Restore newlines.`        \
    | sed -E 's/(xesam:)|(mpris:)//'               `# Remove ns prefixes.`      \
    | sed -E 's/^"//'                              `# Strip leading...`         \
    | sed -E 's/"$//'                              `# ...and trailing quotes.`  \
    | sed -E 's/"+/|/'                             `# Regard "" as separator.` \
    | sed -E 's/ +/ /g'                            `# Merge consecutive spaces.`
  }

But as the other replier mentioned, the point was to show off an example of how text manipulation skills can solve many problems, not solve this specific problem in the best way possible.
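On the process-count concern raised elsewhere in the thread: the last five `sed` stages of that function are all line-local substitutions that run after newlines have been restored, so they could plausibly be collapsed into a single `sed` process. The sketch below compares the two forms on a made-up input line. The sample line and the helper names `five_seds` and `one_sed` are assumptions, not taken from the real dbus output.

```shell
# Hypothetical input line, shaped like what the earlier stages might emit.
line='"xesam:title""Wild  World"'

# Tail of the pipeline as written in the function: five sed processes.
five_seds() {
  sed -E 's/(xesam:)|(mpris:)//' | sed -E 's/^"//' | sed -E 's/"$//' \
    | sed -E 's/"+/|/' | sed -E 's/ +/ /g'
}

# Same substitutions in the same order, applied by one sed process.
one_sed() {
  sed -E -e 's/(xesam:)|(mpris:)//' -e 's/^"//' -e 's/"$//' \
         -e 's/"+/|/' -e 's/ +/ /g'
}

printf '%s\n' "$line" | five_seds
printf '%s\n' "$line" | one_sed
```

Both forms should print the same line. The earlier stages are deliberately left alone, because the newline juggling there changes what `^` and `$` refer to.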

junkblocker · on April 2, 2024

TIL a neat way to add comments to continued shell lines. Thanks!
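For anyone who missed the trick, here is a minimal sketch of it in isolation. Each comment sits inside backticks, so the shell treats it as a command substitution containing only a comment, which expands to nothing. The function name `shout` is made up for the example.

```shell
# Inline comments on continued lines: `# ...` expands to an empty string,
# so it can sit between the command and the trailing backslash.
shout() {
  printf '%s\n' "$1"   `# Emit the argument on its own line.` \
  | tr 'a-z' 'A-Z'     `# Upper-case it.`
}

shout "wild world"
```

One caveat, as far as I know: each backtick pair is still a command substitution, so the shell typically pays for a subshell per comment, which may matter in a script that runs in a tight loop.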

kazinator · on March 28, 2024

  $ txr by.txr spdata
  Wild World by Yusuf / Cat Stevens

  $ cat by.txr
  @(gather)
  Artist @artist
  Title @title
  @(end)
  @(output)
  @title by @artist
  @(end)

For one-liner outputs, I often use @(do (put-line `@title by @artist`)).

frankjr · on March 28, 2024

I was scratching my head for a bit before I realized the final script in the article produces a slightly different output than the previous diff shows (Yusuf/Cat vs Yusuf / Cat). Anyway, here's a Nushell version. There's likely a way to use "detect columns" here but it doesn't seem to like the repeated value or something.

  $ cat sp_out | lines | parse '{key} {value}' | str trim | transpose -rd | format pattern '{Title} by {Artist}'
  Wild World by Yusuf / Cat Stevens

https://www.nushell.sh/cookbook/parsing.html

elesiuta · on March 28, 2024

I use python more often than tools like awk, which I often forget the syntax of, so I made pyxargs to quickly run python code in the shell for small tasks like this

  sp current | pyxr -0 -g "(Artist)\s+(.+)\n(Title)\s+(.+)" -p "{3} by {1}"

jiehong · on April 2, 2024

For fun, I tried doing that with duckdb, and here is my result:

    duckdb -list -c "from (from read_text('sp-current') select regexp_extract_all(content, '(\S+)\s+(.+)', 2) as value) select concat(value[4], ' by ', value[2])" | sed -n 2p