Linux Text Manipulation (yusuf.fyi)
112 points by zerojames on March 28, 2024 | 44 comments


I enjoy reading things like this. It’s posts like this that have helped me build my command line text processing skills over the years.

If you are early in your career, I suggest you work on these types of skills. It is surprising how often I have found myself on a random box that I needed to parse application logs “by hand”. This happens to me even in fancy, K8-rich environments.


I agree.

Just FYI, Kubernetes is abbreviated to k8s, not k8.

> K8s as an abbreviation results from counting the eight letters between the "K" and the "s".

https://kubernetes.io/docs/concepts/overview/


This is very similar to "a11y"[0] and "i18n"[1]. The abbreviation of words using this technique has become surprisingly common in the software industry.

[0]: https://www.wordnik.com/words/a11y
[1]: https://www.wordnik.com/words/i18n
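
As a rough illustration of the counting rule quoted above, here is a minimal POSIX shell sketch. The function name `numeronym` is made up for this example, and it assumes a single ASCII word as input:

```shell
# Hypothetical helper (not from the thread): abbreviate a word as
# first letter + number of letters in between + last letter.
numeronym() {
  word=$1
  len=${#word}
  # Words of three letters or fewer are printed unchanged.
  if [ "$len" -le 3 ]; then
    printf '%s\n' "$word"
    return
  fi
  first=${word%"${word#?}"}   # strip everything after the first character
  last=${word#"${word%?}"}    # strip everything before the last character
  printf '%s%d%s\n' "$first" "$((len - 2))" "$last"
}

numeronym kubernetes      # k8s
numeronym accessibility   # a11y
```

The parameter expansions avoid spawning `cut` or `sed` just to pick out two characters.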



Thank you! I truly do learn something new every day on here.


Sometimes an n7m?


Better watch out with Arabic speakers, 7 is used for a sound in Arabic we don't have in English.


I strongly dislike this practice, because it impedes understanding.

"a11y" is especially bad, since it runs counter to the very thing it is supposed to represent.

I hope the fad passes quickly.


I'd say the vision impaired are going to understand what "ay-one-one-why" means about as fast as the rest of us. I'm not a fan of the cutesy letter-number jargon either. If you're typing about it in Slack, sure, okay, but it shouldn't escape confinement.

But it's equal-opportunity annoying I reckon: no one knows what the hell `a11y` is about when they first see/hear it, but not in a way that's more onerous for screen reader and braille users than for anyone else.


I was using the term in the general sense: capable of being understood. Not the narrow sense that refers specifically to vision.


Sure, that's reasonable. Kind of circles back to "a11y" being technical language, which refers to a term of art, "accessibility", which is not identical to the word "accessibility" itself. This is at least part of why it gets used, although the main reason is really that a11y is easy to write and fast to read, while accessibility is neither.


i18n was a mystery for the longest time. a11y was just dumb for me until I learned what the numbers meant… last year, after what, two decades?

Btw txn is similar, but they didn’t bother with numbers, they just replaced the middle with an x.


It's been decades and I still don't know what the original word for "l33t" is, but I reckon it must be quite large.


Only the elite know the answer to that


a16z: Andreessen Horowitz


Some code challenge sites offer all their challenges in bash - I highly recommend working through these if you want to get better at this type of stuff. Some problems are surprisingly simple, others torturously difficult.


Can you suggest a few?


It's been a couple of years since I did any competitive shell problems and I can't recall any, but this HackerRank page is how I do interview prep for anything shell-related:

https://www.hackerrank.com/domains/shell


Is there something like leetcode for string manipulation exercises like this?


> If you are early in your career, I suggest you work on these types of skills. It is surprising how often I have found myself on a random box that I needed to parse application logs “by hand”. This happens to me even in fancy, K8-rich environments.

It’s surprising how many times you have to ad hoc parse due to the tools being so poor. It’s endemic.


Regex can help you with fairly complicated source code edits too, like changing the order of parameters in some multi language project where there's no automated tool that can just do it.
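
A minimal sketch of that kind of edit, assuming a hypothetical two-argument function `draw(a, b)` whose arguments should be swapped at every call site. It uses `sed -E` capture groups and only handles simple arguments. Nested calls or commas inside string literals would need a real parser.

```shell
# Hypothetical example: swap the two arguments of every draw(a, b) call.
# `draw` is a made-up function name; reads stdin, writes stdout.
swap_draw_args() {
  sed -E 's/draw\(([^,()]+), *([^,()]+)\)/draw(\2, \1)/g'
}

echo 'draw(width, height);' | swap_draw_args   # draw(height, width);
```

Note that the pattern has no word boundary, so it would also rewrite a call to something like `redraw(a, b)`. Reviewing the resulting diff before committing is the safer habit.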


I've been saving a lot of time in the terminal recently with shell-gpt (https://github.com/tbckr/sgpt):

    $ sgpt -s "The command 'sp current' outputs
    > Album        Tea For The Tillerman (Remastered 2020)
    > AlbumArtist  Yusuf / Cat Stevens
    > Artist       Yusuf / Cat Stevens
    > Title        Wild World
    > I want just 'Wild World by Yusuf/Cat Stevens'"

    sp current | awk -F'  +' '/Title/{title=$2} /Artist/{artist=$2} END{print title " by " artist}'
    [E]xecute, [D]escribe, [A]bort: A
    E

    Wild World by Yusuf / Cat Stevens
Looking back at it, the awk command it uses is actually pretty clean.


It should use ^Title and ^Artist, otherwise any line that merely contains "Artist" or "Title" (the AlbumArtist line, or an album name with "Title" in it) will also match and can give wonky results.
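
A sketch of the anchored variant, wrapped in a made-up function name (`now_playing`) so it can be fed sample input. The sample lines are invented to show a case where an unanchored pattern would pick up the Album line:

```shell
# Sketch only; `now_playing` is a hypothetical name.
# Reads "Key   Value" lines (two or more spaces as separator) on stdin.
now_playing() {
  awk -F'  +' '
    /^Title/  { title  = $2 }
    /^Artist/ { artist = $2 }
    END       { print title " by " artist }
  '
}

printf '%s\n' \
  'Artist       Some Band' \
  'Title        Wild World' \
  'Album        The Artist Title' | now_playing   # Wild World by Some Band
```

Comparing the whole key, as in `$1 == "Title"`, would be stricter still, since `^Title` also matches a key such as `TitleSort`.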

More importantly, you can get the same results by just one or two dbus-send commands instead of using that "sp" script + something to clean it up.

The sp-metadata function uses tons of processes to clean the output[1], and sp-current launches a few more. If you're doing this in a loop for your WM status display this sort of stuff really adds up. Even on modern systems launching processes isn't free and is relatively slow, and launching >20 of them every few seconds is going to use non-trivial amounts of CPU and will needlessly drain your battery.

I don't really know what that dbus command outputs, but I bet you could go a long way with "dbus-send ... | grep -o ..." or something.

So in general I'd say this is a classic case of "you're not even using the right solution, and no amount of GPT is going to help".

[1]: https://gist.github.com/streetturtle/fa6258f3ff7b17747ee3#fi...


The "^Title and ^Artist" is a good point.

The "sp current" issue seems less fair to bring up, as the entire article was using that throughout as well.


That's really awesome!

Shell scripting is the perfect use case for ChatGPT. Simple enough that AI can handle it, annoying enough that I don't really want to do it myself, and something I do rarely enough that I don't really know any of the commands deeply.


Very cool! It reminds me of Microsoft Prose[1]. I guess they were way ahead of their time on that one.

[1]: https://www.microsoft.com/en-us/research/group/prose/


That's a nice tool, I'll have to try it.

Out of curiosity I asked bog-standard web ChatGPT (4) if it could do the job in awk, took it five times to get it right. Whatever prompt shell-gpt is using, it works.


That's terrifying. How do you guarantee this (or its user) doesn't end up running `sed -i`?


I'd prefer a general approach that used the first column as a key, and the rest as the value...into a dict/hash. Then if you need the Album title or something else later, it's easy to alter.

I'm sure awk could do that, but with Perl:

  sp current | perl -nE '/(\S+)\s+(.*)/ and $d{$1}=$2;END{say "$d{Title} by $d{Artist}"}'


The omission of Perl from this post was pretty striking - not even mentioned in the final thoughts (instead thought to do the whole thing in awk??).

Perl has fallen from grace as a general programming language and even as a systems administration language, but it's still absolutely the best and most ubiquitous tool for text manipulation.


Can you write a Perl program to replace the content of anything found inside double-quotes with the result of piping it to the command `fmt`?


Easily, though you don't need to drag external programs like fmt into it - perl comes with a standard module for word wrapping.

Text::Balanced (for extracting the quoted text) and Text::Wrap would be the core of such a program.

https://perldoc.perl.org/Text::Balanced

https://perldoc.perl.org/Text::Wrap


Could you write it for me as a one-liner? I’d like to learn more Perl and this would help.


Oh, I wouldn't try to squeeze it down into a one liner. Maybe for a simple definition of the quoted text that doesn't need to account for edge cases like escaped quotes inside...


Just a normal program then? No edge cases needed


    awk '{ k = $1; sub("^[^ ]* *", "", $0); d[k] = $0; } END { print d["Title"], "by", d["Artist"]; }'
Personally I'm a bit of a sed fan

    sed -nE '/^Artist/{N;s/^Artist +(.*)\nTitle +(.*)/\2 by \1/p;}'


I am curious as to why not just use:

    playerctl metadata artist
    playerctl metadata title
as provided by MPRIS

https://wiki.archlinux.org/title/MPRIS

> MPRIS (Media Player Remote Interfacing Specification) is a standard D-Bus interface which aims to provide a common programmatic API for controlling media players.

> It provides a mechanism for discovery, querying and basic playback control of compliant media players, as well as a track list interface which is used to add context to the active media item.


Because it's an article about linux text manipulation, not about solving this specific problem.


There are a lot of better ways to do this. For starters, the sp script is bash. He could just edit the script. There's also a function for returning the metadata in machine-parseable syntax:

  function sp-metadata {
    # Prints the currently playing track in a parseable format.
  
    dbus-send                                                                   \
    --print-reply                                  `# We need the reply.`       \
    --dest=$SP_DEST                                                             \
    $SP_PATH                                                                    \
    org.freedesktop.DBus.Properties.Get                                         \
    string:"$SP_MEMB" string:'Metadata'                                         \
    | grep -Ev "^method"                           `# Ignore the first line.`   \
    | grep -Eo '("(.*)")|(\b[0-9][a-zA-Z0-9.]*\b)' `# Filter interesting fields.` \
    | sed -E '2~2 a|'                              `# Mark odd fields.`         \
    | tr -d '\n'                                   `# Remove all newlines.`     \
    | sed -E 's/\|/\n/g'                           `# Restore newlines.`        \
    | sed -E 's/(xesam:)|(mpris:)//'               `# Remove ns prefixes.`      \
    | sed -E 's/^"//'                              `# Strip leading...`         \
    | sed -E 's/"$//'                              `# ...and trailing quotes.`  \
    | sed -E 's/"+/|/'                             `# Regard "" as separator.`  \
    | sed -E 's/ +/ /g'                            `# Merge consecutive spaces.`
  }

But as the other replier mentioned, the point was to show off an example of how text manipulation skills can solve many problems, not solve this specific problem in the best way possible.


TIL a neat way to add comments to continued shell lines. Thanks!
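
For anyone else who hadn't seen the trick: each `# ...` wrapped in backticks is a command substitution whose body is only a comment, so it expands to nothing and the trailing backslash still continues the line. A small sketch, with a made-up function name:

```shell
# Hypothetical helper: count the distinct words on stdin, with a
# comment on every continued line of the pipeline.
count_unique_words() {
  tr -s ' ' '\n'  `# Put each word on its own line.` \
  | sort          `# Bring duplicates next to each other.` \
  | uniq          `# Keep one copy of each word.` \
  | wc -l         `# Count the remaining lines.`
}

echo 'a b a c' | count_unique_words   # 3 (some wc builds pad it with spaces)
```

The backslash has to stay the very last character on each line. Each backtick pair is still a command substitution, so in a hot loop it may cost a little extra, which ties in with the process-count concern raised elsewhere in the thread.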


  $ txr by.txr spdata
  Wild World by Yusuf / Cat Stevens

  $ cat by.txr
  @(gather)
  Artist @artist
  Title @title
  @(end)
  @(output)
  @title by @artist
  @(end)
For one-liner outputs, I often use @(do (put-line `@title by @artist`)).


I was scratching my head for a bit before I realized the final script in the article produces a slightly different output than the previous diff shows (Yusuf/Cat vs Yusuf / Cat). Anyway, here's a Nushell version. There's likely a way to use "detect columns" here but it doesn't seem to like the repeated value or something.

  $ cat sp_out | lines | parse '{key} {value}' | str trim | transpose -rd | format pattern '{Title} by {Artist}'
  Wild World by Yusuf / Cat Stevens
https://www.nushell.sh/cookbook/parsing.html


I use Python more often than tools like awk, whose syntax I often forget, so I made pyxargs to quickly run Python code in the shell for small tasks like this:

  sp current | pyxr -0 -g "(Artist)\s+(.+)\n(Title)\s+(.+)" -p "{3} by {1}"


For fun, I tried doing that with duckdb, and here is my result:

    duckdb -list -c "from (from read_text('sp-current') select regexp_extract_all(content, '(\S+)\s+(.+)', 2) as value) select concat(value[4], ' by ', value[2])" | sed -n 2p



