Idiomatic Awk (2010)

rusk · 2024-10-06T13:05:51 1728219951

I love awk as a language / framework. If it got an uplift to make it useful for more complex problems it would be an absolute winner for lots of basic data processing tasks.

I’ve often written incantations many lines long and even broken out to actually writing awk scripts from time to time.

Once the penny drops with it, it’s great fun, but it’s absolutely useless once your problems get to any degree of sophistication.

I typically move into Python at this stage. Perl and Ruby are probably a more elegant fit here but those aren’t rows I want to have.

In this day and age, awk really needs CSV (RFC-4180) support and better semantic scoping and library support.

I’d also think it would be neat as an embedded language for various data processing platforms but if we haven’t seen it yet I doubt we will ever see it.

EDIT support for file formats beyond plain text would also be a winner.

Rotundo · 2024-10-06T13:48:47 1728222527

Good news! Not only are there AWK implementations (for instance GoAWK) that can use CSV, but even the one true AWK does do CSV and Unicode nowadays!

https://arstechnica.com/gadgets/2022/08/unix-legend-who-owes...

theamk · 2024-10-08T13:58:24 1728395904

That's why I never write awk one-liners anymore: if I cannot get it done with cut/sed, I jump to perl right away. After all, it was designed to be an AWK replacement, and has features like autosplit (-a) and line-loop (-p) [1] that were specifically designed to make porting awk programs easy. And if you need more than simple string matching, Perl still maintains a huge CPAN library of modules, so JSON or Date::Parse is one command away.

(Note I still use Python/C++ for real programs; perl is only for one-liners to replace the AWK)

[0] https://news.ycombinator.com/item?id=36650120

[1] https://perldoc.perl.org/perlrun

AstroJetson · 2024-10-09T08:02:37 1728460957

> EDIT support for file formats beyond plain text would also be a winner.

Can you say some more about this, what kind of file format do you want? JSON? XML?

rusk · 2024-10-09T18:01:15 1728496875

Excel

AstroJetson · 2024-10-12T12:14:26 1728735266

AWK now supports CSV files from Excel. I’ve used it a few times and have been pretty happy with it. Have you tried that?

Elfener · 2024-10-06T11:32:34 1728214354

Technically an empty awk program is an implementation of cat(1) (before it came back from berkeley waving flags, anyway).

Of course no awk will run an empty program, so the article's '1' or '"a"' or other truthy value is required.
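
A quick sketch of that, assuming any POSIX awk: a bare truthy pattern with no action falls back to the default action of printing the record, so the program behaves like cat. The function name awk_cat is made up for illustration.

```shell
# A pattern with no action defaults to { print }, and 1 is always true,
# so every input line is copied to the output unchanged.
awk_cat() { awk 1 "$@"; }

printf 'one\ntwo\n' | awk_cat
```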

sevensor · 2024-10-07T13:21:14 1728307274

I’ve been appreciating awk more and more lately as a desktop calculator. There are things dc and bc can’t handle that awk will cheerfully compute for me. Of all the tools I can expect to be present on a Linux machine, it gives me the easiest way to compute a logarithm in a shell pipeline.
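
For instance, a base-10 logarithm in a pipeline might look roughly like this. It is only a sketch: the log10 wrapper name and the four-decimal format are arbitrary choices, not anything standard.

```shell
# log() in awk is the natural logarithm, so divide by log(10) for base 10.
log10() { awk '{ printf "%.4f\n", log($1) / log(10) }'; }

echo 1000 | log10
echo 2 | log10
```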

mananaysiempre · 2024-10-06T12:56:08 1728219368

Funny how exploiting uninitialized variables is evil in (nonstrict) JavaScript but good in Awk. I agree that’s true, mind you, I just can’t pinpoint what about the languages’ designs makes it work in one case but not the other.

AdieuToLogic · 2024-10-07T03:07:43 1728270463

> Funny how exploiting uninitialized variables is evil in (nonstrict) JavaScript but good in Awk. I agree that’s true, mind you, I just can’t pinpoint what about the languages’ designs makes it work in one case but not the other.

I believe Awk's semantic behaviour is a kindred spirit of Perl's autovivification[0].

It is often handy and sometimes can bite one in the proverbial backside.

0 - https://en.wikipedia.org/wiki/Autovivification

7thaccount · 2024-10-07T11:32:54 1728300774

Awk is meant for short scripts (often one liners) that get some piped input. A JS program could be used in that manner, but I'd guess it'd be more like an exception than the norm. Different tools for different needs.

theamk · 2024-10-07T03:53:56 1728273236

A lot of things are good in a language designed for one-liners, but will be considered evil in languages which are used for longer programs.

For an opposite example, consider: a 1000 statement JS program is perfectly fine. A 1000 statement AWK script is very evil; most people would say it should be replaced with a real language ASAP.

asicsp · 2024-10-06T13:48:58 1728222538

Not sure if it was intended, but it helps to keep CLI one-liners concise. For example:

    awk '!a[$0]++'

    awk '/search/{n=2} n && n--'
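
Spelled out a bit, as far as I can tell this is how the two work. The function names and the variable name seen are only for illustration (the originals use a), and the sample input is made up.

```shell
# Print each distinct line once, in first-seen order.
# seen[$0] starts out unset (treated as 0), so !seen[$0]++ is true
# only the first time a given line shows up.
dedupe() { awk '!seen[$0]++'; }

# Print every line matching /search/ plus the line right after it.
# A match sets n to 2; "n && n--" is true while n is nonzero and
# counts it down by one for each printed line.
match_plus_next() { awk '/search/ { n = 2 } n && n--'; }

printf 'a\nb\na\nc\nb\n' | dedupe
printf 'x\nsearch 1\ny\nz\n' | match_plus_next
```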

rakoo · 2024-10-07T08:02:39 1728288159

I think it's ok in awk because awk is seen as a scripting language, where brevity matters and there aren't many variables anyway

rusk · 2024-10-06T13:10:03 1728220203

Guaranteed initialisation rules I would imagine. A lot of languages don’t do this, or assign null or something. For a language with a narrow scope of application you can do this, but in more general-purpose languages it makes less sense.

mananaysiempre · 2024-10-06T13:15:50 1728220550

Note Awk does have a special value for uninitialized variables—its distinguishing feature is that it compares equal to both zero and empty string simultaneously. (No official way to spell that value, but you can spell it e.g. “undefined” if you never initialize a variable called “undefined”.) It’s really really close to old JS in that respect, and no wonder, given that Awk was an explicit inspiration for JS. But somehow it works for Awk and not for JS, even at the hundred-line scale that’s common and reasonable for both languages.
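
A small sketch of the dual comparison, assuming a POSIX-conforming awk (the function name uninit_demo and the variable names are arbitrary):

```shell
uninit_demo() {
    awk 'BEGIN {
        # x is never assigned, so it holds the uninitialized value,
        # which compares equal to the number 0 and to the empty string
        if (x == 0 && x == "") {
            print "x equals both 0 and the empty string"
        }
        # y is an assigned number; comparing it with a string constant
        # is a string comparison of "0" against "", which is false
        y = 0
        if (y == "") {
            print "y equals the empty string"
        } else {
            print "y does not equal the empty string"
        }
    }'
}

uninit_demo
```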

rusk · 2024-10-06T18:04:54 1728237894

I never considered that awk was a precursor to JavaScript but I can certainly see that

mananaysiempre · 2024-10-06T18:48:19 1728240499

Found it! Wirfs-Brock & Eich, “JavaScript: the first 20 years” (HOPL ’20) [1], §3.1:

> The syntax of JavaScript 1.0 was directly modeled after the statement syntax of the C programming language with some AWK inspired embellishments.

(the rest of the section is also interesting in that respect).

[1] https://dl.acm.org/doi/10.1145/3386327 (CC-BY open access)

rusk · 2024-10-06T21:05:59 1728248759

I could feel it all along!

hi41 · 2024-10-06T13:34:39 1728221679

Doesn’t awk set null for all uninitialized variables? Does that solve the problem for you? We can use that feature to find lines with same values which form a group.

ykonstant · 2024-10-07T16:09:07 1728317347

This is overall a good guide, especially emphasizing the 'condition {action}' pattern; it is a very elegant and clear construct. However, some of the suggestions lean toward the "too clever" side and can make the code incomprehensible for the newcomer. For instance, if you will use snippets like `awk '!a[$0]++'`, do make sure to comment their use for your sake and others'.

jnordwick · 2024-10-07T17:42:33 1728322953

Honestly, that's not a very crazy pattern at all; you should be able to read that.

asicsp · 2024-10-06T11:19:49 1728213589

Seems like the link should be https://backreference.org/2010/02/10/idiomatic-awk/ instead of https://backreference.org/index.html

StefanBatory · 2024-10-06T11:23:54 1728213834

Yes - but I am sure I sent the first link.

I'm not sure what has happened then.

kencausey · 2024-10-06T17:29:51 1728235791

I think that it is this in the head:

  <link rel="canonical" href="index.html">

I had to stop submitting ISC SANS posts because they similarly have a canonical link in their head, which HN, for reasons I'm not sure about, uses to 'correct' the URL, erroneously.

dang · 2024-10-07T02:31:36 1728268296

Right you are!

I can turn that off for specific domains - if you email the relevant info to hn@ycombinator.com I'll take a look.

kencausey · 2024-10-07T19:46:14 1728330374

Thanks. I'm not sure it is going to be necessary. I checked the ISC diary site again yesterday and I think they have corrected their 'link rel="canonical"' header. I will try to submit one again when I see an interesting one.

mananaysiempre · 2024-10-06T12:53:43 1728219223

Also, (2010).

Lammy · 2024-10-06T18:23:24 1728239004

Wait until you find out how old AWK is

dang · 2024-10-07T02:31:45 1728268305

Fixed now. Thanks!

elteto · 2024-10-06T12:48:30 1728218910

I wanted to like awk and tried really hard but in the end was disappointed by what I see as unnecessary complications or limitations in the language. For example, it has first class support for regexes but only for matching. You can’t do ‘s/foo/bar’. I also found string manipulation to be cumbersome with the string functions. I would have expected a string processing language to have better primitives for this. And function arguments/variables are just a mess, it’s hard to understand how they came up with that design. It’s also quirky and unintuitive in some places you would not expect. Take the non-working example from the article:

    awk -v FS=';' -v OFS=',' 1

I expect this to change the separator in the output. Period. The “efficiency” argument for why it doesn’t work just doesn’t cut it for me. First, it’s very simple to do a one-time comparison of FS and OFS; if they are different then you know you _have_ to perform the change, because the user is asking you! If I do this in reality and it doesn’t work I just switch over to sed or perl and call it a day.

All in all, perl -pe is a better awk. And for data processing I switched to miller. It has its idiosyncrasies as well but it’s much better for working with structured records.
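
For reference, the usual workaround for the FS/OFS example above is to assign to a field, which makes awk rebuild $0 with OFS. A minimal sketch, where the resep name and the sample line are invented for illustration:

```shell
# With no field assignment, $0 is printed exactly as read, so OFS is never used.
echo 'a;b;c' | awk -v FS=';' -v OFS=',' 1

# Assigning any field (even to itself) forces $0 to be rebuilt with OFS.
# The trailing 1 prints the record whatever the first field contains.
resep() { awk -v FS=';' -v OFS=',' '{ $1 = $1 } 1'; }

echo 'a;b;c' | resep
```

Writing the assignment as an action followed by a bare 1, rather than using `$1=$1` as the pattern, keeps lines whose first field is empty or 0 from being dropped.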

shakna · 2024-10-06T13:32:39 1728221559

Awk does support that kind of pattern matching for replacement...?

    { gsub(/\;/, ","); print }

CRConrad · 2024-10-16T09:46:28 1729071988

> And for data processing I switched to miller.

Damn those far-too-common-word names! Betcha if I googled that, I'd get page after page of results pointing to actor Johnny Miller... Any link?

[EDIT:] Naah, sorry, just had to search for "miller language" (and remove the superfluous "academy" from "miller language academy" that Google "helpfully" added), and found: https://miller.readthedocs.io/en/latest/miller-programming-l... . Thanks, interesting!