Every time I see things like this, I feel like the person must be unaware of awk.
# the original one-liner to get unique IP addresses
cut -d' ' -f 1 access.log | sort | uniq -c | sort -rn | head
# turns into this with GNU awk
gawk '{PROCINFO["sorted_in"] = "@val_num_desc"; a[$1]++} END {c=0; for (i in a) if (c++ < 10) print a[i], i}' access.log
It's also far, far faster on larger files (base-spec M1 Air):
$ wc -lc fake_log.txt
1000000 218433264 fake_log.txt
$ hyperfine "gawk '{PROCINFO[\"sorted_in\"] = \"@val_num_desc\"; a[\$1]++} END {c=0; for (i in a) if (c++ <10) print a[i], i}' fake_log.txt"
Benchmark 1: gawk '{PROCINFO["sorted_in"] = "@val_num_desc"; a[$1]++} END {c=0; for (i in a) if (c++ <10) print a[i], i}' fake_log.txt
Time (mean ± σ): 1.250 s ± 0.003 s [User: 1.185 s, System: 0.061 s]
Range (min … max): 1.246 s … 1.254 s 10 runs
$ hyperfine "cut -d' ' -f1 fake_log.txt | sort | uniq -c | sort -rn | head"
Benchmark 1: cut -d' ' -f1 fake_log.txt | sort | uniq -c | sort -rn | head
Time (mean ± σ): 4.844 s ± 0.020 s [User: 5.367 s, System: 0.087 s]
Range (min … max): 4.817 s … 4.873 s 10 runs
Interestingly, GNU cut is significantly faster than BSD cut on the M1:
$ hyperfine "gcut -d' ' -f1 fake_log.txt | sort | uniq -c | sort -rn | head"
Benchmark 1: gcut -d' ' -f1 fake_log.txt | sort | uniq -c | sort -rn | head
Time (mean ± σ): 3.622 s ± 0.004 s [User: 4.149 s, System: 0.078 s]
Range (min … max): 3.616 s … 3.629 s 10 runs
The overwhelming cost of the first shell pipeline, at least on my machine, is caused by the default UTF-8 locale. As I have found in almost every other case, `LC_ALL=C` radically speeds this up.
By the way, these changes immediately suggested themselves after running the pipeline under `perf`. Profiling is always the first step in optimization.
Collation aside (disabling it is absolutely a huge speed boost that I neglected to think about), I assumed that the rest of the difference was coming from the fact that the initial `cut` meant the rest of the pipeline had far less to deal with, whereas `awk` is processing every line. Benchmarking (and testing in `perf`) showed this not to be the case. I'd need to compile `awk` with debug symbols, I think, to know exactly where the slowdown is, but I'm going to assume it's mostly due to `sort` being extremely optimized for doing one thing and doing it well.
I did find one other interesting difference between BSD and GNU tools - BSD sort defaults its buffer to 90% of available memory, while GNU sort defaults to 1024 KiB.
Combining all of these (and using GNU uniq - it was also faster), I was able to get down to 463 msec on the M1 Air:
$ hyperfine "export LC_ALL=C; gcut -d' ' -f1 fake_log.txt | gsort -S5% | guniq -c | gsort -rn -S5% | head"
Benchmark 1: export LC_ALL=C; gcut -d' ' -f1 fake_log.txt | gsort -S5% | guniq -c | gsort -rn -S5% | head
Time (mean ± σ): 463.4 ms ± 3.3 ms [User: 965.5 ms, System: 93.3 ms]
Range (min … max): 459.9 ms … 469.8 ms 10 runs
Parent poster used the 5% value, so I did as well. But yes, to answer your question - shown here on a fairly old Debian 11 system:
$ hyperfine "export LC_ALL=C; cut -d' ' -f1 fake_log.txt | sort -S5% | uniq -c | sort -rn -S5% | head"
Benchmark 1: export LC_ALL=C; cut -d' ' -f1 fake_log.txt | sort -S5% | uniq -c | sort -rn -S5% | head
Time (mean ± σ): 1.504 s ± 0.318 s [User: 2.833 s, System: 0.474 s]
Range (min … max): 0.942 s … 1.937 s 10 runs
$ hyperfine "export LC_ALL=C; cut -d' ' -f1 fake_log.txt | sort | uniq -c | sort -rn | head"
Benchmark 1: export LC_ALL=C; cut -d' ' -f1 fake_log.txt | sort | uniq -c | sort -rn | head
Time (mean ± σ): 3.847 s ± 0.093 s [User: 4.165 s, System: 0.613 s]
Range (min … max): 3.591 s … 3.919 s 10 runs
Setting the buffer value to ~half the file size (100 MB) resulted in a mean time of 2.291 seconds. Setting it to the size of the file resulted in a mean time of 1.549 seconds, which is close enough to the 5% run to call it equal - this server is busy with other stuff anyway, so it's hardly a good place for precise benchmarking.
Syscalls aren't free, nor are disk reads, so if you have the RAM to support slurping the entire file at once, it's sometimes faster.
I don't understand the downvotes. This is a fair criticism. The author even points out "programs as pipelines" which is literally the UNIX philosophy. There are tools that already exist on UNIX-likes that more people should use instead of reaching for a script.
I can sympathize with the author w.r.t wanting to use a single language you like for everything. However, after decades I've found this to be untenable. There are languages that are just simply better for one-off scripting (Perl, Python), and languages that aren't (anything compiled). Trying to bolt an interpreter onto a compiled language from the outside seems like a lot of work for questionable gain.
> There are languages that are just simply better for one-off scripting (Perl, Python), and languages that aren't (anything compiled). Trying to bolt an interpreter onto a compiled language from the outside seems like a lot of work for questionable gain.
One reason is deployment. Writing code in python/node/etc... implies the ability of the production environment to bootstrap a rather complicated installation tree for the elaborate runtimes required by the code and all its dependencies. And so there are elaborate tools (npm, venv, Docker, etc...) that have grown up around those requirements.
Compiled languages (and Go in particular shines here) spit out a near-dependency-free[1] binary you can drop on the target without fuss.
I deal with this in my day job pretty routinely. Chromebooks have an old python and limited ability to pull down dependencies for quick test runs. Static test binaries make things a lot easier.
[1] Though there are shared libraries and runtime frameworks there too. You can't deploy a Gnome 3 app with the same freedom you can a TCP daemon, obviously.
> Compiled languages (and Go in particular shines here) spit out a near-dependency-free[1] binary you can drop on the target without fuss
I think it's more accurate to say "static binaries" instead of "compiled languages." The same headaches exist with dynamically linked, compiled binaries (and sometimes they're worse since you don't have a dependency manager, unless you add one)
> Static test binaries make things a lot easier
I think this really depends on your target environment and how much control you have over it. If you're in a Ruby or Python shop, for example, all your servers already have the stack installed. If you're targeting end user devices, those can have a huge mess of different configs to account for
Meh. In my experience version churn with shared library dependencies is pretty minor. You have to worry about and work at it if you're doing stuff like deploying a single binary across a bunch of different linux distros. But the straightforward case of "build on your desktop and copy the file up to the host" is a routine thing you can expect to work.
It's nothing like that with Python or Node. The rule there is that you get something working locally and then spend a while reverse engineering a pip/venv recipe or manifest or whatever to make it work somewhere else. It's decidedly non-trivial.
The 'scripting' vs 'compiled' language distinction is a false dichotomy. Awk, Perl, and Python are compiled programs too. What makes a 'scripting' language special? Dynamic typing? The lack of a compile step/delay?
I could imagine a lifetime of collecting scripting macros/libs in Lisp being as good or better.
However, the reason Bash is so prolific amongst sysadmins such as myself is the fact that its scripts are portable and reliable to use across Debian, Arch or RHEL based distributions.
You don't have to import extra libraries, ensure that you are running the proper python environment, or be certain that pip is properly installed and configured for whatever extra source code beyond what is included out of the box.
Bash is the most consistent code you can write to perform any task you need when you have to work with Linux.
Python is (at least in the CPython implementation) compiled - to Python bytecode, which runs on the Python virtual machine.
It's not compiled to native code. (Unless you use one of the compilers which do compile it to native code, though they tend to support only a subset of Python.)
Another commenter beat me to it but still: sh / bash / zsh are quite fine up until certain complexity (say 500 lines), after which adding even a single small feature becomes a huge drag. We're talking hours for something that would take me 10 minutes in Golang and 15 in Rust.
I can actually agree with this take. Most of the opinions I've seen in this vein take some absurdly small limit, like 5 lines. 500, though? Yeah. My team rewrote a ProxySQL handler in Python because the bash version had gotten out of hand, and there were only a handful of people who could understand what it was doing. It passed 100% of its ShellCheck checks, and was as modular as it could possibly be, but modifying it was still an exercise in pain.
> portable and reliable to use across Debian, Arch or RHEL based distributions
Until you try to use a newer feature, or try the script on a Mac or BSD, or with any older bash.
SH code is completely portable, but bash itself can have quite a few novel features. Don’t get me wrong - I’m happy the language is dynamic and still growing. But it can make things awkward when trying to use a script from a newer system on an older server (and the author has been “clever”).
> The 'scripting' vs 'compiled' language distinction is a false dichotomy.
Not false, but perhaps in need of better definition. The term script has often denoted a trivial set of commands run by $interpreter.
"Scripting languages" have been seen as being in contrast to C, C++, Pascal, Java, SmallTalk, &c. The scripting languages remove from the user the need:
Closer to the truth is that static typing is a nuisance to a sole dev working on a short time horizon. Many successful startups get stuck with overgrown 'scripts' as platforms because they started as one-man programming shops.
I do have to add that Python, more than any other language I've used, results in code that works on the first try so often that it's no longer surprising.
> The author even points out "programs as pipelines" which is literally the UNIX philosophy.
Yes - and if the thing I'm trying to do has a small input, will only be done once, and so on, I will often just pipe `grep` to `sort` or whatever, because it's less typing and it's generally clearer to a wider range of people.
But on larger inputs, or even things like doing a single pattern inversion mixed with a pattern match, I like awk.
One reason the author could be doing this is to reduce dependencies. Maybe they deploy to Windows or to some other environment not guaranteed to have those utilities. Also testing probably gets simplified.
And every time I see things like that, I feel like the person must be unaware of perl.
I've made this point before, but I still find it hilarious. For more than a decade, awk was dead. Like, dead dead. There was nothing you could do in awk that wasn't cleaner and simpler and vastly more extensible in perl. And, yes, perl was faster than gawk, just like gawk is faster than shell pipelines.
Then python got big, people decided that they didn't want to use perl for big projects[1], and so perl went out of vogue and got dropped even for the stuff it did (and continues to do) really well. Then a new generation came along having never learned perl, and...
... have apparently rediscovered awk?
[1] Also the perl 5 tree stagnated[2] as all the stakeholders wandered off into the weeds to think about some new language. They're all still out there, AFAIK.
[2] Around 2000-2005, perl was The Language to be seen writing your new stuff in, so e.g. bioinformatics landed there and not elsewhere. But by 2015, the TensorFlow people wouldn't be caught dead writing perl.
Perl never recovered from its "many ways to do things" label. It's a tired criticism of the language, but it's lodged in the brains of a generation of programmers, which is unfortunate.
Also, the classic sysadmin role, which used to lean on Perl heavily, sort of evolved with the rise of The Cloud, and automation tools like Chef, Puppet, and Ansible took over in that 2005-2015 time frame.
I am in the "awk > perl" camp. I think the idea of "vastly more extensible" is a negative for my scripting language, and "cleaner" just doesn't matter - I just want to write it the one time I want to use it and then be done with it. The awk language is really simple and quick to write.
By the way, I think this is why Perl lost to Python on larger scripting and programming projects - it's just easier to write (albeit harder to read, to antagonize the Python lovers out there).
I learned perl around that time, and I thought it was awful - and I mean just about everything about it: the parameter passing, the sigils that made BASIC look like Dijkstra's love child, the funky array/scalar coercion, and the bloody fact that it couldn't read from two files at once even though the docs suggested it should work. They didn't say so explicitly, because perl was pretty badly documented. My boss started writing object oriented perl, and that made perl unreadable even to perl experts.
AWK, on the other hand, is simplicity itself. Sure, it misses a few things, but for searching through log files or db dumps it's an excellent tool. And it's fast enough. If you really need much more speed, there are other tools, but I would rather rewrite it in C than try perl again.
They taught awk to my boy in bioinformatics as part of his degree. I was like Vito Corleone in the funeral home when he showed me the FASTA parsing awk code they were working on.
I mostly use awk over perl because awk is completely documented in one man page, so it's easy to see whether awk will be fit for purpose or whether I should write it using a real programming language. I learned Perl over a decade ago, but not the really concise dialect you would use on the command line for stuff I'd use awk for, and I've forgotten almost all of it now. At least with awk it's easy to relearn the functions I need when I need it.
Right, which is sort of my point. 20 years ago, "everyone" knew perl, at least to the extent of knowing the standard idioms for different environments that you're talking about. And in that world, "everyone" would choose perl for these tasks, knowing that everyone else would be expert enough to read and maintain them. Perl was the natural choice.
And in a world where perl is a natural choice for these tasks, awk doesn't have a niche. Because at the end of the day awk is simply an inferior language.
Which is the bit I find funny: we threw out and forgot about a great tool, and now we think that the ancestral toy it replaced is a good idea again.
That's a fair criticism. I know Perl can do pretty amazing things with text, but I've never bothered to learn it.
EDIT: I decided to ask GPT-4 to translate the gawk script to Perl. I make zero claims that this is ideal (as stated, I don't know Perl at all), but it _does_ produce the same output, albeit slightly slower than the gawk script.
$ hyperfine "perl -lane '\$ips{\$F[0]}++; END {print \"\$ips{\$_} \$_\" for (sort {\$ips{\$b} <=> \$ips{\$a}} keys %ips)[0..9]}' fake_log.txt"
Benchmark 1: perl -lane '$ips{$F[0]}++; END {print "$ips{$_} $_" for (sort {$ips{$b} <=> $ips{$a}} keys %ips)[0..9]}' fake_log.txt
Time (mean ± σ): 1.499 s ± 0.006 s [User: 1.447 s, System: 0.050 s]
Range (min … max): 1.490 s … 1.507 s 10 runs
Sample of one. I came of age on Linux in the late 90s/early 00s. Through other nerds on IRC channels I became familiar with Perl and didn't like it. I also picked up basic awk in the context of one-liners for shell pipelines and it was pretty nice for that. Easier to remember than the flags for cut and friends.
Learning awk a bit more deeply in recent years has been good too. I can write one liners that do more. I shipped a full awk script once, for something unimportant, but I would never do that again. For serious text munging these days I'd rather write a Rust program.
Way to completely miss the point and turn this into a weird pissing competition (btw your "simple" awk example is super complicated and opaque to someone who doesn't have the awk man page open in front of them).
The script package looks really cool and I'll definitely try it out, because honestly, even though I do a lot of bash scripting, it's super painful for anything beyond the very simple.
If someone doesn't know awk, then of course it'll be complicated and opaque - the same is true of practically any language. One-liners in general also tend to optimize for space. If you wanted it to be pretty-printed and with variable names that are more obvious:
{
    PROCINFO["sorted_in"] = "@val_num_desc"
    top_ips[$1]++
}

END {
    counter = 0
    for (i in top_ips) {
        if (counter++ < 10) {
            print top_ips[i], i
        }
    }
}
But also, if you read further up in the thread, you'll see that another user correctly identified the bottlenecks in the original pipeline, and applying those optimizations made it about 3x as fast as the awk one. Arguably, if you weren't familiar with the tools (and their specific implementations, like how GNU sort and BSD sort have wildly different default buffer sizes), you'd still be facing the same problem.
At least half of what people complain about with shell scripts can be solved by using ShellCheck [0], and understanding what it's asking you to do. I disagree with the common opinion of "anything beyond a few lines should be a Python script instead." If you're careful with variable scoping and error handling, bash is perfectly functional for many uses.
> If someone doesn't know awk, then of course it'll be complicated and opaque - the same is true of practically any language
I don't think this is true. Before I learned Go, I could follow along most Go programs pretty well, and learning Go well enough to get started took less than an hour. Every attempt I've made to learn more Awk, I've bounced off.
Really? I learned awk by watching a one-hour YouTube video one afternoon. It being a DSL really makes it super easy to learn, and this, to me, suggests you probably haven't given it much time.
Good for you, but their point still holds up. Languages like Python and Go are more readable than awk and bash. They are designed to be that way, and many, many years of effort have been put into them for that specific purpose.
Whereas if you know awk and bash, they can be incredibly useful in a pinch. That doesn't knock how powerful they are - I think they're worth learning. But if something needs to be maintained, then there is an argument for Python/Go/whatever.
Because it's really fast to iterate on if you know it, it's available basically everywhere and has no external dependencies, and you don't have to compile it.
I don't do a lot of shell scripting type things in Go because it's not a great language for it, but when I do, I take another approach, which is just to panic. Generics offer a nice little
func Must[T any](x T, err error) T {
    if err != nil {
        panic(err)
    }
    return x
}
which you can wrap around any standard "x, err :=" function to just make it panic, and even prior to generics you could wrap a "PanicOnErr(justReturnsErr())".
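To make that concrete, here's a quick sketch of how it reads in a throwaway script (the inputs are just placeholders; only the standard library and the Must helper above are involved):

package main

import (
    "fmt"
    "os"
)

// Must is the helper from above: panic on the first error, otherwise hand back the value.
func Must[T any](x T, err error) T {
    if err != nil {
        panic(err)
    }
    return x
}

func main() {
    // Hypothetical inputs; the script simply dies at the first failed call,
    // which is often exactly what you want from a quick one-off tool.
    raw := Must(os.ReadFile("access.log"))
    home := Must(os.UserHomeDir())
    fmt.Printf("read %d bytes, home dir is %s\n", len(raw), home)
}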
In the event that you want to handle errors in some other manner, you trivially can, and you're not limited to just the pipeline design patterns, which are cool in some ways, but limiting when that's all you have. (It can also be tricky to ensure the pipeline is written in a way that doesn't generate a ton of memory traffic with intermediate arrays; I haven't checked to see what the library they show does.) Presumably if I'm writing this in Go I have some other reason for wanting to do that, like having some non-trivial concurrency desire (using concurrency to handle a newline-delimited JSON file was my major use case, doing non-trivial though not terribly extensive work on the JSON).
While this may make some people freak, IMHO the real point of "errors as values" is not to force you to handle the errors in some very particular manner, but to make you think about the errors more deeply than a conventional exceptions-based program typically does. As such, it is perfectly legal and moral to think about your error handling and decide that what you really want is the entire program to terminate on the first error. Obviously this is not the correct solution for my API server blasting out tens of thousands of highly heterogeneous calls per second, but for a shell script it is quite often the correct answer. As something I have thought about and chosen deliberately, it's fine.
If you're not familiar with Go there is one detail missing from this post (though it's in the script README) - what a complete program looks like. Here's the example from https://github.com/bitfield/script#a-realistic-use-case
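From memory, it boils down to something like this - a sketch rather than a verbatim copy of the README (Stdin/Column/Freq/First/Stdout are the package's pipe methods; the log.Fatal handling is my own addition):

package main

import (
    "log"

    "github.com/bitfield/script"
)

func main() {
    // Top ten most frequent values of the first whitespace-separated column
    // (e.g. client IPs in an access log), read from stdin.
    _, err := script.Stdin().Column(1).Freq().First(10).Stdout()
    if err != nil {
        log.Fatal(err)
    }
}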
If one were actually going to use something like this, I’d think it’d be worth implementing a little shebang script that can wrap a single-file script in the necessary boilerplate and call go run!
The whole point of using Go is to explicitly handle errors as they happen. All of these steps can fail, but it’s not clear how they fail and if the next steps should proceed or be skipped on previous failures. This is harder to reason about, debug, and write than grep and bash.
Aborting on the first error is sufficient for a quick scripting hack designed to be run interactively.
I don't see it as a lot different to bash scripts with -e and pipefail set, which is generally preferable anyway.
Plenty of Go code does
    if err != nil {
        return nil, err
    }
for each step, and there are plenty of cases where you only care -if- it failed, plus some sort of description of the failure. If you want to proceed on some errors, you'd split the pipe up so that it pauses at moments where you can check that and compensate accordingly.
(and under -e plus pipefail, "error reported to stderr followed by aborting" is pretty much what you get in bash as well, so I'm unconvinced it's actually going to be harder to debug)
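For what it's worth, a rough sketch of what "splitting the pipe up" could look like with the script package - the fallback-to-stdin logic is purely illustrative, not a pattern the package prescribes:

package main

import (
    "log"

    "github.com/bitfield/script"
)

func main() {
    // Pause the pipeline where failure matters and inspect the pipe's error.
    p := script.File("access.log").Match("error")
    if err := p.Error(); err != nil {
        // Compensate: log the problem and fall back to reading stdin instead.
        log.Printf("couldn't read access.log (%v); falling back to stdin", err)
        p = script.Stdin().Match("error")
    }

    count, err := p.CountLines()
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("%d matching lines", count)
}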
From a technical point of view, nothing prevents the scripting package from being just as informative with errors as bash, and from having a helper to log and clear the error. If that's not already the case, I'd call it a bug.
Maybe there could be a whole class of things that are like errors, but not as severe, and flags that deal with them as a group! We could call them "warnings"
If you choose to do that with warnings, that's your prerogative. I pretty much take the opposite approach personally; before putting code into review (or merging, if it's a personal project with no reviewers), every warning is inspected, and I decide whether to fix it by changing the code not to generate it or to suppress it manually in that specific location only. 99% of the time the warning is a sign of something actually wrong, but for the 1% where I know I genuinely don't care (and the very frequent occasions during development when I'm simply not finished with the implementation yet), it's much better for it not to block me completely.
I agree. When I am scripting, I want a quick feedback loop. You can't really do that with Go because it doesn't have as good introspection and debugging capabilities as a scripting language like Ruby, and it doesn't have exceptions, which means that error handling is more verbose than necessary.
Also, I like being able to make modifications on the fly: doing something in Ruby, I can just open the file, make adjustments, and I am done. With Go, I have to compile it and move it back into my path, which is really tedious.
Shell scripting is quite fine up until certain complexity (say 500-1000 lines), after which adding even a single small feature becomes a huge drag. We're talking hours for something that would take me 10 minutes in Golang and 15 in Rust.
Many people love to smirk and say "just learn bash properly, duh" but that's missing the point that we never do big projects in bash so our muscle memory of bash is always kind of shallow. And by "we" I mean "a lot of programmers"; I am not stupid, but I have to learn bash's intricacies every time almost from scratch and that's not productive. It's very normal for things to slip up from your memory when you're not using them regularly. To make this even more annoying, nobody will pay me to work exclusively with bash for 3 months until it gets etched deep into my memory. So there's that too.
I view OP as a good reminder that maybe universal-ish tools to get most of what we need from shell scripting exist even today, but we aren't giving them enough attention and energy, and we don't make them mainstream. Though it doesn't help that Golang doesn't automatically fetch dependencies when you just do `go run random_script.go`: https://github.com/golang/go/issues/36513
I am not fixating on Golang in particular. But IMO next_bash_or_something should be due Soon™. It's not a huge problem to install a single program when provisioning a new VM or container either, so I am not sure why people are so averse to it.
So yeah, nice article. I like the direction.
EDIT: I know about nushell, oilshell and fish but admittedly never gave them a chance.
The unix philosophy of having small programs that take in input, process it, and return a result has proven to be a success; I just never understood why the next logical step of having these programs in library form never became a thing. I guess shells are a bit useful, but not as useful as a decent REPL (Common Lisp or the Jupyter REPL) where these programs can be used as if they were functions.
Would love to use more Golang - amazing build system and cross compiler built in. "All in one" binaries are the best thing ever. I adore most of the ideas in the language.
.... but there are just soooo many little annoyances / inconveniences which turn me off.
- No Optional Parameters. No Named Parameters. Throw us a bone Rob Pike, it's 2023. Type inferred composite literals may be an OK compromise.. if we ever see them: https://github.com/golang/go/issues/12854
- Unused import = will not compile. Unused variable = will not compile. Give us the ability to turn that off, or at least downgrade it to a warning.
- No null safe or nullish coalescing operator. (? in rust, ?? in php, etc.)
- Verbosity of if err != nil { return err; }
- A ternary operator would be nice, and could bring if err != nil to 1 line.
- No double declarations. “no new variables on left side of :=” .. For some odd reason “err” is OK here... Would be highly convenient for pipelines, so each result doesn't need to be uniquely named.
I'd describe Go as a "simple" language- Not an "easy" language. 1-2 lines in Python is going to be 5-10 lines in golang.
Stay in the Go ecosystem, retain compatibility with the Go programs you already have, but have a much more concise scripting capability at your disposal.
When this happens, one can split access.log into pieces, process the pieces separately, then recombine.
But that's more or less what sort(1) does with large files anyway, creating temporary files in $TMPDIR (or in a user-specified directory given with -T, if using GNU sort).
There was a way to eliminate duplicate lines from an unordered list using k/q, without using temporary files but I stopped using it after Kx, Inc. was sold off and I started using musl exclusively. q requires glibc.
I have been thinking that JS template literals could be a great replacement for shell programming, allowing more powerful syntax that emulates a lot of bash's useful features while still giving you the power of a proper programming language.
Interesting. I do something similar with my task package (https://github.com/kardianos/task), which is in turn loosely based on another package from 10-15 years ago.
That sounds interesting, but the package is unfortunately undocumented. I tried https://pkg.go.dev/github.com/kardianos/task, but that doesn't help me understand it either. It's missing a high level explanation of what to use it for, its limits and some decent examples.