> incl. an acid db. Why would you want to use a database for this problem? The i...

_delirium · on Jan 19, 2015

If it really is just a one-shot with one simple-ish filter, I agree. But I often find myself incrementally building shell-pipeline tangles that are sped up massively by being replaced with SQLite. Once your processing pipeline is making liberal use of the sort/grep/cut/tee/uniq/tac/awk/join/paste suite of tools, things get slow. The tangle of Unix tools effectively does repeated full-table scans without the benefit of indexes, and is especially bad if you have to re-sort the data at different stages of the pipeline, e.g. on different columns, or need to split and then re-join columns in different stages of the pipeline. In that kind of scenario a database (at least SQLite, haven't tried a more "heavyweight" database) ends up being a win even for stream-processing tasks. You pay for a load/index step up front, but you more than get it back if the pipeline is nontrivial.
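A minimal sketch of that load-once, query-many pattern, using Python's standard `sqlite3` module. The table name `events`, its columns, and the sample rows are hypothetical and chosen purely for illustration; a real pipeline would bulk-load from its own files instead.

```python
import sqlite3

# Hypothetical sample data: (user, status, bytes). Stands in for the
# lines a shell pipeline would otherwise re-read and re-sort per stage.
SAMPLE_ROWS = [
    ("alice", 200, 512),
    ("bob", 404, 0),
    ("alice", 200, 1024),
    ("carol", 500, 128),
    ("bob", 200, 256),
]


def load_events(rows):
    """Pay the load/index cost once, up front."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user TEXT, status INTEGER, bytes INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    # One index per column we group or filter on, so later stages
    # do not each need their own sort pass.
    conn.execute("CREATE INDEX idx_events_user ON events (user)")
    conn.execute("CREATE INDEX idx_events_status ON events (status)")
    conn.commit()
    return conn


def bytes_per_user(conn):
    """Roughly what a sort | awk stage keyed on the first column would do."""
    return conn.execute(
        "SELECT user, SUM(bytes) FROM events GROUP BY user ORDER BY user"
    ).fetchall()


def count_by_status(conn):
    """Roughly what cut | sort | uniq -c on the second column would do."""
    return conn.execute(
        "SELECT status, COUNT(*) FROM events GROUP BY status ORDER BY status"
    ).fetchall()


if __name__ == "__main__":
    conn = load_events(SAMPLE_ROWS)
    print(bytes_per_user(conn))
    print(count_by_status(conn))
```

Both queries reuse the same loaded table, whereas the equivalent shell stages would each re-sort the input on a different column. Whether this actually beats a given pipeline depends on data size and how many stages there are, so it is worth timing both.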

zobzu · on Jan 19, 2015

The interesting part is that it's still faster, not that it's the best-case solution. The main reason is that the data set fits in memory and is no slower to load (you need to read the data in all cases, duh; both the piped and the DB approach read the data from disk exactly once, sequentially).

There is no locking issue, and you can be smart in the filtering steps (most DBs do some of that automagically anyway). You don't have that level of control with pipes: you are limited by each program's ability to process stdin, plus additional locking.

This is exactly where knowing how things really work under the hood gives you an advantage over "but in theory...". You could reimplement a complete program, or even a set of programs, that outperforms both the DB and the piped example. But will you? No, you want the best balance between the fastest solution and the least amount of work.