In the final solution at the end of the article, there are only two pipes:
1. A pipe to feed the file names into xargs for starting up parallel `mawk` processes.
2. A pipe to a final `mawk` process which aggregates the data from the parallel processes.
There's still some performance that could be gained by using a single process with threads and shared memory, but this is pretty good for something that can be whipped together quickly.
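A minimal sketch of that two-pipe shape, assuming the workload is counting lines matching a pattern across many files; the file pattern and the per-file mawk program are illustrative placeholders, not the article's exact command:

```sh
find . -type f -name '*.pgn' -print0 |
  xargs -0 -n1 -P4 mawk '        # pipe 1: file names -> parallel mawk workers
    /Result/ { n++ }
    END { print n }
  ' |
  mawk '{ total += $1 }          # pipe 2: per-worker counts -> one aggregator
    END { print total }'
```

Each worker prints a single summary line, so the final mawk process only has to sum a handful of numbers rather than re-scan the data.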
Yeah, it's not bad. The final command basically leverages mawk for everything, which works out well since there are fewer pipes.
But in this case it's about replacing Hadoop with mawk, basically. Which is indeed a good point as well - and incidentally also confirms my own comment =)