This is a great article and a fun read. A friend sent it over to me and I wrote him some notes about my thoughts. Having since realised it's on HN, I thought I'd post them here as well.
Some of my wording is a bit terse; sorry! :-) The article is great and I really enjoyed it. He's certainly got the right solution for the particular task at hand (which I think is his entire point) but he's generally right for the wrong reasons so I pick a few holes in that: I'm not trying to be mean. :-)
-----
Classic system sizing problem!
1.75GiB will fit in the page cache on any machine with >2GiB RAM.
One of the big problems is that people really don't know what to expect
so they don't realise that their performance is orders of magnitude
lower than it "should" be.
Part of this is because the numbers involved are really large:
1,879,048,192 bytes (1.75GiB) is an incomprehensibly large number.
2,600,000,000 times per second (2.6GHz) is an incomprehensibly large
number of things that can be done in the blink of an eye.
...But if you divide them using simple units analysis, things divided
by things per second gives you seconds: 1,879,048,192 / 2,600,000,000
is about 0.72 seconds. That's assuming that you can process 1 byte per
clock cycle, which might be reasonable if the data is small and fits in
cache. If we're going to be doing things in streaming mode then we'll
be limited by memory bandwidth, not clock speed.
http://www.techspot.com/review/266-value-cpu-roundup/page5.h... reckons
that the Intel Core 2 Duo E7500 @ 2.93GHz has 7,634MiB/s of memory
bandwidth for reads. That's 8,004,829,184 bytes per second.
Which means we should be able to squeeze our data through the processor
in...
bytes divided by bytes per second = seconds =>
>>> 1879048192.0 / 8004829184
0.2347393...
so a fraction of a second for a single pass over the data.
We probably want to assume that there are other stalls and overheads
(the shell tools won't sustain anything like a byte per cycle, and
there are several processes in the pipeline), so something from a few
seconds up to a few tens of seconds seems reasonable for that workload
(he gets 12): the article says it's just a summary plus aggregate
workload so we don't really need to allocate much in the way of
arithmetic power.
As with most things on x86, memory bandwidth is usually the bottleneck.
If you're not getting numbers within an order of magnitude or so of
memory bandwidth then either you have an arithmetic workload (and you
know it) or you have a crap tool.
Due to the memory fetch patterns and latencies on x86, it's often
possible to reorder your data accesses to get a nominally arithmetic
workload close to the speed you'd expect from memory bandwidth.
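If you want a rough feel for how fast your machine can push bytes
around, without installing a proper benchmark, something like the
following gives a ballpark figure. It times a kernel copy from
/dev/zero to /dev/null rather than true DRAM read bandwidth, so treat
the number loosely:

    # GNU dd prints a throughput figure when it finishes
    dd if=/dev/zero of=/dev/null bs=1M count=2048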
His analysis about the parallelisation of the workload due to shell
commands is incorrect. The speedup comes from reading the data straight
out of the page cache.
His analysis about loading the data into memory on Hadoop is incorrect.
The slowdown in Hadoop probably comes from the memory copying,
allocation and GC involved in transforming the raw data from the page
cache into objects in the language that Hadoop is written in and then
throwing them away again. That's just a guess: you do want memory to
fill up (to about 1.75GiB) so that you don't have to go to disk, but
that memory is held by the OS rather than by the userland apps.
His conclusion about how `sleep 3 | echo "Hello"` is handled is
incorrect. They're "done at the same time" because the shell starts
both processes of the pipeline at once and echo never reads its stdin,
so it doesn't wait on sleep at all. A tool like sort has to ingest all
the data before it can emit anything because that's the nature of the
algorithm. A tool like cat (or uniq) will give you line-by-line flow
because it can, but the pipeline is strictly serial in nature and might
stall wherever a stage has to buffer.
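A minimal sketch of both behaviours from a shell prompt (the printf
data is made up, not anything from the article):

    # "Hello" appears immediately, even though the shell waits ~3s for
    # the whole pipeline, because echo never reads from the pipe
    sleep 3 | echo "Hello"

    # a stage that must consume all of its input (sort) can't emit
    # anything until its producer exits, so this is silent for ~3s
    (sleep 3; printf 'b\na\n') | sort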
He claims that the processing is "non-IO-bound" but also encourages the
user to clear the page cache. Clearing the page cache forces the
workload to be IO bound by definition. The page cache is there to "hide"
the IO bottlenecks where possible. If you're doing a few different
calculations using a few different pipelines then you want the page
cache to remain full as it will mean that the data doesn't have to be
reread from disk for each pipeline invocation.
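You can see the difference directly (the filename is hypothetical, and
the drop_caches trick is Linux-specific and needs root):

    # warm cache: the data comes straight out of RAM
    time wc -l access.log

    # drop the page cache, forcing the next read to go back to the disk
    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

    # cold cache: the same command is now IO bound
    time wc -l access.log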
For example, when I ingest photos from my CF card, I use "cp" to get the
data from the card to my local disk. The card is 8GiB. I have 16GiB of
RAM. That cp usually reads ahead almost the whole card and then
bottlenecks on the write part of getting it onto disk. That data then
sits in RAM for as long as it can (until the memory is needed by
something else) which is good because after the "cp" I invoke md5sum to
calculate some checksums of all the files. This is CPU bound and runs
way faster than it would if it was IO bound due to having to reread all
that data from disk. (My arrangement is still suboptimal, but it's an
example of how I can get advantages from the architecture without doing
early optimisation in my app: my ingest script is "fast enough", and I
can just about afford to do the md5sum later because I can be almost
certain it will read the exact same data that came off the card, still
sitting in the page cache, rather than rereading the copy from disk,
which, theoretically, might read differently.)
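Roughly, the shape of it (paths made up; this is a sketch, not my
actual script):

    # copy everything off the card: this part is IO bound on the write side
    cp -a /media/cf-card/DCIM/. ~/photos/incoming/

    # checksum the copies: the data is still in the page cache from the
    # copy, so this runs CPU bound instead of rereading it all from disk
    md5sum ~/photos/incoming/* > ~/photos/incoming.md5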
He's firmly in the realm of "small data" by 4 or 5 base-10 orders of
magnitude (at least), so he's nowhere close to getting a "scaling curve"
that will tell him where best to optimise for the general case. When he
starts getting to workloads 2 or 3 orders of magnitude bigger than what
he's doing, he might find that a certain class of optimisations presents
itself, but that probably still won't be the "big data" general case.
Having said that, this makes his approach entirely appropriate for the particular task at hand (which I think is his entire point).
Through his use of xargs he implies (but does not directly acknowledge)
that he realises this is a so-called "embarrassingly parallel" problem.
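Something of this shape (not his actual commands; the file layout and
the final awk are made up) does the per-file work in parallel and only
goes serial for the combining step:

    # run one wc per file, up to four at a time (the embarrassingly
    # parallel part), then a single serial stage combines the counts
    find data/ -type f -print0 \
      | xargs -0 -n1 -P4 wc -l \
      | awk '{total += $1} END {print total}'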
-----