MapReduce came along at a moment when going horizontal was -essential-. Storage had kept increasing faster than CPU and memory, and CPUs in the aughts hit two significant hitches: the 32-bit to 64-bit transition and the multicore transition. As always, software lagged these hardware transitions; you could put 8 or 16GB of RAM in a server, but good luck getting Java to use it. So there was a period of several years where the ceiling on vertical scalability was both quite low and absurdly expensive. Meanwhile, hard drives and the internet got big.
For MapReduce specifically, one of the big issues was the speed at which you could read data from an HDD and transfer it across the network. The MapReduce paper benchmarks were done on machines with 160 GB HDDs (roughly 3x smaller than a typical NVMe SSD today) which had sequential reads of maybe 40MB/s (100x slower than an NVMe drive today!) and random reads of <1MB/s (also very much slower than an NVMe drive today).
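Just as a rough back-of-envelope sketch of what those numbers imply (the figures below are the ballpark ones quoted above, treated as assumptions, not the paper's exact measurements):

    # Back-of-envelope: scanning a full 2004-era disk at the rates quoted above.
    # All figures are rough assumptions for illustration.
    disk_gb = 160            # assumed disk size per machine
    seq_mb_s = 40            # assumed sequential read throughput
    rand_mb_s = 1            # assumed effective random-read throughput

    disk_mb = disk_gb * 1000
    print(f"sequential full scan: {disk_mb / seq_mb_s / 3600:.1f} hours")   # ~1.1 hours
    print(f"random-access scan:   {disk_mb / rand_mb_s / 3600:.1f} hours")  # ~44 hours

So even a pure sequential scan of a single disk took over an hour, and anything seek-heavy was hopeless on one machine.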
On the other hand, they had 2GHz Xeon CPUs!
Table 1 in the paper suggests that average read throughput per worker for typical jobs was around 1MB/s.
Maybe more like 80MB/s? But yeah, good point: sequential reads were many times faster than random access, yet on a single disk, sequential transfer rates still weren't keeping up with the growth in storage capacity or CPU speed. MapReduce/Hadoop gave you a way to have lots of disks operating sequentially, in parallel.
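To make that last point concrete, here's a rough sketch of the aggregate math (per-disk rate and cluster size are illustrative assumptions, loosely based on the figures above):

    # Rough illustration: aggregate sequential throughput of many cheap disks.
    # Per-disk rate and cluster size are illustrative assumptions.
    seq_mb_s_per_disk = 40        # assumed 2004-era sequential read rate
    workers = 1000                # assumed cluster size, one disk per worker
    dataset_mb = 1_000_000        # a 1TB dataset

    aggregate_mb_s = seq_mb_s_per_disk * workers       # 40,000 MB/s across the cluster
    print(f"cluster scan of 1TB: ~{dataset_mb / aggregate_mb_s:.0f} s")            # ~25 s
    print(f"single disk, 1TB:    ~{dataset_mb / seq_mb_s_per_disk / 3600:.1f} h")  # ~7 h

The individual disks are still slow; you just stop caring once the scan is spread across a thousand of them.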
I’ll add context that NUMA machines with lots of CPUs and RAM used to cost six to seven figures. Some setups were eight figures. They ran proprietary software, too.
People came up with horizontal scaling across COTS hardware, often called Beowulf clusters, to get more compute for less money. They’d run UNIX or Linux with a growing collection of open-source tools, and they could get the most out of their compute by customizing it to their needs.
So vertical scaling was exorbitantly expensive at the time, and less flexible, too.