I'd say this is more relevant now than in 2008. With SSDs and 100 Gbit network connections, it's a lot easier to saturate a CPU with data. And a lot of cloud providers now recommend colocating compute as close to the data as possible.
It's both easier and harder. Easier in that raw storage bandwidth has increased faster than single-core performance. Harder in that to actually reach that bandwidth you have to optimize around latency, which has not decreased; in the cloud it has actually increased. I.e. cloud storage can be extremely fast, but only if you drive it with many streams in parallel. Similarly, cloud compute can be fast, but only across many threads.
But the main consequence of the (2008) date is that a lot of the links are broken and the tools mentioned are outdated. The general concept, "you can probably optimize your storage access", is still true, but such general concepts usually are; it's the details around the current caveats that are interesting.
This is what's really interesting to me right now. A 1 TB 980 Pro SSD (I know, not a "data center" SSD, but it's the most recent Gen4 SSD I've seen announced) can handle ~300 MB/s [1] of random writes at realistic queue depths up to 32. Previously, with a 10 Gbit (1.25 GB/s) connection, 4 of those SSDs could handle the incoming write traffic, which is no sweat and can physically fit in a server. But as you go to 100 Gbit (12.5 GB/s) of incoming random writes to a storage server, something's gotta give. At this level of traffic, you'd need roughly 41 of those SSDs to handle the incoming traffic without a write buffer on the server. This is a simplification of a very complex topic, but it's what I/O bound means to me, at least.
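A back-of-the-envelope check of those figures, as a rough Python sketch (it just reuses the ~300 MB/s per-drive number from the review above; nothing here is measured):

    # Figures taken from the comment above, not measured here.
    ssd_random_write = 300e6        # ~300 MB/s sustained random writes per drive
    link_10g = 10e9 / 8             # 10 Gbit/s  -> 1.25 GB/s
    link_100g = 100e9 / 8           # 100 Gbit/s -> 12.5 GB/s

    print(link_10g / ssd_random_write)   # ~4.2 drives to absorb 10 Gbit
    print(link_100g / ssd_random_write)  # ~41.7 drives to absorb 100 Gbit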
Note that the numbers presented in that review are for single-threaded benchmarks, when the current best practices for a server application that really has a lot of write traffic would be to split the work over multiple threads, and preferably use io_uring for both network and disk IO. Even on those consumer-grade SSDs, that would allow for much higher throughput than what's presented in that review—though most of those consumer-grade SSDs wouldn't be able to sustain such write performance for very long.
Right now, a lot of software that wasn't written specifically with SSDs in mind is effectively bottlenecked on the CPU side with system call overhead, or is limited by the SSD's latency while using only a fraction of its potential throughput because the drive is only given one or a few requests to work on at a time.
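To make the queue-depth point concrete, here's a minimal Python sketch. It's an illustration only: it uses plain threads and pwrite rather than io_uring, and the path, block size, and counts are made-up values. The point is simply that many outstanding positional writes keep the drive busy, whereas a single synchronous write loop leaves it mostly idle:

    import os
    import concurrent.futures

    PATH = "/tmp/io_demo"      # hypothetical target file
    BLOCK = 4096               # 4 KiB writes
    QUEUE_DEPTH = 32           # number of concurrent writers
    WRITES_PER_WORKER = 1000

    def worker(offset_base: int) -> None:
        # Each worker holds its own fd and issues positional writes, so the
        # kernel and device see many requests in flight. os.pwrite releases
        # the GIL during the syscall, so the threads genuinely overlap.
        fd = os.open(PATH, os.O_WRONLY | os.O_CREAT)
        buf = os.urandom(BLOCK)
        try:
            for i in range(WRITES_PER_WORKER):
                os.pwrite(fd, buf, offset_base + i * BLOCK)
        finally:
            os.close(fd)

    stride = WRITES_PER_WORKER * BLOCK
    with concurrent.futures.ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
        list(pool.map(worker, [w * stride for w in range(QUEUE_DEPTH)]))

An io_uring version would go further by also stripping out most of the per-request syscall overhead, but the "keep many requests in flight" idea is the same.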
Wow, the author himself! Thanks for the great details. I missed the detail in the article about single-threaded testing, so thanks for clarifying. I definitely see that the specs for the Samsung drive show higher peak random IOPS, but I'm always trying to look past the marketing hype and see what's more realistic. What allows an enterprise-grade SSD to sustain writes longer than a consumer-grade SSD? I assume it's cache related (SLC or DRAM), but I'm still learning.
Enterprise SSDs often (but not always) reserve more spare area out of the pool of flash on the drive, which helps a lot with sustained write speeds, especially random writes; the garbage collector is under less pressure to free up erase blocks for new writes.
Consumer SSDs use SLC caching which allows for much higher write speeds until that cache runs out, but then the drive is stuck with the job of cleaning up that cache while also still handling new writes. So for the same amount of raw NAND capacity and same usable capacity, an enterprise drive that doesn't do any SLC caching at all will tend to have a better long-term write speed than the post-cache write speed on a consumer drive.
Once you factor in the latency to and through the storage server, the IOPS at realistic queue depths fall by an enormous factor. This is usually balanced by putting tons of servers on it, at which point 41 NVMe drives isn't really a lot - it will usually fit in a single rack-mount storage enclosure.
Bandwidth isn't the problem, latency is. Each server could have 400 Tbit interfaces and you'd still get garbage IOPS at reasonable queue depths compared to local PCIe on a laptop.
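A rough way to see it is a quick Little's-law style sketch: sustained IOPS for one stream is roughly queue depth divided by per-request latency, so every added microsecond of round trip directly caps what low queue depths can deliver. The latency numbers below are illustrative assumptions, not measurements:

    def iops(queue_depth: int, latency_s: float) -> float:
        # Little's law: requests in flight / time per request
        return queue_depth / latency_s

    local_nvme = 100e-6   # assume ~100 us for a local PCIe NVMe request
    remote     = 1e-3     # assume ~1 ms once a network round trip is added

    for qd in (1, 32):
        print(qd, iops(qd, local_nvme), iops(qd, remote))
    # QD 1:  ~10,000 IOPS local vs ~1,000 IOPS remote
    # QD 32: ~320,000 IOPS local vs ~32,000 IOPS remote

The remote path only catches up by piling on queue depth or streams, which is exactly what most applications don't do.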
Can you help me understand why latency kills IOPS? There's lots of work showing NVMe over TCP delivering performance and latency from a remote server similar to a local SSD. Here's an example:
Interestingly, for most people the network is just as slow today as it was 10 years ago, because 10/100 GbE has stalled in the datacenters and the trickle-down that used to happen seems to have stopped. (Gigabit Ethernet was already 10 years old in 2008.)