About ten years ago I got my hands on some of the last production FusionIO SLC cards for benchmarking. The software was an in-memory database that a customer wanted to use with expanded capacity. I literally just used the FusionIO cards as swap.
After a few minutes of loading data, the kernel calmed down and it worked like a champ. Millions of transactions per second across billions of records, on a $500 computer... and a card that cost more than my car.
Definitely wouldn't do it that way these days, but it was an impressive bit of kit.
I worked at a place where I can say FusionIO saved the company. We had a single Postgres database which powered a significant portion of the app. We tried to kick off a horizontal scaling project around it, to little success - turns out that partitioning is hard on a complex, older codebase.
Somehow we ended up with a FusionIO card in tow. We went from something like 5,000 read QPS to 300k read QPS on pgbench using the cheapest 2TB card.
Ever since then, I’ve considered reaching for vertical scale more tenable than I originally thought. It turns out hardware can do a lot more than we think.
The slightly better solution for these situations is to set up a reverse proxy that sends all GET requests to a read replica, while the server with the real database gets all of the write traffic.
But the tricky bit there is that you may need the response to a successful write to contain the result of the read it triggers. Otherwise you have to solve lag problems on the replica.
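As a rough illustration of that split (not any particular production setup - the upstream names and ports here are made up), a reverse proxy can route by HTTP method, sending GET/HEAD to the replica and everything else to the primary:

    package main

    import (
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    func main() {
        // Hypothetical upstreams - point these at your real primary and replica.
        primary, _ := url.Parse("http://primary.internal:8080")
        replica, _ := url.Parse("http://replica.internal:8080")

        toPrimary := httputil.NewSingleHostReverseProxy(primary)
        toReplica := httputil.NewSingleHostReverseProxy(replica)

        handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // Reads go to the replica; writes (POST/PUT/DELETE/...) go to the primary.
            if r.Method == http.MethodGet || r.Method == http.MethodHead {
                toReplica.ServeHTTP(w, r)
                return
            }
            toPrimary.ServeHTTP(w, r)
        })

        log.Fatal(http.ListenAndServe(":8000", handler))
    }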
You can get, I think, up to several hundred cores in a single server, with multiple terabytes of RAM. You could run the entirety of Wikipedia's or Stack Overflow's or Hacker News's business logic in RAM on one server, though you'd still want replicas for bandwidth scaling and failover. Vertical scaling should certainly come back into vogue.
Not to mention that individual servers, no matter how expensive, cost a tiny fraction of the equivalent cloud.
Remember the LMAX Disruptor hype? Their pattern was essentially to funnel all the data for the entire business logic onto one core, and make sure that core doesn't take any bullshit - write the fastest L1-cacheable nonblocking serial code with input and output in ring buffers. Pipelined business processes can use one core per pipeline stage. They benchmarked 20 million transactions per second with this pattern - in 2011. They ran a stock exchange on it.
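A toy sketch of that shape - emphatically not the real Disruptor library, just the pattern: one goroutine running the serial business logic, with single-producer/single-consumer ring buffers on its input and output. All of the names and the "double the number" stage are made up.

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // Single-producer/single-consumer ring buffer: the producer owns head,
    // the consumer owns tail, and each only reads the other's index.
    type spscRing struct {
        buf  []int64
        mask uint64
        head atomic.Uint64 // next slot to write (producer only)
        tail atomic.Uint64 // next slot to read (consumer only)
    }

    func newRing(size uint64) *spscRing { // size must be a power of two
        return &spscRing{buf: make([]int64, size), mask: size - 1}
    }

    func (r *spscRing) push(v int64) bool {
        h := r.head.Load()
        if h-r.tail.Load() == uint64(len(r.buf)) { // full
            return false
        }
        r.buf[h&r.mask] = v
        r.head.Store(h + 1)
        return true
    }

    func (r *spscRing) pop() (int64, bool) {
        t := r.tail.Load()
        if t == r.head.Load() { // empty
            return 0, false
        }
        v := r.buf[t&r.mask]
        r.tail.Store(t + 1)
        return v, true
    }

    func main() {
        in, out := newRing(1024), newRing(1024)

        // The single business-logic stage: one goroutine, straight-line serial
        // code, no locks, input and output both ring buffers.
        go func() {
            for {
                if v, ok := in.pop(); ok {
                    for !out.push(v * 2) { // stand-in for the real business logic
                    }
                }
            }
        }()

        for i := int64(0); i < 10; i++ {
            for !in.push(i) {
            }
        }
        for n := 0; n < 10; {
            if v, ok := out.pop(); ok {
                fmt.Println(v)
                n++
            }
        }
    }

The real Disruptor layers cache-line padding, batching, and sequence barriers on top of this basic shape, which is where much of its throughput comes from.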
Back when the first Intel SSDs were coming out, I worked with an ISP that had an 8 drive 10K RAID-10 array for their mail server, but it kept teetering on the edge of not being able to handle the load (lots of small random IO).
As an experiment, I sent them a 600GB Intel SSD in a laptop drive form factor. They took down the secondary node, installed the SSD, and brought it back up. We let DRBD sync the arrays, and then failed the primary node over to this SSD node. I added the SSD to the volume group, then did a "pvmove" to move the blocks from the 8 drive array to the SSD, and over the next few hours the load steadily dropped down to nothing.
It was fun to replace 8x 3.5" 10K drives with something that fit comfortably in the palm of my hand.