Spot on. We were rudely awoken at 3AM by our alert system after one of our DCs caught fire (host: Europe/123-reg in Nottingham - utter fucking cowboys; we've since moved on from there). The UPS blew and took out the entire power system and generator.
The number of colo issues I've seen triggered by various backup/redundant systems is pretty impressive.
I've seen a redundant mains power system blow (taking down the main PDU), spoiled diesel, a failed generator cutover, a UPS fire, a smoke-detector-triggered shutdown (tied to the power management system), a really bizarre IPv6 ping / router switch-flapping issue, load balancer failures caused by an SSL cipher-implementation bug (which triggered an LB reboot and a ~15s outage at random intervals), etc., etc., etc.
Just piling redundancy on your stack doesn't make it more reliable. You've got to engineer it properly, understand the implications, and actually monitor it and learn what the real outcomes are. Oh, and watch out for cascade failures.
> Just piling redundancy on your stack doesn't make it more reliable.
Yeah, in a sense it actually makes it less reliable as far as mean time between failures goes. As an example, the rate of engine failure in twin-engine planes is greater than for single-engine planes. It's obvious if you think about it: there are now two points of failure instead of one. Why have twin-engine planes? Because you can still fly on one engine (pilots: no nitpicking!).
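To put rough numbers on that (purely illustrative, with a made-up per-flight failure probability and an assumption that the engines fail independently):

```python
# Toy sketch of the twin-engine point; p is an invented number, not real data.
p = 0.001  # hypothetical probability that one engine fails on a given flight

single_any_failure = p                    # one engine: one thing that can break
twin_any_failure   = 1 - (1 - p) ** 2     # twin: either engine failing counts as "a failure"
twin_total_loss    = p ** 2               # twin: both must fail before you lose all thrust

print(f"any engine failure, single: {single_any_failure:.6f}")
print(f"any engine failure, twin:   {twin_any_failure:.6f}")   # roughly 2x higher
print(f"total power loss,   twin:   {twin_total_loss:.8f}")    # vastly lower
```

So the failure *rate* roughly doubles, while the catastrophic outcome becomes far less likely, which is exactly the trade-off being described here.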
What redundancy does do is let you recover from failure without catastrophe (provided you've set it up properly as per the parent).
> Yeah, in a sense it actually makes it less reliable as far as mean time between failures goes.
It depends on what you're protecting against, how you're protecting against it, and how you've deployed those defenses.
Chained defenses generally decrease reliability; parallel defenses generally increase it.
E.g.: putting a router, an LB, a caching proxy, an app server tier, and a database back-end tier (typical Web infrastructure) in series (a chain) introduces complexity and SPOFs to a service. You can duplicate elements at each stage, but you're still subject to DC-wide outages (I've encountered several of these) and you've taken on a great deal of complexity and cost, so you might well consider a multi-DC deployment.
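A rough illustration of the series-vs-parallel point, with invented per-component availability figures and an (optimistic) assumption of independent failures:

```python
# Toy availability model; the numbers are made up, and real components fail in
# correlated, messier ways than this assumes.
from math import prod

tiers = {                 # hypothetical per-component availability
    "router": 0.9995,
    "load_balancer": 0.999,
    "caching_proxy": 0.999,
    "app_server": 0.998,
    "database": 0.998,
}

# In series (a chain): every component must be up, so availabilities multiply
# and the chain is less available than its worst component.
series = prod(tiers.values())

# Two independent copies of the whole chain in parallel (e.g. two DCs):
# the service is down only if both chains are down at the same time.
parallel = 1 - (1 - series) ** 2

print(f"single chain availability: {series:.5f}")
print(f"two independent chains:    {parallel:.7f}")
```

The catch, of course, is the "independent" assumption: duplicated gear in the same DC shares power, cooling, and firmware, which is where the war stories above come from.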
Going multi-DC doesn't increase capital requirements by much, and may or may not be more expensive than 2x the build-out in a single DC. It does, though, raise issues of code and architecture complexity.
In several cases, we were experiencing issues that would have persisted despite redundant equipment. E.g.: the load balancer SSL bug we encountered was present in all instances across multiple generations of the product. Providing two LBs would simply have ensured that when the triggering cipher was requested, both LBs failed and rebooted. Something of an end-run around our Maginot line, as it were.
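A quick way to see why the second LB wouldn't have helped (toy numbers, treating the cipher bug as a common-mode failure shared by every unit):

```python
# Toy model of a common-mode failure: a bug present in every unit.
# All probabilities here are invented for illustration only.
p_hw  = 0.001   # chance a single LB is down due to an independent hardware fault
p_bug = 0.01    # chance the shared cipher bug is triggered in a given window

# Independent hardware faults: a second LB genuinely helps.
down_hw_single = p_hw
down_hw_pair   = p_hw ** 2

# The shared bug takes out both units at once, so the pair buys nothing there.
down_total_single = 1 - (1 - p_hw) * (1 - p_bug)
down_total_pair   = 1 - (1 - p_hw ** 2) * (1 - p_bug)

print(f"hardware-only outage, 1 LB:  {down_hw_single:.6f}")
print(f"hardware-only outage, 2 LBs: {down_hw_pair:.8f}")
print(f"total outage, 1 LB:          {down_total_single:.6f}")
print(f"total outage, 2 LBs:         {down_total_pair:.6f}")  # still dominated by the bug
```

Redundancy only pays off against the failure modes the copies don't share.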
It definitely happens.