Our colo went down. Fire system triggered it (no actual fire was harmed in the triggering, however). On Thanksgiving. When I'd volunteered for on-call seeing as my co-worker had family to attend to and I didn't.
He thanked me afterward.
Our cage was reasonably straightforward to bring up once power was restored. The colo facility as a whole took a few days to bring all systems up; apparently some large storage devices really don't like being Molly-switched.
Yeah. So estimate the cost there at something like $100-250k? I'm willing to accept a pretty low risk to my life for ~15 minutes of searching to save my company $250k. It's a risk on the order of riding a motorcycle from Oakland to San Jose in rush hour, I'd roughly estimate.
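For what it's worth, the back-of-the-envelope version of that trade-off looks something like the sketch below. Every figure in it (the ~$10M value-of-statistical-life number, the ~1 micromort per 6 motorcycle miles, the ~40 mile ride) is a rough assumption of mine, not anything measured:

    # Back-of-the-envelope risk comparison; all figures are rough assumptions.
    SAVINGS_USD = 250_000   # assumed upper end of the outage cost estimate
    VSL_USD = 10_000_000    # assumed value of a statistical life (rough ballpark)
    MICROMORT = 1e-6        # a one-in-a-million chance of death

    RIDE_MILES = 40                      # rough Oakland -> San Jose distance
    MOTORCYCLE_MILES_PER_MICROMORT = 6   # commonly cited ballpark for motorcycling

    break_even_risk = SAVINGS_USD / VSL_USD   # 0.025, i.e. ~25,000 micromorts
    ride_risk = (RIDE_MILES / MOTORCYCLE_MILES_PER_MICROMORT) * MICROMORT  # ~7 micromorts

    print(f"break-even risk: {break_even_risk / MICROMORT:,.0f} micromorts")
    print(f"motorcycle ride: {ride_risk / MICROMORT:.1f} micromorts")

By that (very rough) yardstick, a risk on the order of the motorcycle ride is a few orders of magnitude below the break-even point, so the 15 minutes of searching looks like a comfortable trade.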
"I can totally reach into the back of this gigantic, flesh eating machine to move that widget a little bit to the right. Management might give me a raise for saving them money!" -famous last words of a former factory worker.
"You wouldn't download a new lung, would you?" (RIAA ad in 2033). I lived in a tent 200m downwind of a 24x7 burning trash dump on a former Iraqi and then USAF military base, for some time, so I think I've got particulate risk checked already.
I think I actually hope my odds of eventually getting cancer are ~100% over my lifespan, because it seems to be a natural consequence of living long enough. I also hope that by the time I have cancer of any size, it is something you can treat fairly successfully.
As long as they treat cancer as a profit center, a cure will never surface in the US of A. "Treatment" is a multi-billion (trillion?) dollar business; a cure would reduce that to dust.
Also, you should look up the agony involved; people would rather shoot themselves than take the "treatment".
I don't think the situation was that bad. The one really unforgivable thing was shoddy electrical work in shower trailers (I think ~10 contractors and soldiers were fatally electrocuted while showering in Iraq! I certainly got 230V a couple of times and went through the reporting process, and actually got MPs and a friend from Contracting to turn it into a bigger issue.)
I do that more days than not and I haven't been scraped off the pavement yet. I think most commenters here are overly concerned with the risk because they haven't properly equipped themselves to deal with it. It's much easier to keep yourself out of a body bag when you are aware of your surroundings.
Spot on. We were rudely awoken at 3AM by our alert system after one of our DCs caught fire (Host Europe/123-reg in Nottingham - utter fucking cowboys; we've since moved on from there). The UPS blew and took out the entire power system and generator.
The number of colo issues I've seen triggered by various backup/redundant systems is pretty impressive.
I've seen a redundant mains power system blow (taking down the main PDU), spoiled diesel, a failed generator cutover, a UPS fire, a smoke-detector-triggered shutdown (associated with power management), a really bizarre IPv6 ping / router switch flapping issue, load balancer failures caused by an SSL cipher-implementation bug (which triggered an LB reboot and a ~15s outage at random intervals), and more besides.
Just piling redundancy on your stack doesn't make it more reliable. You've got to engineer it properly, understand the implications, and actually monitor and come to know the actual outcomes. Oh, and watch out for cascade failures.
> Just piling redundancy on your stack doesn't make it more reliable.
Yeah, in a sense it actually makes it less reliable as far as mean-time-between-failures go. As an example, the rate of engine failure in twin-engine planes is greater than for single-engine planes. It's obvious if you think about it: there are now two points of failure instead of one. Why have two-engined planes? Because you can still fly on one engine (pilots: no nitpicking!).
What redundancy does do is let you recover from failure without catastrophe (provided you've set it up properly as per the parent).
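To put some illustrative numbers on that (the per-flight failure probability below is made up, purely to show the shape of the math):

    # Illustrative only: assume each engine fails independently with probability p per flight.
    p = 1e-4  # made-up per-flight engine failure probability

    single_any_failure = p                # one engine, one way to fail
    twin_any_failure = 1 - (1 - p) ** 2   # ~2p: roughly twice as many engine failures
    twin_all_engines_out = p ** 2         # but losing *both* engines is vastly rarer

    print(f"any engine failure, single: {single_any_failure:.2e}")
    print(f"any engine failure, twin:   {twin_any_failure:.2e}")
    print(f"all engines out, twin:      {twin_all_engines_out:.2e}")

More components means more failures, but (assuming they're independent) far fewer total failures.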
> Yeah, in a sense it actually makes it less reliable as far as mean-time-between-failures go.
It depends on what you're protecting against, how you're protecting against it, and how you've deployed those defenses.
Chained defenses, generally, decrease reliability. Parallel defenses generally increase it.
E.g.: putting a router, an LB, a caching proxy, an app server tier, and a database back-end tier (typical Web infrastructure) in series (a chain) introduces complexity and SPOFs to a service. You can duplicate elements at each stage of the infrastructure, but you're still subject to DC-wide outages (I've encountered several of these) and a great deal of complexity and cost, so you might well consider a multi-DC deployment.
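A rough sketch of the series-vs-parallel arithmetic, with made-up per-tier availabilities:

    # Rough availability math for a chained (series) stack vs. duplicating the whole chain.
    # The per-tier availabilities are made-up illustrative numbers.
    from math import prod

    tiers = {"router": 0.9995, "lb": 0.999, "cache": 0.999, "app": 0.998, "db": 0.999}

    # In series, every tier must be up, so availability only ever multiplies downward.
    series_availability = prod(tiers.values())

    # Two independent copies of the chain (e.g. two DCs): down only if both are down.
    parallel_availability = 1 - (1 - series_availability) ** 2

    print(f"single chain: {series_availability:.5f}")
    print(f"two chains:   {parallel_availability:.7f}")

The catch is the word "independent", which is exactly what the shared-bug case below violates.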
Going multi-DC doesn't increase capital requirements by much, and may or may not be more expensive than 2x the build-out in a single DC. It does, though, raise issues of code and architecture complexity.
In several cases, we were experiencing issues that would have persisted despite redundant equipment. E.g.: the load balancer SSL bug we encountered was present on all instances of multiple generations of the product. Providing two LBs would simply have ensured that when the triggering cipher was requested, both LBs failed and rebooted. Something of an end-run around our Maginot line, as it were.
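That's also why the naive redundancy math above breaks down here: with a shared bug, the two failures are perfectly correlated rather than independent. A tiny sketch (the trigger probability is invented):

    # Why a second LB with the same cipher bug wouldn't have helped.
    p_trigger = 0.01  # invented probability that a request hits the bad cipher path

    independent_both_down = p_trigger ** 2   # what naive redundancy math promises
    shared_bug_both_down = p_trigger         # what identical firmware actually delivers

    print(f"independent failures: {independent_both_down:.4f}")
    print(f"shared-bug failures:  {shared_bug_both_down:.4f}")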