A million times this. If turning a whole datacenter off is catastrophic for your business, you've not done your risk management properly. Single events that take down datacenters happen. Not being prepared for them is unforgivable.
Even with DR, there's a cost. Especially if you're a colo provider, powering off your whole facility (or, ideally, just a room) to find smoke is going to be expensive, even if all your customers have DR plans.
In most datacenters I've seen, if it were just a "smell of smoke", I'd probably be willing to do a run-through with an IR cam/temp probe, or just a visual inspection, carrying a handheld 1211, especially if I had a respirator. Clear view and path to two exits, someone at the EPO switch, etc.
The "big scary things" are battery plant and generator plant, and any kind of subfloor or ductwork. As long as the fire isn't in any of those, it's far less of a big deal. I probably wouldn't EPO a room for a server on fire, either -- just kill the rack, which takes slightly longer.
I've been in places where "smell of smoke" was a fucknozzle smoking a cigar or burning leaves outdoors near an air intake, and another where it was a smoker's coat being put on an air handler.
The great thing about DR plans is that, when implemented correctly, no one should have to risk their life to avoid using them. He didn't say that one of his servers was running a little hot (which happens, a lot). He said there was smoke and the acrid smell of something burning, which means one of his components actually got hot enough to ignite.
If you're not ready to use your DR plan, it probably means your DR plan is inadequate to begin with. Why the hell do we run fire drills? Even cruise ships do drills. God forbid they pull their passengers away from that very important game of Texas Hold'em.
> I probably wouldn't EPO a room for a server on fire, either -- just kill the rack, which takes slightly longer.
You fail to understand how fires start or why they spread. I mean, why the hell do datacenters spend millions of dollars on fire suppression when an IR cam and a handheld extinguisher are just as good, right?
Essentially no one does "EPO drills" on their datacenters. Particularly in multi-tenant environments like commercial colocation centers. It's quite reasonable for your DR plan to involve a $200k+ cost per EPO pull or DR failover. Your business should have DR provisions, and you should test the DR plan, but it's probably not reasonable (or legal) to do a full test involving dumping agent, rapid power off, etc.
The fire suppression exists for two reasons. One is to get code exemptions: to be allowed to run wiring in ways which would otherwise require licensed electricians for every wiring job and prohibit people from being in the facility. Two is to detect small fires early and prevent their spread, as well as to protect the facility from a catastrophic facility-wide fire.
Servers are just not that high a fire risk, particularly when de-energized. Generally inside a self-contained metal chassis, less than 100 pounds each, metal/plastic, etc. The power supply is the most likely component to start a fire, and contains a max of maybe 250g of capacitors and other components. The risk of one server catching on fire is low, and the risk of it rapidly spreading to anything else is low, so yes, I'd be comfortable pulling a single burning server out of a de-energized rack.
Also, in big or purpose-built facilities, those components most likely to be fire risks (batteries/power handlers, and generators) are in separate rooms, separated by firewalls from the datacenter. A fire in the battery room is going to be dealt with by sealing that room and powering it off, dumping suppression agent, and bringing out the FD immediately.
Life safety is much more important than business continuity, but a lot of people have jobs where they accept a non-zero risk of physical harm to do their jobs. It's certainly not reasonable to demand a datacenter tech go into a burning building to rescue a database server or something, but approximately zero datacenter staff I know would have a problem with assuming the level of risk I would take to find problems. (It's probably a bigger deal for employers to actually discourage risk-taking by employees, particularly when it's risk-taking to save themselves effort, like single-person racking large UPSes or very large servers, etc.)
I think we are aiming at the same thing here. A proper, multi-tenant datacenter will have separate zones for generators, UPS, electrical, and climate control. The actual chance of a fire starting and spreading in this type of configuration is low, and this is the environment I prefer to work in. I've also worked in server rooms in 100+ year old buildings which did double duty as storage/broom closets. The original post was closer to this, since they had racked UPSes next to their servers and network equipment. It apparently caused enough smoke to fill a server room and make the poster nauseous, which makes me wonder what air-handling capacity they have. It's this type of "datacenter" where you have to worry about your life.
> Even cruise ships do drills. God forbid they pull their passengers away from that very important game of Texas Hold'em.
Well, US-flag passenger ships (among others) are required to hold Fire and Emergency drills at least once every week. But your point stands: having a plan, and executing that plan even when it's prefaced by "this is a drill...", is crucial if you want any hope of things going the right way in an actual emergency.
Here's the thing, though. If your people are properly trained in what to do, and how to use the equipment, then shutting down power to the rack and extinguishing fire on a single server with a fire extinguisher may be a reasonable course of action. But that contingency should have been considered ahead of time and be part of the emergency plan.
The time to decide what to do is not during the emergency.
As for walking around a smoky room looking for the source, that's nuts. I spent one long (but still way too short) day at Military Sealift Command Firefighting School. One of the first things they do is put you in a room full of smoke and make you count out loud. After about 30 seconds you feel like you're going to pass out -- that gets the point across much better than lectures ever will.
Our colo went down. Fire system triggered it (no actual fire was harmed in the triggering, however). On Thanksgiving. When I'd volunteered for on-call seeing as my co-worker had family to attend to and I didn't.
He thanked me afterward.
Our cage was reasonably straightforward to bring up, once power was restored. The colo facility as a whole took a few days to bring all systems up, apparently some large storage devices really don't like being Molly-switched.
Yeah. So estimate the cost there at something like $100-250k? I'm willing to accept a pretty low risk to my life for ~15 minutes of searching to save my company $250k. It's a risk on the order of riding a motorcycle from Oakland to San Jose in rush hour, I'd roughly estimate.
"I can totally reach into the back of this gigantic, flesh eating machine to move that widget a little bit to the right. Management might give me a raise for saving them money!" -famous last words of a former factory worker.
"You wouldn't download a new lung, would you?" (RIAA ad in 2033). I lived in a tent 200m downwind of a 24x7 burning trash dump on a former Iraqi and then USAF military base, for some time, so I think I've got particulate risk checked already.
I think I actually hope my odds of eventually getting cancer are ~100% over my lifespan, because it seems to be a natural consequence of living long enough. I also hope that by the time I have cancer of any size, it is something you can treat fairly successfully.
As long as they treat cancer as a profit center, a cure will never surface in the US of A. "Treatment" is a multi-billion (trillion?) dollar business; a cure would reduce that to dust.
Also, you should look up the agony involved; people would rather shoot themselves than take the "treatment".
I don't think the situation was that bad. The one really unforgivable thing was shoddy electrical work in shower trailers (I think ~10 contractors and soldiers were fatally electrocuted while showering in Iraq! I certainly got hit with 230V a couple of times and went through the reporting process, and actually got the MPs and a friend from Contracting to turn it into a bigger issue.)
I do that more days than not and I haven't been scraped off the pavement yet. I think most commenters here are overly concerned with the risk because they haven't properly equipped themselves to deal with it. It's much easier to keep yourself out of a bodybag when you are aware of your surroundings.
Spot on. We were rudely awoken at 3AM by our alert system after one of our DCs caught fire (Host Europe/123-reg in Nottingham -- utter fucking cowboys; we've since moved on from there). A UPS blew and took out the entire power system and generator.
The number of colo issues I've seen triggered by various backup/redundant systems is pretty impressive.
Whether it was a redundant mains power system blowing (taking down the main PDU), spoiled diesel, failed generator cutover, UPS fire, smoke detector-triggered shutdown (associated with power management), a really bizarre IPv6 ping / router switch flapping issue, load balancer failures based on an SSL cipher-implementation bug (triggered an LB reboot and ~15s outage at random intervals), etc., etc., etc.
Just piling redundancy on your stack doesn't make it more reliable. You've got to engineer it properly, understand the implications, and actually monitor and come to understand the real outcomes. Oh, and watch out for cascade failures.
> Just piling redundancy on your stack doesn't make it more reliable.
Yeah, in a sense it actually makes it less reliable as far as mean-time-between-failures go. As an example, the rate of engine failure in twin-engine planes is greater than for single-engine planes. It's obvious if you think about it: there are now two points of failure instead of one. Why have two-engined planes? Because you can still fly on one engine (pilots: no nitpicking!).
What redundancy does do is let you recover from failure without catastrophe (provided you've set it up properly as per the parent).
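To make that trade-off concrete, here's a minimal back-of-the-envelope sketch. The per-flight failure probability is an assumed, illustrative number (real engine-failure rates are far lower); the point is just that with two engines the chance that *some* engine fails roughly doubles, while the chance that *all* engines fail drops dramatically, assuming the failures are independent.

```python
# Illustrative only: p is an assumed per-flight probability that a given
# engine fails; it is not a real aviation statistic.
p = 0.001

single_engine_total_power_loss = p                 # one engine fails -> no thrust
twin_any_engine_failure        = 1 - (1 - p)**2    # ~2p: twice the parts, twice the faults
twin_total_power_loss          = p**2              # both must fail (independence assumed)

print(f"single engine, total power loss: {single_engine_total_power_loss:.6f}")
print(f"twin engine, any engine failure: {twin_any_engine_failure:.6f}")
print(f"twin engine, total power loss:   {twin_total_power_loss:.8f}")
```

So the component failure rate (and the maintenance bill) gets worse, but the catastrophic outcome becomes far less likely, provided the failures really are independent.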
> Yeah, in a sense it actually makes it less reliable as far as mean-time-between-failures go.
It depends on what you're protecting against, how you're protecting against it, and how you've deployed those defenses.
Chained defenses, generally, decrease reliability. Parallel defenses generally increase it.
E.g.: Putting a router, an LB, a caching proxy, an app server tier, and a database back-end tier (typical Web infrastructure) in series (a chain) introduces complexity and SPOFs to a service. You can duplicate elements at each stage of the infrastructure, but you're still subject to DC-wide outages (I've encountered several of these), plus a great deal of complexity and cost, so you might well consider a multi-DC deployment.
Going multi-DC doesn't increase capital requirements by much, and may or may not be more expensive than 2x the build-out in a single DC. It does, though, raise issues of code and architecture complexity.
In several cases, we were experiencing issues that would have persisted despite redundant equipment. E.g.: the load balancer SSL bug we encountered was present on all instances of multiple generations of the product. Providing two LBs would simply have ensured that, as the triggering cipher was requested, both LBs failed and rebooted. Something of an end-run around our Maginot line, as it were.
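To put rough numbers on the chained-vs-parallel point, and on why the shared LB bug beat the redundancy, here's a minimal sketch. The per-tier availabilities are made-up illustrative assumptions, not measurements from any real stack.

```python
from math import prod

# Assumed, illustrative availabilities for each tier in the chain
# (router, LB, caching proxy, app tier, DB tier) -- not real numbers.
tiers = {"router": 0.999, "lb": 0.995, "cache": 0.999, "app": 0.998, "db": 0.997}

# In series (a chain), the service is up only if every tier is up.
chain_availability = prod(tiers.values())

# Duplicating the LB helps only if the two LBs fail independently.
lb_pair_independent = 1 - (1 - tiers["lb"]) ** 2
chain_with_lb_pair = chain_availability / tiers["lb"] * lb_pair_independent

# A shared software bug (the SSL cipher case) is a common-mode failure:
# both LBs reboot together, so the pair is no better than a single LB.
chain_with_correlated_lbs = chain_availability

print(f"plain chain:            {chain_availability:.5f}")
print(f"independent LB pair:    {chain_with_lb_pair:.5f}")
print(f"correlated LB failures: {chain_with_correlated_lbs:.5f}")
```

Duplicating a tier only helps to the extent its failures are independent; a common-mode bug collapses the pair back into a single point of failure, which is exactly the Maginot-line problem above.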
In the US, it's ok IFF you use recycled Halon. Halon 1211 is still ~2x as effective as the nearest "friendly" alternatives (Halotron). For a facility-scale installed system, using an alternative agent is worth it, because you can just use 2x as much chemical. For a handheld, a 5-10 pound unit is the biggest someone will realistically carry, so having more power is worth it. My goal is to not expend this agent in the next 5-10 years, and to lose maybe 5% during that period, so there's really no downside to the environment in having it in my 3 x 5lb extinguishers vs. in the presumably older extinguishers someone else had. I hope in a decade there is a better alternative, or I'll get them topped up (you're supposed to inspect them every year or 5 years depending on where they're used, but generally a 5-10y lifespan is reasonable).
Counterpoint: As a customer, I don't want to pay for the kind of redundancy that encourages employees to shut down the whole data center when a single tantalum cap pops in a power supply somewhere.
The fact is, there aren't that many flammable things in a data center. Nobody is going to die because they wandered up and down the aisles after they "thought they smelled something burning," in the absence of any visible smoke.
The guidelines in the highest-voted answer on the SO page make sense to me. 1: If you actually see smoke or fire in any significant amount, evacuate. 2: Make someone else aware of what's going on before doing anything else. 3: Keep your escape options open. 4: Think about how much time you can safely spend "guessing", and don't exceed it. 5: Don't second-guess your own common sense. You aren't paid to be a fireman or a hero.
Almost nobody (for any reasonable definition of nobody) has immediate redundancy for their data center. I've worked for three $1B+ companies whose entire business was based on their data center being up and running - and none of them would have returned to service in less than 48 hours if a data center had gone down.
"Not being prepared for them is unforgivable" - would mean that 99% of business do not deserve forgiveness.
It just doesn't make sense to have that kind of redundancy for such a rare event for all but a very, very small minority of businesses. (Telecoms, Google, Stock Exchange, 911, etc...)
I was working at a place as a contractor and they had an amazing backup power system (expensive, diesel with batteries). A semi hits the power main outside the server room and, for some reason, the diesel never starts. The whole datacenter loses power in under 15 minutes.
Always plan for the single event because no amount of money will keep a single site running.
The biggest Danish ISP lost service for quite some time because the truck that was to deliver their new emergency power crashed into the mains transformer station.
Of course they are idiots; they should never have powered off their own mains before they got the new one installed.