Tell HN: AWS appears to be down again
863 points by riknox on Dec 22, 2021 | 614 comments
Console is flickering between "website is unavailable" and being up for my team. This is happening very frequently just now; reliability seems to have taken a hit.


If you haven't seen it yet, the news is that it was a power loss:

> 5:01 AM PST We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.


This is quite interesting, as they claim their datacenter design does better than the Uptime Institute's Tier III+ design requirements, which require redundant power supply paths [https://aws.amazon.com/compliance/uptimeinstitute/]. I really hope they publish a thorough RCA for this incident.


"Electrical power systems are designed to be fully redundant so that in the event of a disruption, uninterruptible power supply units can be engaged for certain functions, while generators can provide backup power for the entire facility." https://aws.amazon.com/compliance/data-center/infrastructure...

So they have 2 different sources of power coming in. And generators. They do mention the UPS is only for "certain functions", so I guess it's not enough to handle full load while generators spin up if the 2 primaries go out. Or perhaps some failure in the source switching equipment (typically called a "static transfer switch").

Some detail on different approaches: https://www.donwil.com/wp-content/uploads/white-papers/Using...


Usually when someone claims T3+ they mean they have UPS clusters in a 3+1 (or similar) configuration, and two different such UPS clusters power the two power strips in a rack. They would also have incoming grid power from two different HV substations with non-intersecting cable paths, plus diesel generators in 3+1 or 5+2 configurations with automatic startup times measured in seconds. The UPSes' power storage (chemical or potential-energy-based devices) can hold enough energy to handle full load for several minutes. If these are designed and maintained correctly, an unexpected component failure should not cause a catastrophic outage even while concurrent scheduled maintenance is ongoing. At each layer (grid incomers, generator incomers, UPS power incomers) there are switches to switch over whenever there's a need (maintenance or failure).

If they claim Tier 4, then they basically have everything in an N+N configuration.
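To make the redundancy arithmetic concrete, here is a rough back-of-the-envelope sketch in Python (all figures are assumed for illustration, not anyone's actual design values):

    # Rough N+1 UPS arithmetic: can the bank carry the load with one module out,
    # and for how long can the batteries bridge a generator start?
    # All numbers below are assumptions for illustration only.
    it_load_kw = 1200          # critical IT load the UPS bank must carry
    ups_unit_kw = 500          # rating of a single UPS module
    ups_units = 4              # "3+1": four modules, sized so any three suffice
    ups_energy_kwh = 200       # usable stored energy across the bank
    gen_start_seconds = 30     # time for standby diesels to accept full load

    surviving_capacity_kw = (ups_units - 1) * ups_unit_kw   # one module failed or in maintenance
    runtime_minutes = ups_energy_kwh / it_load_kw * 60      # autonomy at full load

    print(f"capacity with one module out: {surviving_capacity_kw} kW "
          f"({'OK' if surviving_capacity_kw >= it_load_kw else 'NOT enough'})")
    print(f"battery autonomy: {runtime_minutes:.0f} min vs generator start of {gen_start_seconds} s")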


Though that doesn't match very well with "uninterruptible power supply units can be engaged for certain functions". It sounds worded to convey that the UPS is limited in some way. An interesting old summary of their 2012 us-east-1 incident with power/generators/ups/switching: https://aws.amazon.com/message/67457/


The generators should be powering up as soon as one of the 2 different sources goes down. It takes generators a few minutes to power up and get "warmed up". If they don't start this process until both mains sources are down, then oops, there's a power outage.

I used to work next door to a "major" cable TV station's broadcast location. They had multiple generators on-site, and one of them was running 24/7 (they rotated which one was hot). A major power outage hit, and there was a thunderous roar as all of the generators fired up. The channel never went off the air.


There are setups where the UPS is designed to last long enough for generator spin up as well. I believe it's the most common setup if you have both. I assume spinning up the generators for very short-lived line power blips might be undesirable.


I was in a Bell Labs facility that had notoriously bad power. We occupied the building before the second main feed had been fully approved by the state and run to the building.

Our main computer lab had a serial UPS that was online 100% of the time, though the inverters were under a very light load. If the mains even acted 'weird' (dips, bad power factor, spikes) the UPS jumped full on, and didn't revert to main power until the main was stable for some duration of time. The UPS was able to carry the full lab (which was quite large) for about two hours, allowing plenty of time for the generator to fire up.

The UPS ran a lot, and because the main was 'weird' the outages were often short; the generator wouldn't even start during the first ten minutes of UPS coverage. Of course, the rest of the building would be dark, other than emergency lighting.

I was an embedded firmware engineer, and our development lab was directly on the wall behind the UPS. When it fired into 100% mode, it roared, mostly from cooling. It was sort of a heads up that the power was likely to fail soon.


>fully approved by the state

Why in the world would the state need to be involved in this level of decision?


Are you sure about the few minutes part? The standby generators I've seen take seconds to go from off to full load. We have an 80 kW model, but I've also seen videos of load tests of much larger generators and they also take only seconds to go to full load.


It might depend on when the backup system was built. No company updates their system every year.

A few minutes seems correct for one place I worked.

This was back in the 90's, before UPS technology got really interesting. Our system was two large rooms with racks and racks and racks of car batteries wired together. When the power went out, the batteries took over until the diesel generator could come online.

I saw it work during several hurricanes and other flood events.

I always found the idea of running an entire building off of car batteries amusing. The engineers didn't share my mirth.


Was a generator technician before I got into programming. Even the 2 megawatt systems could start up and take full load in 10-20 seconds. It sounds basically like starting your car with your foot on the gas.

The "when" shouldn't really matter- Diesel engines aren't a new thing. Warming them up isn't really a thing either- they'll have electric warmers hooked up to the building power to keep them ready to go.


Lead acid batteries in that form factor were the staple for many UPS systems, and the thing most people didn't really appreciate was how expensive they were to maintain. If you didn't do regular maintenance, you'd find out that one of the cells in one of the batts was dead causing the whole thing to be unable to deliver power at precisely the worst time. Financially strapped companies cut maintenance contracts at the first sign of trouble.

Edit to add: I was at a place that took over a company that had one of these. With all of the dead batteries, it was just a really really large inverter taking the 3-phase AC to DC back to AC with a really nice and clean sine wave.


Lead acid batteries are still industry standard in many applications where you are OK with doing regular maintenance and you just need them to work, full stop. I think you'd be surprised how much of your power generation infrastructure, for example, has a 125VDC battery system for blackouts.


I think it depends on the type of generator. I know one datacenter I worked with had turbine generators that took a few minutes to get spun up. They were started and spun up by essentially a truck engine. Those generators were quite old, though.


I thought running a generator full time was illegal AF due to environmental regulations?


Not really. If there's no local law against it then it's legal especially outside the cities.


Has datacenter power redundancy undergone any sort of revolution with grid storage becoming industrial scale?

I wonder if a lot of AWS DC design in this area predates the battery grid-storage revolution, which (my impression is) offers a far faster switchover time than a generator spin-up, possibly combined with software systems that detect failures and switch over quickly?

AWS can claim it will be best of breed, but they aren't going to throw out a DC power redundancy investment (or risk downtime) before they've wrung the remaining ROI out of it.


I'd be surprised. Data centers eat a lot of energy, and it's hard to beat the specific energy of diesel (~45 MJ/kg vs ~1 for batteries) and the ability to have nearby tanks or scheduled trucks.
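As a rough illustration of the gap (all figures assumed: ~45 MJ/kg for diesel, ~0.7 MJ/kg usable for lithium-ion packs, ~40% generator efficiency):

    # Back-of-the-envelope: mass needed to ride out a 24 h grid outage at 10 MW IT load.
    load_mw, hours = 10, 24
    energy_mj = load_mw * 1e6 * hours * 3600 / 1e6       # total electrical energy, in MJ

    diesel_mj_per_kg, gen_efficiency = 45, 0.40          # fuel energy and conversion losses
    battery_mj_per_kg = 0.7                              # usable specific energy of Li-ion packs

    diesel_tonnes = energy_mj / (diesel_mj_per_kg * gen_efficiency) / 1000
    battery_tonnes = energy_mj / battery_mj_per_kg / 1000
    print(f"diesel: ~{diesel_tonnes:.0f} t of fuel, batteries: ~{battery_tonnes:.0f} t of cells")

which works out to roughly 48 tonnes of diesel versus well over a thousand tonnes of battery cells for the same day of autonomy.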

Tesla apparently did some early pilot stuff: https://www.datacenterdynamics.com/en/analysis/teslas-powerp...


When deciding data center locations, companies certainly have in mind the quality of the electrical infrastructure in that country or region...

Maybe it could affect people buying services as well.


Likely the UPS can't run HVAC, and you are in an overheat condition in about two minutes with a fully loaded data center without cooling. Proportionately longer as load is reduced.


> I really hope they publish a thorough RCA for this incident.

We're still waiting on the RCA for last week's us-west outage...


Another example of a single DC in a single AZ rendering an entire region almost unusable. This has shades of eu-central-1 all over again.


Amazon is claiming the failure is limited to a single AZ. Are you seeing failures for instances outside of that AZ? If not, how has this rendered "the entire region almost unusable"?


Yes, I've seen issues that affected the entire region. In my specific case, I happened to have an ElastiCache cluster in the affected AZ that became unreachable (my fault for single AZ). But even now, I'm unable to create any new ElastiCache clusters in different AZs (which I wanted to use for manual failover). And there were a lot of errors on the AWS console during the outage.

"almost unusable" is maybe exaggerating, but there were definitely issues affecting more than just the single AZ.


That seems acceptable. The data plane failure is contained to an AZ; the control plane often is not.


Probably because you aren’t the only one trying to do that. The folks who successfully fail over a zone are the ones who have already automated the process and are running active/active configurations so everything is set up and ready to go.


We've had alerts for packet loss and had issues in recovering region-spanning services (both AWS and 3rd party).

Yes, some of these we should be better at handling ourselves, but... it's all very well to say "expect to lose an AZ" but during this outage it's not been physically possible to remove the broken AZ instances from multi-AZ services because we cannot physically get them to respond to or acknowledge commands.

edit: just to short circuit any "well, why aren't you running redundant regions" - we run redundant regions at all times. But for reasons of latency, many customers will bind to their closest region, and the nature of our technology is highly location-bound. It is not possible for us to move active sessions to an alternate region. So something like this is... unpleasant.


You don't have health checks?


How are health checks supposed to help when you can't do anything?


You said:

> it's all very well to say "expect to lose an AZ" but during this outage it's not been physically possible to remove the broken AZ instances from multi-AZ services because we cannot physically get them to respond to or acknowledge commands

"Expect to lose an AZ" includes not being able to make any changes to existing instances in the affected AZ.

If you had instances across multiple AZs behind an ELB with health checks, then the ELB should automatically remove the affected instances.

If you have a different architecture, you would want to:

* Have another mechanism that automatically stops sending traffic to impaired instances (ideal), or

* Have a means to manually remove the instances from service without being able to interact with or modify those instances in any way
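For the manual path, a minimal boto3 sketch along these lines may be all you need (the target group ARN is a placeholder, and the "anything not healthy" filter is just one possible policy):

    # Pull unhealthy targets out of an ALB/NLB target group without touching the
    # instances themselves; deregistration only changes load balancer routing.
    import boto3

    elbv2 = boto3.client("elbv2", region_name="us-east-1")
    tg_arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app/abc123"  # placeholder

    health = elbv2.describe_target_health(TargetGroupArn=tg_arn)
    bad = [d["Target"] for d in health["TargetHealthDescriptions"]
           if d["TargetHealth"]["State"] != "healthy"]

    if bad:
        elbv2.deregister_targets(TargetGroupArn=tg_arn, Targets=bad)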

Does that help, or have I misunderstood your problem?


You have misunderstood our problem. Ec2 behind elb/alb/nlb is the least of our issues.


A lot of people will automatically fail over jobs to other AZ's. That often involves spinning up lots more EC2 instances and moving PB's of data. The end result is all capacity on other AZ's gets used up, and networks get full to capacity, and even if those other zones are technically working, practically they aren't really usable.


While there may be more machines provisioned, many orgs run active setups for failover so they aren’t as affected. In terms of data transfer, it should already be there. Where would it come from? Certainly not the dead AZ.


It is Amazon's own services, which are advertised as multi-AZ, that would generate the bulk of this thundering-herd kind of traffic.


That doesn't appear to have happened though, I haven't seen issues outside az4


Perspective, I would guess. Unless you spend a lot of time on retry/timeout/fail logic around AWS apis, your app could be stuck/blocked in the RunInstances() api, for example.
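One way to avoid getting wedged there is to put explicit timeouts and bounded retries on the client so a control-plane brownout surfaces as an error you can act on. A rough boto3 sketch (the AMI ID and instance type are placeholders):

    import boto3
    from botocore.config import Config
    from botocore.exceptions import ClientError, ConnectTimeoutError, ReadTimeoutError

    ec2 = boto3.client("ec2", region_name="us-east-1",
                       config=Config(connect_timeout=5, read_timeout=15,
                                     retries={"max_attempts": 3, "mode": "standard"}))
    try:
        ec2.run_instances(ImageId="ami-0123456789abcdef0",   # placeholder AMI
                          InstanceType="m5.large", MinCount=1, MaxCount=1,
                          Placement={"AvailabilityZone": "us-east-1a"})
    except (ClientError, ConnectTimeoutError, ReadTimeoutError) as err:
        # Fail over to another AZ/region, or queue the launch for later.
        print(f"launch failed, falling back: {err}")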


So dumb question from someone who hasn't maintained large public infrastructure:

Isn't the whole point of availability zones that you deploy to more than one and support failing over if one fails?

I.e., why are we (consumers) hearing about this or being obviously impacted (e.g. the Epic Games Store is very broken right now)? Is my assessment wrong, or are all these apps that are failing built wrong? Or something in between?


IME people rarely test and drill for the failovers, it's just a checkbox in a high level plan. Maybe they have a todo item for it somewhere but it never seems very important as AZ failures are usually quite rare. After ignoring the issue for a while it starts to seem risky to test for it, you might get an outage due to bugs it's likely to uncover.


Replying to myself - also in this case people are reporting that load balancing service provided by AWS failed so it doesn't necessarily help if your own stuff is tested and working.


> or are all these apps that are failing built wrong

Deploying to multiple places is more expensive, it's not wrong to choose not to, it's trading off reliability for cost.

It's also unclear to me how often things fail in a way that actually only affect one AZ, but I haven't seen any good statistics either way on that one.


As I understand it for something like SQS, Lambda etc, AWS should automatically tolerate an AZ going down. They're responsible for making the service highly available. For something like EC2 though, where a customer is just running a node on AWS, there's no automatic failover. It's a lot more complicated to replicate a running, stateful virtual machine and have it seamlessly failover to a different host. So typically it's up to the developers to use EC2 in a way that makes it easy to relaunch the nodes on a different AZ.
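For stateless EC2 fleets, the usual pattern is to treat instances as disposable and let an Auto Scaling group replace them in whichever AZs are still healthy. A rough boto3 sketch (group name, launch template, and subnet IDs are placeholders):

    # Spread capacity across three AZs; if one AZ's instances die, the group
    # relaunches replacements in the remaining subnets.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-asg",                                  # placeholder
        LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
        MinSize=3, MaxSize=9, DesiredCapacity=3,
        VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",   # one subnet per AZ
        HealthCheckType="ELB",               # replace instances the load balancer marks unhealthy
        HealthCheckGracePeriod=300,
    )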


It sounds like EC2 API is having a brownout due to this, so a lot of people can't failover to a new AZ.


That's the theory but in practice very few companies bother because it's expensive, complicated and most workloads or customers can tolerate less than 100% uptime.


I thought I was Multi AZ but something failed. I am mostly running EC2 + RDS both with 2 availability zones. I will have to dig into the problem but I think the issue is that my setup for RDS is one writer instance and one reader instance, each in a different AZ. However I guess there was nothing for it to fail over to since my other instance was the writer instance, so I guess I need to keep a 3rd instance available preferably in a 3rd AZ?
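If that reader was a plain read replica rather than a Multi-AZ standby, RDS won't promote it automatically; the simpler fix is usually to enable Multi-AZ on the writer, so RDS keeps a synchronous standby in another AZ and fails over to it on its own. A minimal boto3 sketch (the instance identifier is a placeholder):

    # Convert an existing RDS instance to a Multi-AZ deployment: RDS provisions
    # a synchronous standby in another AZ and promotes it automatically on failure.
    import boto3

    rds = boto3.client("rds", region_name="us-east-1")
    rds.modify_db_instance(
        DBInstanceIdentifier="app-db",   # placeholder identifier
        MultiAZ=True,
        ApplyImmediately=True,           # otherwise the change waits for the maintenance window
    )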


Amazon shifts the responsibility for multi-AZ deployment to us customers, saving themselves complexity and charging us extra - win-win for them.


You're supposed to build your app across multiple AZs, but I know a lot of companies that don't do this and shove everything in a single AZ. It's not just about deploying an instance there but ensuring the consistency of data and state across the AZs.


This region in general is a clusterfuck. If companies by now do not have a disaster recovery and resiliency strategy in place, you are just shooting yourself in the foot.


In today's world of stitching together dozens of services, who each probably do the same thing, how is one to avoid a dependency on us-east-1? Add yet another bullet to the vendor questionnaire (ugh) about whether they are singly-homed / have a failover plan?

It's turtles all the way down, and underneath all the turtles is us-east-1.


We are being told that there are still issues in USE1-AZ4 and some of the instances are stuck in the wrong state as of 16:15 EST. There's no ETA for resolution.


Why do folks host their stuff in us-East? Is there a draw other than organizational momentum?


> Why do folks host their stuff in us-East?

Off the top of my head, US-EAST-1 is:

(1) topologically closer to certain customers than other regions (this applies to all regions for different customers),

(2) consistently in the first set of regions to get new features,

(3) usually in the lowest price tier for features whose pricing varies by region,

(4) where certain global (notionally region agnostic) services are effectively hosted and certain interactions with them in region-specific services need to be done.

#4 is a unique feature of US-East-1, #2-#3 are factors in region selection that can also favor other regions, e.g., for users in the West US, US-West-2 beats US-West-1 on them, and is why some users topologically closer to US-West-1 favor US-West-2.


Thank you! This one is why I love HN.


It's the cheapest.


us-east-2 has exactly the same prices as us-east-1.


Most likely inertia. us-east-1 was the first AWS region, gets new features released there first and is the largest in the USA, so many companies have been running there for many years, and the cost of moving to us-east-2 > the cost of occasional AWS-created downtime.


How come they don't have power backups?


"When a fail-safe system fails, it fails by failing to fail-safe." - https://en.wikipedia.org/wiki/Systemantics


is that just playing with words?


I think it's predicated on a misunderstanding of what "fail-safe" actually means.

For example, in railway signaling, drivers are trained to interpret a signal with no light as the most restrictive aspect (e.g. "danger"). That way, any failure of a bulb in a colored light signal, or a failure of the signal as a whole, results in a safe outcome (albeit that the train might be delayed while the driver calls up the signaler).

Or, in another example from the railways, the air brake system on a train is configured such that a loss of air pressure causes emergency brake activation.

Fail-safe doesn't mean "able to continue operation in the presence of failures"; it means "systematically safe in the presence of failure".

Systems which require "liveness" (e.g. fly-by-wire for a relaxed stability aircraft) need different safety mechanisms because failure of the control law is never safe.


> "systematically safe in the presence of failure".

And even then, you still need to define "safe". Imagine a lock powered by an electromagnet. What happens if you lose power?

The safety-first approach is almost always for the unpowered lock to default to the open state — allow people to escape in case of emergency.

Conversely, the security-first approach is to keep the door locked — nothing goes in or out until the situation is under control.

A more complex solution is to design the lock to be bistable. During operating hours when the door is unlocked, failure keeps it unlocked. Outside operating hours, when the door is set to locked, it stays locked.

The common factor with all these scenarios is that you have a failure mode (power outage), and a design for how the system ensures a reasonable outcome in the face of said failure.
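That failure-mode table is small enough to write down as a toy model, just to make the policies explicit:

    # Toy model of the three lock policies: what state the door falls into when
    # the electromagnet loses power.
    def door_state_on_power_loss(policy: str, configured_state: str) -> str:
        if policy == "fail_safe":       # safety first: let people out
            return "unlocked"
        if policy == "fail_secure":     # security first: nothing moves
            return "locked"
        if policy == "bistable":        # hold whatever state was last commanded
            return configured_state
        raise ValueError(policy)

    assert door_state_on_power_loss("fail_safe", "locked") == "unlocked"
    assert door_state_on_power_loss("bistable", "locked") == "locked"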


Or nuclear reactors that fail safe by dropping all the control rods into the core to stop all activity. The reactor may be permanently ruined after that (with a cost of hundreds of millions or billions to revert) but there will be no risk of meltdown.


Sort of. A failsafe reactor design [can] include[s] things like:

* Negative temperature coefficient of reactivity: as temperature increases, the neutron flux is reduced, which both makes it more controllable, and tends to prevent runaway reactions.

* Negative void coefficient of reactivity: as voids (steam pockets) increase, the neutron flux is reduced.

* Control rods constructed solely of neutron absorber. The RBMK reactor (Chernobyl) in particular used graphite followers (tips), which _increased_ reactivity initially when being lowered.

It's also worth noting that nuclear reactors are designed to be operated within certain limits. The RBMK reactor would have been fine had it been operated as designed.

Source: was a nuclear reactor operator on a submarine.


I don't know enough about reactor control systems to be sure on that one. The idea of a fail-safe system is not that there's an easy way to shut them down, but more that the ways we expect the component parts of a system to fail result in the safe state.

e.g. consider a railway track circuit - this is the way that a signaling system knows whether a particular block of a track is occupied by a train or not. The wheels and axle are conductive so you can measure this electrically by determining whether there's a circuit between the rails or not.

The naive way to do this would be to say something like "OK, we'll apply a voltage to one rail, and if we see a current flowing between the rails we'll say the block is occupied." This is not fail-safe. Say the rail has a small break, or power is interrupted: no current will flow, so the track always looks unoccupied even if there's a train.

The better way is to say "We'll apply a voltage to one rail, but we'll have the rails connected together in a circuit during normal operation. That will energize a relay which will cause the track to indicate clear. If a train is on the track, then we'll get a short circuit, which will cause the relay to de-energize, indicating the track is occupied."

If the power fails, it shows the track occupied because the relay opens. If the rail develops a crack, the circuit opens, again causing the relay to open and indicate the track is occupied. If the relay fails, then as long as it fails open (which is the predominant failure mode of relays) the track is also indicated as occupied.
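The same logic can be written out as a toy model, just to show how every credible failure lands on the restrictive side:

    # Toy model of the fail-safe track circuit: the block reads "clear" only
    # while current keeps the relay energized; a broken rail, lost supply,
    # dropped relay, or a train shorting the rails all read as "occupied".
    def track_indication(supply_ok: bool, rail_intact: bool,
                         relay_functional: bool, train_present: bool) -> str:
        current_flows = supply_ok and rail_intact and not train_present
        relay_energized = relay_functional and current_flows
        return "clear" if relay_energized else "occupied"

    assert track_indication(True, True, True, train_present=False) == "clear"
    assert track_indication(True, True, True, train_present=True) == "occupied"
    assert track_indication(False, True, True, train_present=False) == "occupied"  # power failure
    assert track_indication(True, False, True, train_present=False) == "occupied"  # cracked rail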


No. For example, train signalling, which controls whether a train can go onto a section of track, operates in a fail-safe manner: if something goes wrong, the signal fails into a safe "closed" state rather than an unsafe "open" state. This means trains are incorrectly being told to stop even though technically the tracks are clear, rather than incorrectly being told to go even though there is another train ahead.

"fail-safe" doesn't mean "doesn't fail", it means that the failure mode chooses false negatives or false positives (depending on the context) to be on the safe side.


You mean to ask if it's a joke? Yes, it's a joke.

Or you ask if it's a lesson about how real systems operate? Because yes, it's a very serious lesson about how real systems operate.

Anyway, you seem to be out of your depth on systems engineering. Your reply downthread isn't applicable (of course fail-safes can fail; anything can fail). If you want to learn more in this area (not everybody wants to, and that's OK), following that link of systems theory books on the wiki may be a good idea. Or maybe start at the root:

https://en.wikipedia.org/wiki/Systems_theory

Notice that there is a huge amount of handwaving in system engineering. I don't think this is good, but I don't think it's avoidable either.


"Notice that there is a huge amount of handwaving in system engineering. I don't think this is good, but I don't think it's avoidable either."

In my experience, you can be specific, but then you get the problem that people think that if they just 'what if' a narrow solution to the particular problem you're presenting they've invalidated the example, when the point was 1. that this is a representative problem, not this specific problem and 2. in real life you don't get a big arrow pointing at the exact problem 3. in real life you don't have one of these problems, your entire system is made out of these problems, because you can't help but have them, and 4. availability bias: the fact that I'm pointing an arrow at this problem for demonstration purposes makes it very easy to see, but in real life, you wouldn't have a guarantee that the problem you see is the most important one.

There's a certain mindset that can only be acquired through experience. Then you can talk systems engineering to other systems engineers and it makes sense. But prior to that it just sounds like people making excuses or telling silly stories or something.

"(of course fail-safes can fail, anything can fail)"

Another way to think of it is the correlation between failure. In principle, you want all your failures to be uncorrelated, so you can do analysis assuming they're all independent events, which means you can use high school statistics on them. Unfortunately, in real life there's a long tail (but a completely real tail) of correlation you can't get rid of. If nothing else, things are physically correlated by virtue of existing in the same physical location... if a server catches fire, you're going to experience all sorts of highly correlated failures in that location. And "just don't let things catch fire" isn't terribly practical, unfortunately.

Which reiterates the theme that in real life, you generally have very incomplete data to be operating on. I don't have a machine that I can take into my data center and point at my servers and get a "fire will start in this server in 89 hours" readout. I don't get a heads up that the world's largest DDOS is about to be fired at my system in ten minutes. I don't get a heads up that a catastrophic security vulnerability is about to come out in the largest logging library for the largest language and I'm going to have a never-before-seen random rolling restart on half the services in my company with who knows what consequences. All the little sample problems I can give in order to demonstrate systems engineering problems imply a degree of visibility you don't get in real life.


>is that just playing with words?

It conveys reality, that "fail-safe" isn't literal, as if anyone believed that.


I mean it has to be a play on words or tongue in cheek, simply b/c the assumption of a fail-safe system failing is already contradictory. So you cannot say anything smart about that beyond: there are no fail-safe systems that fail.


The real world is the play. Words are just catching up.


https://en.wikipedia.org/wiki/Gare_de_Lyon_rail_accident

Fail safes do fail. Often due to severe user error.


Do you mean in that it fails by failing to be the thing that it purports to be? Making it no longer that thing? At what point does bread become toast?


An unknown unknown.


Some datacenter failures aren't related to redundancy. Some examples: 1) a transfer switch failure where you can't switch over to backup generators and the UPS runs out, 2) someone accidentally hits the EPO (emergency power off), 3) maintenance work makes a mistake such as turning off the wrong circuits, 4) cooling doesn't switch over fully to backups and while your systems have power, it's too hot to run. The list can go on and on.

I'm not sure why this is a big deal though, this is why Amazon has multiple AZs. If you're in one AZ, you take your chances.


It was not a total power loss. Out of 40 instances we had running at the time of the incident, only 5 appeared to be lost to the power outage. The bigger issue for us was that the EC2 API to stop/start these instances appeared to be unavailable (probably due to the rack these instances were in having no power). The other issue that was impactful to us was that many of the remaining running instances in the zone had intermittent connectivity out to the internet. Additionally, the incident was made worse by many of our supporting vendors being impacted as well...

IMO it was handled rather well and fast by AWS... not saying we shouldn't beat them up (for a discount) but being honest this wasn't that bad.


If the rack your instances are running in is totally offline, then the EC2 API unfortunately can't talk to the dom0 and tell the instances to stop/start, so you get annoying "stuck instances" and really can't do anything until the rack is back online and able to respond to API calls.


Sometimes, you have a component which fails in such a way that your redundancies can't really help.

I once had to prepare for a total blackout scenario in a datacenter because there was a fault in the power supply system that required bypassing major systems to fix. Had some mistake or fault happened during those critical moments, all power would've been lost.

Well-designed redundancy makes high-impact incidents less likely, but you're not immune to Murphy's law.


To my mind, among the more frustrating aspects of implementing protection against failure is that the mechanisms being added can themselves cause failure.

It's turtles all the way down.


You need to pick your battles and choose what you want to protect against to mitigate risk and enable day-to-day operations.

For example, too often people will set up clustered databases and whatnot because "they need HA" without much thought about all the other potential effects of using a cluster, such as much more complicated recovery scenarios.

In the vast majority of cases, an active-passive replicated database with manual failover is likely to have fewer pitfalls and gives you the same operational HA a clustered database would, even though in the case of a (rare) real failure it would not automatically recover like a cluster might.


Anything can fail, even your backup, and especially if it's mechanical.


The battery backups (called uninterruptible power supplies) are only meant to bridge the gap between the power going out and the generator turning on, which is a few minutes. Did they say power was the issue this time? I suspect it’s actually something else (ahem network)


Their datacenter(s) aren’t magic because they are AWS. That facility is probably a decade old and like anything else as it ages the technical and maintenance debt makes management more challenging.


They do. I remember watching one of their sessions where they showed every rack having its own battery backup.


An article on that: https://datacenterfrontier.com/aws-designs-in-rack-micro-ups...

Interesting quote:

“This is exactly the sort of design that lets me sleep like a baby,” said DeSantis. “And indeed, this new design is getting even better availability” – better than “seven nines” or 99.99999 percent uptime, DeSantis said.


According to the SOC certifications they give their customers they do.


I've built out many 42U racks in DC's in my time and there were a couple of rules that we never skipped:

1. Dual power in each server/device - One PSU was powered by one outlet, the other PSU by a different one with a different source, meaning that we can lose a single power supply/circuit and nothing happens.

2. Dual network (at minimum) - For the same reasons as above, since the switches didn't always have dual power in them.

I've only had a DC fail once, when an engineer was performing work on the power circuitry for the DC and thought he was taking down one circuit, but it was in fact the wrong one, and he took both power circuits down at the same time.

However, a power cut (in the traditional sense where the supplier has a failure so nothing comes in over the wire) should have literally zero effect!

What am I missing?

I've never worked anywhere with Amazon's budget so why are they not handling this? Is it more than just the incoming supply being down?


> 1. Dual power in each server/device - One PSU was powered by one outlet, the other PSU by a different one with a different source meaning that we can lose a single power supply/circuit and nothing happens

Nothing happens if you remember that your new capacity limit per DC supply is 50% of the actual limit, and you're 100% confident that either of your supplies can seamlessly handle their load suddenly increasing by 100%.

I've seen more than one failure in a DC where they wired it up as you described, had a whole power side fail, followed by the other side promptly also failing because it couldn't handle the sudden new load placed on it.


EDIT: I misunderstood; you were talking about power feeds. The normal case is to run "48% as if it's 100%" (because of power spikes, but also because most types of transformers run more efficiently under specific levels of load, 40-60%).

Normally this is factored into the rack you buy from a hardware provider: they will tell you that you have 10A or 16A on each feed; if you exceed that, it will work, but you are overloading their feed and they might complain about it.


The poster was speaking more of the power delivery going to the power supplies, not the server's power supplies themselves. So say each PSU 1 is wired to circuit A, each PSU 2 is wired to circuit B. Circuit A experiences a failure. All servers instantly switch over all their load to their PSU 2's on circuit B. Suddenly circuit B's load is roughly double what it was just moments ago. If proper planning wasn't created or followed, this might overload circuit B, meaning all PSU 2's go dark regardless of the server being able to do the change over or not.
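To put numbers on that (figures assumed for illustration):

    # Worked example of the A/B feed headroom rule.
    feed_rating_amps = 16            # rating of each of the two rack feeds
    draw_per_feed_amps = 10          # steady-state draw while both feeds are healthy

    total_draw = 2 * draw_per_feed_amps
    after_failure = total_draw       # the surviving feed inherits everything
    print(f"surviving feed: {after_failure}A on a {feed_rating_amps}A circuit -> "
          f"{'overloads' if after_failure > feed_rating_amps else 'holds'}")
    # To survive the failover, total draw must fit within one feed's rating,
    # i.e. each feed should normally sit at or below ~50% of its rated capacity.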


Yeah I understand on re-reading: but that's also not how people run datacenters.

Obviously people can operate things however they want, but you won't get a Tier 3 classification with that setup.


OP is talking about the DC power feed, not a single server PSU.


You don't get fed DC power, you get fed AC power.

But, point taken: yes your power feed should be running at <50%. But that just means you treat 50% as 100% just like any resource.

Mostly this is outsourced to the datacenter provider; they'll give you a per side rating. (usually 10A or 16A) which also matches the cooling profile of the cabinet.


I mean, in some datacenters they run DC power to each rack. It's definitely more esoteric than having each device run AC, but some people do it.

However, with their comment DC == Data Center, not Direct Current.


Yeah, I got thrown off by the "per DC supply is 50% of the actual limit".

DC = datacenter? That made no sense to me, so my head replaced it with "power supply" instead of "DC supply"; the second sentence does make sense as datacenter though.


> I've only had a DC fail once when the engineer was performing work on the power circuitry for the DC and thought he was taking down one, but was in fact the wrong one and took both power circuits down at the same time.

This is all local scale. Your setup would not survive a data center scale power outage. At scale power outages are datacenter scale.

Data centers lose supply lines. They lose transformers. Sometimes they lose primary feed and secondary feed at the same time. Automatic transfer switches cannot be tested periodically i.e. they are typically tested once. Testing them is not "fire up a generator and see if we can draw from it"

It is cheaper to design a system that must be up which accounts for a data center being totally down and a portion of the system being totally unavailable than to add more datacenter mitigations.


The datacenter we were in had dual-sourced grid power (two separate grid connections on opposite sides of the block, coming from different substations) along with a room of batteries (good for iirc 1hr total runtime for the whole datacenter, setup in quad banks, two on each "rail"), _and_ multiple independent massive diesel generators, which they ran and switched power to every month for at least an hour.

And to top it off each rack had its own smaller UPS at the bottom and top, fed off both rails, and each server was fed from both.

We never had a power issue there; in fact SDGE would ask them to throw to the generators during potential brown-out conditions.

Of course this was a datacenter that was a former General Atomics setup iirc ...


We were in a triple sourced data center. Fed by three different substations. Everything worked like a charm. Until Sandy hit. It did not affect us at all. But it affected the power company. And everything still worked fine, until one of the transfer switches transferred into UPS position and stopped working in that position.


Yes but if you have reliable power from two different sources then the biggest risk (I'd imagine) is the failover circuitry! Something that should be tested tbh.

Also, there are banks of batteries and generators in between the power company cables and the kit: did they not kick-in?

Again, this is all pure speculation: I have absolutely no idea of the exact failure, nor how their infrastructure is held together - this is all just speculation for the hell of it :)


> Yes but if you have reliable power from two different sources then the biggest risk (I'd imagine) is the failover circuitry! Something that should be tested tbh.

That's ATS. It is not really advisable to test their under load performance because the failure of an ATS would be catastrophic. ATS typically would be tested at the installation and after that their parameters would be monitored.

Replacing a functional in line ATS would be a 9-12 months long project.

> Also, there are banks of batteries and generators in between the power company cables and the kit: did they not kick-in?

At high energy you are pretty much always going to use an ATS.


> the failure of an ATS would be catastrophic

Because that would mean no power at all to the DC and no way to get it back? (I am completely ignorant on this topic)


> Because that would mean no power at all to the DC and no way to get it back? (I am completely ignorant on this topic)

While most of smarts in the ATS are in the electronics, the really nasty failures come from the mechanical part.

At the end of the day a high energy ATS looks just like a switch behind a meter in your house. There's a lip that goes from one position to another, except in a high energy ATS the lip is big and when the transfer occurs it slams from one source to another.

There are only so many of those physical slams that it can withstand to begin with, so you want to minimize that number.

The second failure mode is that after a transfer to the non-main source, the lip can get stuck there, making it impossible to switch back to the main. [Once I have seen the lip melt into the secondary position. While I thought it was weird, the guys from the power company said it is not that uncommon.] This creates a massive problem as the non-main source is typically not designed for long-term 24x7 operation. So now you are stuck on a secondary feeding system and you can't just transfer to main without de-energizing the system, i.e. taking the power out of the entire data center.


Frying hardware can affect a much wider scope.

I've had bad power supplies fry out taking the whole power circuit with it, and thus half (or whatever fraction) of the rack's power. I've also had bad power supplies bring down the whole machine as they shunted everything internal too.

When things go bad, anything can happen. You can provide the best effort, and it'll usually work as expected, but there will always be something that can overcome your best efforts.


The only full datacenter outage I've personally experienced was a power maintenance tech testing the transfer switch between systems where the power was 90 degrees out of phase. Big oof.


Transfer switches at any facility that's worth being colocated in are exercised as periodically as the generators to which they connect. In all of the facilities I have had systems in (>20MW total steady state IT load), that meant once per month at minimum to keep generators happy -and to ensure the transfer functionality works-, and more often if the local grid demands it, e.g. ComEd in Chicago, or Dominion in NoVA asking for load shedding.


"It is cheaper to design a system that must be up which accounts for a data center being totally down and a portion of the system being totally unavailable than to add more datacenter mitigations."

Citation needed - the same issue with testing, data races and expensive bandwidth come up.


At high energy the lead time for the components is measured not in days but in years.


And so is development time of any distributed software system, and training time required to operate it correctly


> And so is development time of any distributed software system, and training time required to operate it correctly

Software is much easier than hardware. If you are to start a project today in this kind of hardware, you will be operating it in 2029, without changes.


"Software is much easier than hardware. If you are to start a project today in this kind of hardware, you will be operating it in 2029, without changes."

I don't think this makes sense, you are using the three statements "Software is cheaper", "Software takes less time" and "software is easier" as if they all mean the same thing, and proving one means proving all of them.

Hardware takes a long time, okay, that does not mean it's expensive. Building a hydroelectric dam takes 20 years, but it provides the cheapest source of electricity that ever existed. Ships can take a decade from order to delivery, they are the cheapest mode of transport.


Why spend the cost on dual X and Y when you can failover to another cluster?

For big DC workloads, it is usually, though not always, better to take the higher failure rate than add redundancy.


Really? You'd think at Amazon's scale an additional PSU in a 1U custom-built server (I assume they're custom) would be a few tens of $ at most.

Actually, now that I type that it makes sense. Scaling a few tens of dollars to a bajillion servers on the off-chance that you get an inbound power failure (quite rare I'd reckon) might cost more than what they'd lose if it does actually fail.

So yeah, they're potentially just balancing the risk here and minimising cost on the hardware.

Edit: changed grammar a bit.


At big cloud provider scale like Amazon, Azure, and Google they probably aren't even running PSUs at each server, they're probably doing DC at the rack these days. No point in having a million little transformers everywhere, far easier maintenance centralizing those and have multiple feeding the bus bars going to each rack.


The ones I'm seeing designed have been moving the DC out to the cabinets, with A/B 480VAC power feeds on the bus and integrated DC inverters/rectifiers/batteries at the rack level.

More modular and a lot less copper at 10x the voltage. Still a lot of copper.


> I've never worked anywhere with Amazon's budget so why are they not handling this?

Perhaps we are going to discover how AWS produces such lofty margins by way of their next RCA publication.


> What am I missing?

My guess is that they cheaped out on having redundant PSUs to get you to use multiple availability zones. (More zones = more revenue.)

Even a single PSU shouldn't be an issue if they plugged in an ATS, though.


Unless the ATS breaks, which happens.


Yup. I'm still upset (but not angry) about https://status.linode.com/incidents/kqhypy8v5cm8.


For sure, in my context I meant an ATS in a single rack/cabinet. If that went bad the blast radius would be contained to a single cabinet. But yeah, anything can and will happen. At another place I worked at, a site UPS took down an entire server room. It was a pretty nice Eaton system but there was some event that fried the whole thing. Eaton had to send a specialist to investigate the matter as those events are pretty rare.


What about a UPS/battery thingy? That's saved me a few times, though it normally just gives enough time for a short outage. Is it uncommon in cloud infra?


Even regular datacenters will often have UPS systems the size of a small car, usually several of them, to power the entire datacenter for a few minutes to get the diesel generator started.


Every time a major cloud provider has an outage, Infra people and execs cry foul and say we need to move to <the other one>. But does anyone really have an objective measure of how clouds stack up reliability-wise? I doubt it, since outages and their effects are nuanced. The other move is that they want to go multi-cloud... But I’ve been involved in enough multi-cloud initiatives to know how much time and effort those soak up, not to mention the overhead costs of maintaining two sets of infra sub-optimally. I would say that for most businesses, these costs far exceed that occasional six-hour-long outage.


I mean from the explanation[0], assuming that is correct (I don't have evidence to suggest it's false) - you don't need to be multi-cloud, and you don't even need to be multi-region. As long as you're spread out over multiple availability zones in a region you should be resilient to this failure.

Somewhat surprising to see how many things are failing though, which implies, either that a lot of services aren't able to fail-over to a different availability zone, or there is something else going wrong.

[0] https://news.ycombinator.com/item?id=29648992


AWS doesn't follow their own advice about hosting multi-regional so every time us-east-1 has significant issues pretty much every AZ and region is affected.

Specifically, large parts of the management API and IAM service are seemingly centrally hosted in us-east-1. So-called global endpoints are also dependent on us-east-1, as are parts of AWS' internal event queues (e.g. EventBridge triggers).

If your infrastructure is static you'll largely avoid the fallout, but if you rely on API calls or dynamically created resources you can get caught in the blast regardless of region


Your last comment is really important, I think. I have always petitioned for "passive over active" design in distributed cloud systems. The recent outages, and also ones from the past, demonstrate why.

The fewer API calls you need to make in-band with whatever throughput is generated via your customer demand, the better. Related to that, I have been critical of lambda/FaaS/serverless infrastructure patterns for similar reasons. Always felt like a brittle house of cards to me (N.B. I do still use aws lambda, but keep it constrained to non-critical workloads).


> The fewer API calls you need to make in-band with whatever throughput is generated via your customer demand, the better.

Agreed; however, this is somewhat difficult to do correctly. There are all sorts of systems that might have hidden dependencies on managed services. e.g. AWS IAM roles will almost always be checked at some point if your services need to interact with AWS managed services.

I think cloud providers could meet developers half way here, by providing ways to reduce API usage; but I'm not sure if it aligns with their incentives.


AWS IAM is designed with a control plane / data plane dichotomy. Even if the control plane is completely dead and all API requests are failing, services in a steady-state (i.e. not responding to changes via API calls) can still rely on IAM roles using their cached information. For example, in the recent us-east-1 outage when you couldn't start new services because IAM checks would fail, existing EC2 instances that rely on IAM instance profiles to access services like S3 could still do so even though IAM was down.
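That data-plane caching is visible from inside an instance: the role credentials come from the local instance metadata service, not from a live IAM API call. A small sketch (IMDSv2; only works when run on an EC2 instance that has an instance profile attached):

    import json
    import urllib.request

    IMDS = "http://169.254.169.254/latest"

    # IMDSv2: fetch a session token first, then use it for metadata reads.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"})
    token = urllib.request.urlopen(token_req, timeout=2).read().decode()

    def imds_get(path: str) -> str:
        req = urllib.request.Request(f"{IMDS}{path}",
                                     headers={"X-aws-ec2-metadata-token": token})
        return urllib.request.urlopen(req, timeout=2).read().decode()

    role = imds_get("/meta-data/iam/security-credentials/").splitlines()[0]
    creds = json.loads(imds_get(f"/meta-data/iam/security-credentials/{role}"))
    print(creds["Expiration"])   # temporary keys are cached and rotated locally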


I was gonna respond with the same commentary here. That has been my experience beyond just IAM controls and why I advocate for passive systems for critical workloads.

Sometimes this _can_ be costly. For example with something like autoscaling, that's an active system I've seen fail when seemingly unrelated systems are failing. The result is scaling out systems intentionally ahead of time to deal with oversubscription or burst traffic, which can leave you with (costly) idle compute.

I don't mind this tradeoff personally, but can understand that budget constraints are going to be different org to org.


It's not an AWS incentive thing really, it's a developer/consumer incentive thing.

It's like the duality of modular code. If you want to manage one change in a lot of places, it's easiest to change it in the one module that everything else sources. But that means that one change to that module can take down everything. The alternative where you copy+paste the same change everywhere is the most resilient to failure, but also the most difficult and expensive.

AWS provides a lot of modular, dynamic things because that's what their customers want to use. But using each of those things increases the probability of failure. It's up to the customer to decide how they want to design their system using the components available.... and the customers always chose the easy path rather than the resilient path.

The great thing is that with AWS, at least you have the option to design a super freaking reliable system. But ultimately there's no way to make it easy, short of a sort of "Heroku for super reliable systems". (I know there are a few, but I don't know anything about them)


I like the way you framed this; it's a tradeoff, mainly. You can build something fault-tolerant and highly available, even during AWS outage events, but you have to give up a ton of the product offerings in their suite.

I've managed to stick to EC2/ELB and S3 as passive systems for the vast majority of what we build at my org (~90% of our stack). And for the most part, AWS failures are hitless for us as a result.


Yeah, my thought is not specific to this scenario. Indeed multi-AZ is a low cost and probably good idea because you often have a shared service management, control plane, and cheap bandwidth between things. Of course, when things fail they often ripple as may be the case here. I don't think clouds have their blast radius perfectly contained and they certainly don't communicate those details well.

One incident I recall involved our GCP regional storage buckets, which we were using to achieve multi-region redundancy. One day, both regions went down simultaneously. Google told us that the data was safe but the control plane and API for the service is global. Now I always wonder when I read about MR what that actually means...


That’s true for this failure but the prior two for AWS were region wide and the one for GCP last month was global.


Perhaps it is us, the customers (and our customers, and the customers of our customers, ...), who should get used to the idea that "things can go wrong"? Except for some specific scenarios (medical-related stuff, for instance), if my favourite online shopping place is down, well, it's down, I'll buy later.


I know the Oracle OCI cloud has a reputation for never going hard-down, but also realize HN seems to loathe Big Red (understandably, to a degree, though OCI is pretty nice IME and _very_ predictable).


I don't think it's unfair. They aren't the worst villain, but they are up there.


> I doubt it, since outages and their effects are nuanced.

Your point here deserves highlighting. A failure such as a zone failing is nowadays a relatively simple problem to have. But cloud services do have bugs, internal limits or partial failures that are much more complex. They often require support assistance, which is where the expertise of their staff comes into play. Having a single provider that you know well and trust is better than having multiple providers where you need to keep track of disparate issues.


I agree with you. I think that having multi-AZ is the first thing to figure out before wanting to do multi-cloud, which is just another buzzword taken out of management's bullshit bucket :)


Agree, and multi AZ is usually easy. IME with AWS and GCP the control plane is the same, the scaling works across AZ, bandwidth is free and latency is near zero. The level of effort to do that is simply ticking the right boxes at setup time IME.


Cross-AZ bandwidth is far from free and the biggest reason companies avoid it (IMO). Also latency is not near zero but I don't think that's the primary reason.


That's true. The moment your data leaves the region you start paying for egress and that can get expensive quickly. Still beats the crap out of multicloud, though :)


I’ve seen at least half a dozen full region AWS issues in the past 8 months.

You really need multi-region and also not be relying on any AWS service that’s located only in us-east-1 (including everything from creating new S3 buckets to IAM’s STS).
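For the STS part specifically, one mitigation is to pin clients to a regional STS endpoint instead of the legacy global one (sts.amazonaws.com), which is hosted in us-east-1. A minimal boto3 sketch:

    import boto3

    # Use the us-west-2 STS endpoint rather than the global/us-east-1 one.
    sts = boto3.client("sts", region_name="us-west-2",
                       endpoint_url="https://sts.us-west-2.amazonaws.com")
    print(sts.get_caller_identity()["Account"])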


Who says this? I have literally never once seen this.


Is there a history of AWS downtimes available somewhere? This makes what, three times in as many months?

edit: The question isn't necessarily AWS specific, just any data on amount of downtime per cloud provider on a timeline would be nice.


I have tons of this kind of data due to my side project, StatusGator. For some services like the big cloud providers I have data going back 7 years.

There indeed has been an uptick in AWS outages recently. You can see a bit of the history here: https://statusgator.com/services/amazon-web-services


(I was idly curious. It appears this data is available as part of the ~US$280/mo tier, along with a bunch of other things.)


I don't know about AWS, but both Google Cloud and Oracle Cloud maintain at least a high level history of past outages. See https://status.cloud.google.com/summary and https://ocistatus.oraclecloud.com/history


Given the hilariously awful reputation of the AWS status page I would hazard a guess that such a page would also be incredibly inaccurate.

If you can’t even admit you’re having an issue how can you keep an accurate record?


Similar with GCP. We had a pretty bad outage once where the status page was showing all green. Google informed us that because the actual issue was further down the stack and didn't trigger any internal SLOs the status didn't get an update. It took them hours to acknowledge and fix it.


Assuming you have a support contract the rep should send out a post-mortem page.

This is what happens when we've been affected by outages (even without involving support).


I think they did eventually but it took us quite a bit of troubleshooting, then creating a P1 ticket, then their investigation in order to get to the bottom of it and getting it fixed. And the status page never got an update, which is the subject I was adding to.


I'd say three times in as many weeks, give or take.


This is a little more broad, beyond just cloud infra providers, but includes some of the kind of data you're looking for (post-mortems for outage events): https://github.com/danluu/post-mortems


The most hilarious irony of not being able to acknowledge a 4AM page in the PagerDuty mobile app because AWS is down.


(Which was about AWS being down?)


AWS didn’t “go down”. They had an outage in one AZ, which is why there are multiple AZs in each region. If your app went down then you should be blaming your developers on this one, not AWS. Those having issues are discovering gaps in their HA designs.

Obviously it’s not good for an AZ to go down but it does happen and why any production workload should be architected to have seamless failover and recover to other AZs, typically by just dropping nodes in the down AZ.

People commenting that servers shouldn't go down etc. don't understand how true HA architectures work. You should expect and build for stuff to fail like this. Otherwise it's like complaining that you lost data because a disk failed. Disks fail... build architecture where that won't take you down.


AWS is under-reporting the severity of the issue though. The primary outage may be in a single AZ, but there are parts of the AWS stack that affected all AZs in us-east-1, and potentially other regions as well. For example, even now I'm unable to create a new ElastiCache cluster in different AZs of us-east-1.


> I'm unable to create a new ElastiCache cluster in different AZs of us-east-1

Isn't that because Elasticache will distribute the cluster across AZs automatically?

https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/...


In this case, this was specifically with a single-AZ setup, using an AZ that was supposed to be unaffected.
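For reference, a single-AZ ElastiCache cluster like that is created along these lines (cluster ID, node type, and AZ are placeholders), so in theory only the named AZ and the regional control plane are involved:

    aws elasticache create-cache-cluster \
      --cache-cluster-id my-redis \
      --engine redis \
      --cache-node-type cache.t3.micro \
      --num-cache-nodes 1 \
      --preferred-availability-zone us-east-1b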


Issues are across all us-east 1, not one AZ.

Load balancers are not doing well at all. The only way to avoid an outage in this case is to be cross-region or cross-cloud, which is considerably more complex to handle and requires more resources to do well.

And I hope that nobody is listening to your blame-and-point-fingers advice; that's the worst way to solve anything.

It's AWS's job to ensure that things are reliable, that there is redundancy, and that multi-AZ infra is safe enough. The number of issues in US-EAST-1 lately is really worrying.


Some load balancers may be having issues but I have multiple busy workloads showing no issues all morning. One big challenge can be that some people reporting multi-AZ issues are shifting traffic and competing with everyone else, while workloads which were already running in the other AZs were fine. It can be really hard to accurately tell how much the problems you’re seeing generalize to everyone else.

I do agree that the end of this year has been a very bad period for AWS. I wonder whether there’s a connection to the pandemic conditions and the current job market – it feels like a lot of teams are running without much slack capacity, which could lead to both mistakes and longer recovery times.


I hope AWS will provide some explanation about those issues and what actions they will take to prevent those in the future

On our side we saw some EC2 VM totally disconnected from the network in 3 AZs.


Yeah, definitely needs good visibility. They’re asking customers to trust them to a large degree.


But if AWS can't handle the shifting traffic in response to an AZ going down then at that point you're just gambling on whether or not you're in the lucky AZ.


Echoing this. We had to manually intervene and cut off the faulty AZ because our ASGs kept spinning up instances in it and our load balancers kept sending traffic to bad hosts.

In the past I've seen both of those systems seamlessly handle an AZ failure. Today was different.
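For anyone else doing this by hand, the cut-off is roughly the following sketch (the ASG name, load balancer ARN, and subnet IDs are placeholders; the subnet lists include only the healthy AZs):

    # stop the ASG from launching into the bad AZ
    aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name web-asg \
      --vpc-zone-identifier "subnet-aaa111,subnet-bbb222"

    # stop the ALB from routing to the bad AZ
    aws elbv2 set-subnets \
      --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/web-alb/abc123 \
      --subnets subnet-aaa111 subnet-bbb222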


> People commenting that servers shouldn't go down etc. don't understand how true HA architectures work. You should expect and build for stuff to fail like this. Otherwise it's like complaining that you lost data because a disk failed. Disks fail… build architecture where that won't take you down.

Is that comparison fair? If you have 2 mirrored RAID-5 boxes in your room and all disks fail at the same time, you should complain. And that won't happen. These entire-datacenter failures should be anticipated, but to expect them is a bit too easy, I think. There are plenty of hosters who haven't had this happen even once in the last decade in their only datacenter. I do not find it strange to expect or even demand that level of reliability, while still protecting yourself in case it happens, if that fits your specific project and budget.

Edit: OK, I meant that RAID-5 remark in the same context as the hosting; it can and does happen, but it shouldn't; you should plan for a contingency, but expecting it goes far. We never had it (1000s of hard drives, decades of hosting, millions of sites), so we plan for it with backups; if it happens it will take some downtime, but doing it that way costs next to nothing over time. If we expected it, we would need to take far different measures. And we had less downtime in a decade than this AWS AZ had in the past months. I have a problem with the word 'expect'.


> Is that comparison fair? If you have 2 mirrored RAID-5 boxes in your room and all disks fail at the same time, you should complain. And that won't happen.

There are plenty of situations where this might happen if they’re in your room: a lightning strike can cause a surge that causes the disks to fry, a thief might break in and steal your system, your house might burn down, an earthquake could cause your disks to crash, a flood could destroy the machines, and a sinkhole could open up and swallow your house. You may laugh at some of these as being improbable, but I have seen _all_ of these take out systems between my times in Florida (lightning, thief, sinkhole, and flood) and California (earthquake and house fire).

The fix for this is the same fix as being proposed by the parent post - putting physical space between the two systems so if one place become unavailable you still have a backup.


I have had a job where my small, internal tool, for debugging purposes, had to be deployed to a minimum of 3 datacenters. I had 2 of them in the US and one in Europe, and was asked to move one of the US ones to a datacenter that was in another coast, cause who knows, maybe an earthquake will knock off all of the US west coast. That is the paranoia level necessary to achieve crazy high uptime.


> If you have 2 mirrored RAID-5 boxes in your room and all disks fail at the same time, you should complain. And that won't happen.

Here are some examples where that happened:

1. Drive manufacturer had a hardware issue affecting a certain production batch, causing failures pretty reliably after a certain number of power-on hours. A friend learned the hard way that his request to have mixed drives in his RAID array wasn’t followed.

2. AC issues exposed a problem with airflow, causing one row to get warm enough that faults were happening faster than the RAID rebuild time.

3. UPS took out a couple racks by cycling power off and on repeatedly until the hardware failed.

No, these aren't common, but they were very hard to recover from because even if some of the drives were usable you couldn't trust them. One interesting dynamic of the public clouds is that you tend to have better bounds on the maximum outage duration, which is an interesting trade-off compared to several incidents I've seen where the downtime stretched into weeks due to replacement delays or manual rebuild processes.


More generally, any correlation between two items gives potential for a correlated failure.

Same manufacturer, same disk space, same location, same operator, same maintenance schedule, same legal jurisdiction, same planet, you name it, and there's a common failure to match


"Won't happen". The 40,000 hours of runtime bug did happen. I would recommend people to take backups and store them offline or at least isolated from the main storage.


Sure, I plan for it but I do not expect it. And it never happened for me over decades. But I did plan for it, just not in the way the parent said and it did cost me far less.


Good to know ;) Still, I think I am lucky; I have had very few hard disks fail across 1000s. And only very few unrecoverable failures (had to restore from backup); those weren't HD failures but software failures; the HDs were fine.


>And that won't happen

HA! I had received new 16-bay chassis and all of the drives needed, plus cold spares for each chassis. Set them up and started the RAID-5 init on a Friday. Left them running in the rack over the weekend. Returned on Monday to find multiple drives in each chassis had failed. Even with one of the 16 drives dedicated as a hot spare, the volumes would all have failed in an unrecoverable manner.

All drives were purchased at the same time, and happened to all come from a single batch from the manufacturer. The manufacturer confirmed this via serial numbers, and admitted they had an issue during production. All drives were replaced, and at a larger volume size.

TL;DR: Drives will fail, and manufacturing issues happen. Don't buy all of the drives in an array from the same batch! It will happen. To say it won't is just pure inexperience.


Guess I was lucky; we ran a lot of these over the decades, when things were far more unreliable than now, and never experienced anything like that. Manufacturing issues, sure, but we always stress-tested everything we bought for 48 hours to see if that killed it; if it didn't, it usually didn't break afterwards (I still have many of the machines from the mid-to-late 2000s and they don't have disk failures now, even after running for many years).


The older HDDs, imo, are more reliable than the newer ones. Manufacturers are pushing for density and mitigating the inherently higher sensitivity through aggressive error correction.


Would like to know the manufacturer and model.


Sorry, this was back in 2006-2007 time frame. I have no idea on model numbers as that's just not information I ever cared to commit to memory.


My time frame is roughly the end of the 90s to now. I saw more failures in general in the olden days. I have quite a lot of rented servers at traditional hosters or colo that have not been down outside kernel security updates for 10+ years. No hardware issues. I am now swapping them out for new servers which cost less for more, but hardware-wise they have been pounded hard for 10 years without issues. All hot-swap RAID-5 drives, so when one broke, they or we just fixed it without downtime.


>AWS didn’t “go down”

The context of the parent seems to be that they intermittently couldn't get to the console. That seems fair to me. If we're blaming developers and finding gaps in HA design, then AWS should also figure out how to make the console url resilient. If it's not, then AWS does appear to be down.

I imagine it's pretty hard to design around these failures, because it's not always clear what to do. You would think, for example, that load balancers would work properly during this outage. They aren't. Or that you could deploy an Elasticache cluster to the remaining AZs. You can't. And I imagine the problems vary based on the AWS outage type.

Similarly, with the earlier broad us-east-1 outage, you couldn't update Route53 records. I don't think that was known beforehand by everyone that uses AWS. You can imagine changing DNS records might be useful during an outage.


Except many AWS services still route through us-east-1 anyway, which is why they have had huge outages recently. AWS isn't as redundant as people think it is.


Our API is just appsync (graphql) + lambdas + dynamoDB so, theoretically, we shouldn't have been affected. But about 1 in 3 requests was just hanging and timing out.

As others have said, they are not being forthright about the severity of the issue, as is standard.


100% agree. I'm actually surprised AWS hasn't built a Chaos Monkey into their APIs/console so people can regularly test their resiliency against an AZ going down.

edit: of course, AWS does have this: AWS Fault Injection Simulator


AWS Fault Injection Simulator does this.


Is that what they call us-east-1 nowadays?


us-east-1 has always been the canary region hasn't it?


TIL. Thank you!


Because then people would complain about AWS being less reliable than Azure / GCP.


Here's a secret that's now saved me from three outages this month:

Be in multiple AZs, and even multiple regions but if you're going to be in only one AZ or one region, make it us-east-2.


Honestly my server at home has more uptime than US-East-1


I should blog about this one day but...

I have a server at OVH (not affiliated to them) which, at this point, I keep only for fun. It has 3162 days of uptime as I type this.

3,162 days. That's 8+ years of uptime.

Does it have the traffic of Amazon? No.

Is it secure? Very likely not: it's running an old Debian version (Debian 7, which came out in, well, 2013).

It only has one port opened though, SSH. And with quite a hardened SSH setup at that.

I installed all the security patches I could install without rebooting it (so, yes, I know, this means I didn't install all the security patches, since some required rebooting).

This server is, by now, a statement. It's about how stable Linux can be. It's about how amazingly stable Debian is. It's also about OVH: at times they had part of their datacenter burn (yup), at times they had full racks that had to be moved/disconnected. But somehow my server never got affected. It may have happened that at one point OVH had connectivity issues, but my server itself never went down.

I "gave back" many of my servers I didn't need anymore. But this one I keep just because...

I still use it, but only as an additional online/off-site backup where I send encrypted backups. It's not as if it gets zero use: I typically push backups to it daily.

They're only backups, and they're encrypted. Even if my server is "owned" by some bad guys, the damage they could do is limited. Never seen anything suspicious on it though.

I like to do "silly" stuff like that. Like that one time I solved LCS35 by computing for about four years on commodity hardware at home.

I think it's about time I start to do some archeology on that server, to see what I can find. Apparently I installed Debian 7 on it in mid-October 2013.

I've created a temporary user account on it, and at times I've handed the password (before resetting it) to people just so they could SSH in and type: "uptime".

It is a thing of beauty.

Eight. Years. Of. Uptime.


> Like that one time I solved LCS35 by computing for about four years on commodity hardware at home.

Awesome! Are you Bernard Fabrot [0]?

[0] https://www.csail.mit.edu/news/programmers-solve-mits-20-yea...


Yup that's me... I fear this (old by now) story blew my "tacticalcoder" cover.


I read this as a cautionary tale. Here we have a server that is still up only through the grace of god, and is likely owned. If it isn't, it's because of how little is going on with it.

At its current use, it's likely not a major issue, but imagine if someone saw this uptime and thought to take it as a statement of reliability and built a service on it. I, for one, would want that disclosed, because this is a disaster waiting to happen. I'd much rather someone disclose that they had a few servers each with no longer than 7 days of uptime because they'd been fully imaged and cycled in that time...


It works both ways: it is also a cautionary tale for those who are prone to believe it's all unreliable cattle that needs constant restarting because nothing is stable or reliable...


Your server could just be an outlier. Doesn’t really say anything about AWS or any cloud provider.


Does your server at home handle similar traffic to that of US-East-1 since you're comparing uptime?

Similarly, my laptop, if I keep it plugged into the wall and enable httpd on localhost, will surely have better uptime than any of the top clouds. I'd bet it'd have 100% uptime if I plugged in a UPS and only cared about traffic on my local network.


Most people don't need to handle the traffic of US-East-1. They just need a single, simple, mostly reliable server. But they're often told, "Don't do that. It's too hard, and irresponsible, and what if you get a spike in traffic, and what if you need to add 5 new servers, and security is really hard."

In reality, most people don't need to scale. An occasional spike in traffic is a nuisance, but not the end of the world, and security is not terribly hard, if you keep your servers patched (which is trivial to automate).

I really don't understand why there's so much FUD around running your own stuff.


I think most people on here are coming from the perspective of startups, which scale out of a single server setup pretty quickly. At a bare minimum, most will have dedicated purpose-built servers like Redis or a DB, and often there's separate background workers, or a load balancer with a couple of web servers.

When your server requirements get into needing 5-6 servers (not at all atypical for a startup in their first year of being launched), running your own stuff becomes more of a challenge pretty quickly. Factor in 2-3x growth a year, and the challenges just mount.


> running your own stuff becomes more of a challenge pretty quickly. Factor in 2-3x growth a year, and the challenges just mount.

What challenges are you thinking of? You buy a full-rack in colocation and then just buy servers/hardware when required.

If a company has the budget for AWS or some other cloud provider, then it has a budget for colocation, which in the long term is cheaper. I see no additional challenge other than maintaining X amount of hardware rather than just one box.


Long term is unknown to the startup; they may fail or pivot.

Buying hardware upfront is not feasible even if I had the cash (which most don't); I don't know if the company would last that long or would still be doing things that require x servers.

What you are saying is similar to saying maybe it is cheaper to buy the building/floor instead of renting office space. Most small businesses cannot afford to do that, or expect their business to change (fail/take off) before the ROI would justify that commitment.

This is all assuming that the startup has the skills to set up and manage physical servers and that there are no opportunity costs (delayed features) in doing so; neither is a given.

Small companies (and poor people) typically don't buy low-quality stuff or buy into rent-seeking business models because they are dumb; it is usually because they cannot afford to do long-term thinking.


> Buying upfront hardware is not feasible even if I had the cash(which most don't)

You only need one server: slap on a hypervisor and you're rocking. Heck, you can buy entry-level servers from Dell on a budget and upgrade later.

A 10U rack, which is adequate for any small business, comes to around $500 a month in LA. 4U would be more than enough for a startup, and that's around $200/month.

Is the start-up not going to purchase computer hardware, monitors, and television screens for their clients to watch while they sit in the waiting room? Email accounts with Office365, a website, a domain name? If they can fit that into their budget, I am pretty sure they could afford a server and colocation space.

> What you are saying is similar to saying may be it is cheaper to buy the building /floor instead of renting space for office. - most small biz cannot afford do that, or expect their business to change (fail/take off) in the time frame ROI would come to take that commitment.

But colocation is dynamic. Contracts can be negotiated.

> buy into rent seeking business models

And AWS isn't a rent seeking business model?

Looking at EC2 instances, for a 120GB HD, 32-core "Dedicated" instance, you're looking at 679.54 USD for a month. 120GB isn't much, especially when the developers start doing their thing.

For $500 you can have so much more, and hardware you actually own, which can be sold if the company does not lift off. Is that not the better investment?

No remarks on the lack-of-skills point.


You would need to hire a hardware/ops specialist who can buy, spin up, and maintain physical machines. That's a $100K+ job at any reasonable startup (not counting benefits) which is more than our Heroku spend (4-5 year startup). Keep in mind Heroku is way more expensive than just AWS.

This comment just screams of engineer-only focus. Running servers yourself brings almost no value to the customer and is a specialized skill that you have to pay for. All to solve... a couple days of service-specific downtime a year? People need to chill out with their non-mission critical software. God forbid someone can't access their HR portal for an hour.


> You would need to hire a hardware/ops specialist who can buy, spin up, and maintain physical machines.

No. You rack the server, connect the cable to the switch, press the power switch, let it boot, and then operate it as you would any other computer. Need a new server? Get the DC ops remote hands to rack the server, cable it to the switch, and press the power switch. If you can install Windows 10, you can install Linux.

> This comment just screams of engineer-only focus. Running servers yourself brings almost no value to the customer and is a specialized skill that you have to pay for.

Hmm, the data is yours; you own the data, and that's value to me if I am the customer. And there's no added value to the customer if you were to host it in the cloud either; if anything it's a loss in that respect.

There is no specialized skill you have to pay for. Sure, if you were going to run a high-density compute mainframe on some specialized OS like AIX, then yeah. But buying a server, installing an OS, plugging it into a switch, and navigating it with SSH requires no one overqualified.


Configuring a server for high traffic, networking, various safety measures against fire/power, backups, connectivity to your external DBs, etc. etc. As a 'feature dev' of almost a decade, I could bungle my way through this and probably leave a bunch of security gaps open. I'm highlighting the fact that it is a specialized skill and a specific part of the stack that feature devs like me and many other early startup engineers forego for other skills.

To your second point: We still own the data for hosted solutions. We have our backups, we have direct access. Barring a catastrophic failure and wiping of AWS, our data is there and ours. The value to our customers for using cloud providers is the time saved on building infrastructure is instead spent on delivering value to the customer in the form of features/bug fixes. And yes I know the argument of "you end up spending more time debugging AWS", but I think you don't need to reach that point if you keep things pretty simple, especially early stage.

I think you're vastly oversimplifying the task. And if I had to guess, it's something you're pretty familiar with so it makes sense that it's easy to you! I'm sure an early stage company would be happy to have you to save a ton of money in their early days on infrastructure costs


> Configuring a server for high traffic, networking, various safety measures against fire/power, backups, connectivity to your external DBs, etc. etc.

All of which you have to do on a cloud provider too. Fire/power are normally handled by the DC. You have to have someone with the knowledge to operate that in the cloud in the first place, and picking up an application such as HAProxy is on the same skill level, especially with the vast field of blogs you can find on the topic.

The cloud brings the instant "power-up" methodology, and I would concede you could be right. If you want a properly optimized cloud platform you would need a dedicated "cloud" engineer to ensure security, connectivity, etc. But if you're going to do that, then again you might as well move in-house and hire a sysadmin; you've still got to operate your company's infrastructure. It's a moot point.

> The value to our customers for using cloud providers is the time saved on building infrastructure is instead spent on delivering value to the customer in the form of features/bug fixes.

I suppose this is a mixed area, and what customers value varies. For me, I value a service that uses its own hardware rather than the cloud, on the basis that they are willing to put the skill in to operate their own infrastructure rather than outsource it.

> I think you're vastly oversimplifying the task.

I don't think so. People seem to think that setting up what you can in the cloud is impossible on bare metal, when really it isn't. What did DevOps do before the cloud providers? Amazon, Azure, and the rest have only lassoed FOSS software, constructed a web GUI admin panel, and offered it as a service.

This is not to say the cloud doesn't have a purpose; otherwise it wouldn't exist today.

> it's something you're pretty familiar with so it makes sense that it's easy to you

While true, I won't disagree; I've been working in the sysadmin field since 2009. But I do disagree in part, as I started with very little knowledge and gained it through setting up such infrastructures. I've educated a few people, and those with very little knowledge of servers could set up the kind of infrastructure that companies run in AWS.


> Does your server at home handle similar traffic to that of US-East-1 since you're comparing uptime?

Of course it doesn't. Why are you asking antagonistic questions?


He asked it to demonstrate the point that uptime is trivial for one server with no traffic, and much harder at scale with auto scaling.


Then don't host with so many people?

I don't think people care that AWS has other customers; they want their workload to work, and if it doesn't, that's a today issue.


I'm not sure your reply makes sense; it seems like a non sequitur. Individual companies might require thousands of servers and not want them running all the time. This means they either maintain thousands of servers on premises, or they use AWS and autoscale. This has nothing to do with AWS having other customers, and everything to do with only wanting to pay for and maintain as little as possible while still being able to serve your applications.


> Honestly my server at home has more uptime than US-East-1

Is this not antagonistic? It's pointless to make these statements, so your parent comment pointed it out. Go downvote the first one instead.


Your home ISP has 100% uptime? That's incredible.


Let's be real here: we don't need anywhere near 100% ISP uptime to beat AWS over the last couple of months...


That depends on what you mean by AWS. I had production workloads in us-east-1 which haven’t been affected by any of these, and others which had only modest degradation. We had control plane issues but the running services were fine.

Put another way: even if your home ISP has had 100% uptime, are you comfortable saying that was true for all of their customers?


Mine's had 100% uptime for the past 2 months. I've had better value for money using a NUC for personal projects than public cloud subscriptions over the past few years.


I mentioned local network, didn't I...


No but I access my home-server remotely from my university all the time and it hasn't gone down once.

Better uptime than paying for EC2 on AWS US-East-1.

Obviously this approach isn't scalable but it serves me well.


> Obviously this approach isn't scalable but it serves me well.

It's perfectly scalable. Just give everybody their own home server.


The prevailing wisdom throughout the last couple of years was:

“ditch your on-prem infrastructure and migrate to a major cloud provider”

And its starting to seem like it could be something like:

“ditch your on-prem infrastructure and spin up your own managed cloud”

This is probably untenable for larger orgs where convenience gets the blank check treatment, but for smaller operations that can’t realize that value at scale and are spooked by these outages, what are the alternatives?


I don't think it's reasonable to be spooked by these outages, and to think your resolution would be to leave AWS entirely.

A much faster and more effective solution that doesn't have you trading cloud problems with on-prem problems (the power outage still happens, except now it's your team that has to handle it) would be to update your services to run in multiple AZs and multiple regions.

Get out of AWS if you want, but don't get out of AWS because of outages. You should be able to mitigate this relatively easily.


Self-managed infrastructure doesn’t fail now?


What an absolutely pointless comment.

Everything fails, we can argue the rate. But I would argue that understanding your constraints is better.

if you know that your secret storage system can't survive if a machine goes away: well, you wire redundant paths to the hardware and do memory mirroring and RAID the hell out of the disks. And if it fails you have a standby in place.

But if you use AWS Cognito.

And it goes down.

You're fucked mate.


It's pointless to discuss how crappy the cloud is whenever AWS goes down. Most of the businesses relying on automatic RDS backups or EC2 auto scaling just don't have time to think about all the underlying tech. I mean, I don't manually allocate memory for variables anymore either. Do I get screwed when there's a memory leak? Yes. What do I do about it? Move on.


Then don’t host anything, don’t do software and don’t pretend to be “the future”.


This makes no sense. This has nothing to do with the tech and more to do with every team's natural push and pull with build over buy. It's completely pointless to respond to someone who didn't get their DoorDash order with "see this is why you should just make food at home." It completely ignores the reason someone chose to order takeout in the first place.


But if someone chose DoorDash over a smaller delivery service because "DoorDash is big and reliable so the x5-10 cost is worth it", and then DoorDash fails repeatedly, it makes sense to ask "is DoorDash really better than using the smaller vendor?"


If you think you can do better than AWS, GCP, Azure there is a lot of money to be made, for sure.


Many people can do better than GCP/AWS for their workloads, but the marketing hype and fad culture around Amazon Web Services is unbeatable.


Not at this rate.

I remember we had a power outage in 2006, it actually took one of my services off air. Since then of course that has been rectified, and the loss of a building wouldn't impact on any of the critical, essential or important services I provide.


> Not at this rate.

Source? Has there ever been an industry wide survey that compares availability from "insert average colo/data center operations" with the cloud ones?

And I'm not talking about "we have 12 SREs who are based in Cupertino and are all paid top dollar to support a colo"...I'm talking average.


Running a multi-tenant datacenter or hyperscale cloud datacenter is a different business from running your own datacenter. The myth on HN about the cost of running facilities is insane; it's like saying you can't drive a car unless you hire a Formula 1 driver.

I worked through the ranks at a large enterprise that ran a "big" datacenter for a decade. The facilities team was about 6 people, average salary around $90k. I can only remember one power interruption affecting more than a rack, caused by a failure during a maintenance event that required a shutdown for safety reasons. The rest is like any other industrial facility: you have service contracts for the equipment, etc., and maintain things.

There’s a cost/capability curve that you need to plan around for these matters. You need to make business and engineering decisions based on your actual circumstances. If the answer is automatically “AWS <whatever>“, you’re making a decision to burn dollars for convenience.


> The facilities team was about 6 people, average salary around $90k.

Ok so $540k salaries + benefits, so ~$700k. Then you have transaction costs:

- Annual salary increases

- Any cost associated with people leaving (severance, hiring, recruiters, HR, HR systems)

- Systems that run in the data center (logging, monitoring, etc.)

- Procurement costs with changing costs in hardware (silicon shortages, etc.)

- Security compliance overhead and associated risks

- Finance resources required to capitalize and manage asset allocation

- etc. etc.

Versus

- Click a button and voila it works.

- Hire way fewer engineers to manage the system administration portion

> If the answer is automatically “AWS <whatever>“, you’re making a decision to burn dollars for convenience.

100% AGREE. The answer is always "it depends", but just as people say "just put it in the cloud", the opposite, "well, it worked for us using a data center", isn't that simple either.


Say all of those costs are $2,000,000, and you have 25,000 billable endpoints in the datacenter… you’re looking at less than $0.01/hour for that overhead on a unit basis.
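(Back of the envelope, assuming a full year of service hours: $2,000,000 / 25,000 endpoints / 8,760 hours ≈ $0.009 per endpoint-hour.)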

Obviously, there’s a huge capital investment component too that has to be incorporated. Those costs may be really high if you’re in a growth phase as you need to overbuy capacity.

Just to be clear, I’m not arguing that on-prem is magically cheap. :) But it has its place too!


Agreed on all accounts.


That's not what the parent was asking about. The question is whether a company of a certain size is more likely to suffer from an outage on AWS than on its own hardware.

I've been deploying to AWS for years and can't remember an outage on their side in my region. But this is anecdotal and doesn't necessarily reflect the statistics.


I've worked at a few small companies over the years that had their own significant colos and/or data centers built on the cheap, and only a sysadmin or two to run them. Anecdotally, if the infrastructure is set up right, outages are very rare. Some of these were serving up massive loads at the time. I've done these build-outs a few times in my career and it isn't that hard to do; software reliability is more likely to be an issue. The only significant outage I remember is when the redundant power systems in one DC both failed at the same time for different reasons, which can happen.

It is as if the software industry has collectively forgotten how to run basic data center operations. Something that used to be a blue collar skill is now treated like arcane magic.


> It is as if the software industry has collectively forgotten how to run basic data center operations. Something that used to be a blue collar skill is now treated like arcane magic.

It's not arcane magic. It's undifferentiated toil that requires hiring for a different skill set than tech companies generally want to hire for. Of course when you get to a certain size it may make sense to take on this cost.

I want us to stop pretending individual actors lack the agency to make their own decisions, and they're all blind to how AWS is charging them a fortune for such simple things they can do themselves. You get value from AWS or you stop using AWS.


> Not at this rate.

And what rate is this? It gets attention because it impacts more people, but AWS / GCP / Azure uptime is still better than what I've seen for small / mid size businesses trying to manage their own infrastructure.


Multiple outages in a single month.


Across different regions, availability zones, and services. What's the impact to customers? Today it was a very specific EC2 outage in an AZ in us-east-1.

So again, we're talking about cloud providers because of their scope and size, and they're still doing better than MidSizedBank managing their own infrastructure.


We're going to be having this same tired, pedantic, roundabout conversation when Teslas routinely decide to take out a family of four because they mistook a plastic bag for an off-ramp.

Commenters will show up like clockwork and say shit like:

“What man, it’s not like cars didn’t crash before? Haha”

Don’t be dense dude. And definitely don’t pursue a leadership position anytime in the future.


Tesla fans are annoying, but it is absolutely valid that the safety bar for self-driving cars can't be "100% perfectly safe" - it needs to be "safer than the alternative".

The problem with both this example, and the AWS one (it needs to have better availability than your personal home-spun solution, and it does), is that people are amazing at deluding themselves.

"Yes, cars are dangerous, because other people can't drive. But I'm a better than average driver"

"Yes, other people will build unreliable systems. But I know how to architect for my use case and ensure that for my needs the availability will be higher than AWS's"

Both are true* in the micro sense and false in the macro sense.

* Not really. 88% of Americans think they are "above average" drivers.


Well, anyone non-dense will tell you that the most dense thing you can do is say "oh my god, run for your lives" whenever there's an outage. No statistics, no cost-benefit analysis. Just commenting "Haha you can't manage your own RAIDs and Ciscos, what a noob" makes you a thought leader, yes.


“Hybrid and multi cloud” is the future. In other words, give us more fucking money.


Spread the risk? Smaller on prem and cloud / rented bare metal?


Nah, it's actually better to concentrate the risk in this case.

If your app depends on a few 3rd-party services -- SendGrid, Twilio, Okta -- and they're all hosted on different infra, then congrats! You're gonna have issues when any one of them is down, yayyy.

Also the marketing benefit can't be downplayed. If your postmortem is "AWS was having issues" then your execs and customers just accept that as the cost of doing business because there's a built-in assumption that AWS, Azure, GCP are world class and any in-house team couldn't do it better.


> Also the marketing benefit can't be downplayed. If your postmortem is "AWS was having issues" then your execs and customers just accept that as the cost of doing business because there's a built-in assumption that AWS, Azure, GCP are world class and any in-house team couldn't do it better.

In my experience, execs and customers don't treat an outage differently because AWS is at fault. Though the developers do often have the attitude that it's "someone else's problem", which can actually make execs more worried than if the problem was well known and under the company's control.


Google Cloud seems to be doing much better, at least recently. There's also Azure. AWS seems to have placed growth above everything else at customers' expense.


I'm tempted to found a startup to help businesses migrate from cloud providers to on-prem infrastructure.


Slinging some of that sweet Tanzu or Ranger?


Slack seems to have some issues because of that - I'm not sure if anyone is receiving messages, as it became completely silent for the last 15 minutes or so.


Sending and receiving messages works here, but editing them does not, it throws an error. Statuses such as "calling" also do not seem to be updated any longer.

Edit: Restarting Slack does update the edited messages.

Edit 15:24 CET: Slack is back up.


Same: only normal text seems kinda working

- edits failing or working with big lag;

- "Threads" view slow;

- can't emoji-react;

- can't upload images;

- people also say they can't join new channels.


I fail to understand how a big player like Slack can be impacted this way by a failure in a single AZ in a specific AWS region. But at least the main feature (sending and displaying messages) is still working.


https://status.slack.com/2021-12/a17eae991fdc437d

> We are experiencing issues with file uploads, message editing, and other services. We're currently investigating the issue and will provide a status update once we have more information.

> Dec 22, 1:58 PM GMT+1


Uploading images doesn't work for me.


New messages seem to be ok for me, but editing old ones and uploading images both seem to be broken right now.


I can't edit messages, nor create channels. Messages are only received with a several minute delay.


I guess that's why I'm experiencing weird issues with Heroku:

    remote: Compressing source files... done.
    remote: Building source:
    remote: 
    remote: ! Heroku Git error, please try again shortly.
    remote: ! See http://status.heroku.com for current Heroku platform status.
    remote: ! If the problem persists, please open a ticket
    remote: ! on https://help.heroku.com/tickets/new



5ish years ago it was common knowledge that us-east-1 is generally the worst place to put anything that needs to be reliable. I guess this is still true?


I don't know about that. It was more like common knowledge that one availability zone in us-east-1 was a problem - you would have to figure out which one it was, usually by spinning up instances in all 4 zones (now 6)... and that it was the largest of all regions, making it an ideal place to put your service if you wanted to be close to other vendors/partners in AWS...


us-east-1 seems to be AWS’s not so well kept little dark secret!

In all seriousness though - even non-regional AWS services seem to have ties to us-east-1 as evidenced by the recent outages. So you might be impacted even if it looks like (on paper at least) you’re not using any services tied to that region.


Unfortunately, the fact that us-east-1 is roughly 10% cheaper than other regions usually overrides any other concerns


One of our EC2 instances in us-east-1c is unavailable and stuck in "stopping" state after a force stop. Interestingly enough, EC2 instances in us-east-1b don't seem to be affected.

The console is throwing errors from time to time. As usual no information on AWS status page.


Instances stuck in the "stopping" state is pretty common, in my experience.


The affected zone is use1-az4. Whatever that maps to (1a, 1b, 1c) is different per customer.


you can find out which zone is mapped to use1-az4 for your account with awscli:

    aws ec2 describe-availability-zones | jq -r '.AvailabilityZones[] | select(.ZoneId == "use1-az4") | .ZoneName'
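Or, if you'd rather skip jq, the same lookup with a server-side filter should behave identically:

    aws ec2 describe-availability-zones \
      --filters Name=zone-id,Values=use1-az4 \
      --query 'AvailabilityZones[].ZoneName' --output text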


Or if you open the EC2 console (it's up this time!) and scroll down to the bottom.

https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#...:

(Edit: I hope I didn't sound sarcastic. I don't open random console pages and scroll all the way down to check for new features. Some people will have noticed, some won't.)


I had the same issue with unavailable, but on an instance in us-east-1b. Finally just got the force stop to go through a minute ago and it's now running and available again.


Your us-east-1b may be the parent's us-east-1c.

The letters are randomised per AWS account so that instances are spread evenly and biases to certain letters don't lead to biases to certain zones.


Huh, that's interesting. Didn't know that, but makes sense.


You can check which availability zone it is with: aws ec2 describe-availability-zones --region us-east-1
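To see the full name-to-ID mapping for your account at a glance, something like this should work:

    aws ec2 describe-availability-zones --region us-east-1 \
      --query 'AvailabilityZones[].[ZoneName,ZoneId]' --output table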


It's pretty cool. If I recall, they call it "shuffle sharding."


I'm not sure if we should say "AWS is down" if only us-east-1 is down. That region is more unstable than Marjorie Taylor Greene on a one-legged stool.


> I'm not sure if we should say "AWS is down" if only us-east-1 is down.

The thing is, us-east-1 represents the whole AWS for the majority of us.


Can you expand on that? What feature do you use in east 1 that isn’t everywhere else that it’s your whole implementation?


> Can you expand on that? What feature do you use in east 1 that isn’t everywhere else that it’s your whole implementation?

Your question reads as a strawman. It doesn't matter whether EC2 is also available in Mumbai or Hong Kong if, by default, the whole world deploys everything and anything to us-east-1, and us-east-1 alone.

https://www.reddit.com/r/aws/comments/nztxa5/why_useast1_reg...


It's not a strawman. There's a huge difference between "AWS is down" and "customers don't know how to use AWS". For the people who use AWS correctly, they only had some degraded service, not downtime.


> It's not a strawman. There's a huge difference between "AWS is down" and "customers don't know how to use AWS".

Deploying a service to a single region is not, nor has it ever been, "customers don't know how to use AWS".

If anything, cargo-culting this belief that global deployments are necessary, especially for services that have at most regional demand, is a telltale sign that a customer has no idea what they are doing and is just mindlessly wasting money and engineering effort on something no one needs.

This blend of bad cargo cult advice sounds like a variant of microservices everywhere.


There are many AWS services which have only global endpoints and are not geo-specific; all of these are hosted in us-east-1.


And only one AZ in us-east-1. But... it's clearly having a large impact as well.


The 1c part is meaningless. Those letters are randomized per customer to prevent letter biases from leading to more people in 1a for instance.


Was stuck on stopping in us-east-1b. Cannot start now.


Now that everyone and their dog is on AWS, it is not just that 'a website stops working'; half the world, from telephones to security doors and IoT equipment, stops working.

I am not sure if the move to the cloud has reduced the number of failures, but it has definitely made these failures more catastrophic.

Our profession is busy making the world less reliable and more fragile; we will have our reckoning just like the shipping industry did.


It's more like it's making downtimes correlated rather than random. For everything other than urgent communication, I'm not sure if this is a big deal.


All I've noticed is Slack was a bit unreliable for a little bit, but I just carried on and otherwise ignored it. My world did not stop working.


My apartment block has a dialing system that, instead of using a cable that goes to your apartment, relies on IP telephony and calls your mobile phone. It stops working if there is no internet, or your phone is out of battery, or you are not home but your wife is.


Same, maybe that was a related issue.

Today, on Slack I could not edit messages, could not edit statuses, and could not post attachments. Pretty annoying!


So, how many execs are going to push to move to self-managed hosting in the new year?

Packaging a way to migrate off AWS could be a unicorn idea.


None. Amazon hired all the ex-VPs, CTOs, and Directors of small, medium, and large companies with Rolodexes.


Would need one hell of a compression algorithm to keep the data exfiltration costs down.


Pied Piper


Anyone using VMware Cloud services is probably laughing. Just chuck it at Azure or GCP or back on prem.


Depends on how many customers are ready to move to a different vendor. I suspect most customers are forgiving because either they were also down or half the services they use were down. You don't get fired for hosting in AWS.


AWS has its Outposts product for on-prem hosting. Not 100% self-managed, but maybe enough to satisfy the execs and make your market a bit smaller.


Does it come with its own locally-hosted console or does it still rely on the main AWS control plane? If the latter then it could be affected too.


Bitbucket having issues too: https://bitbucket.status.atlassian.com/


4:35 AM PST We are investigating increased EC2 launched failures and networking connectivity issues for some instances in a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. Other Availability Zones within the US-EAST-1 Region are not affected by this issue.

via https://stop.lying.cloud/


Can anyone explain the affiliation of stop.lying.cloud to Amazon? All of the legalese in the header/footer seem to indicate it's actually owned and run by Amazon. If so... why? Why not just... use the real status page?

I mean I'm glad it exists, don't get me wrong. Just weird that they'd have two status pages, one seemingly existing only to sort of 'mock' themselves...


The people who maintain the unofficial site would have, at some point, used their CTRL and C keys followed not immediately, but closely by, their CTRL and V keys.


But that is copyright infringement. You're not allowed to copy some work, modify it, then slap the original copyright on it. This is an illegal website, prone to being taken down by AWS.

It's just strange.


Yes mate, Internet Police Officer Jeff Bezos has been dispatched and will take this illegal website down right away.

(Using copyrighted material is permitted under fair use; this website is a parody. I’m not a lawyer but at some level preserving the copyright notice is probably better than claiming it as their own.)


This website is not a parody, and fair use does NOT permit you to retain the original copyright notice on the derived work.

You may say that the original work is copyright of the respective owners and that this is a parody work. But that's not what the site is doing. The footer contains the original, unaltered copyright, creating confusion as to who owns the derived work. Amazon does not own this, nor do they endorse it, so you're not allowed to say it's copyrighted by Amazon.


Let them sue. I see the headline now, "Amazon sues status page website for accurately reporting on outages". Followed by lots of people hosting mirrors to stop.lying.cloud and saying things like "We are all stop.lying.cloud now"


This has nothing to do with the validity of the case, though. Just because it's a ridiculous court case doesn't mean it's not legally sound...


I have a legally sound case to divorce my wife (anyone does, you can divorce for no reason), however she need not worry as that would be colossally stupid on my part. The same goes here.


This is a reductio ad absurdum comparison.


Can you imagine that some might see your position as one of unmitigated pedantry unfitting in any discussion of this - clearly jocular - website?


Nope. Because these laws also protect people who make such websites from the corporations they're commenting on, too. The respect must be mutual, else e.g. OSS has no basis for legal protection, either.

Being concerned for the proper respect of IP laws is something that benefits everyone.


You make more friends defending humans from big companies than you do defending big companies from humans.

Your argument would have a small amount of merit if you acknowledged that the laws DO NOT protect people like they do corporations. That is a hollow ideal, not reality.


I don't really see your point. My original comment was more pointing out that the operators, if not Amazon, could be seen as infringing their rights. They should update their legalese if they want to be truly protected. How is that defending Amazon?

Regardless of your pointed comment, I'm operating in the land of legal objectivity. The law doesn't care about your feelings much.


Your argument is that IP law should be equally observed by everyone because it protects individuals and corporations alike:

> Nope. Because these laws also protect people who make such websites from the corporations they're commenting on, too.

My response is that your assumption is very obviously wrong: the law does not protect individuals and corporations alike.

That’s all.


> Your argument is that IP law should be equally observed by everyone because it protects individuals and corporations alike

That is a weird understanding of what I said, and I don't really think you're arguing in good faith here. There's a lot of bias so I am choosing to not further this conversation.


oh no :o


Fair use is not a right. It’s a defense. When you are sued for copyright infringement, you have to argue that you’re doing it in fair use. It’s not the “get out of jail free” card people think it is.


> It’s a defense.

That's a warped view of the world. A corporation can always take you to court and harass the crap out of you; the court will side with your defense because you were right and the claimant was wrong; you had the right to do what you did.


I would be worried because getting taken down is Amazon’s speciality.


Actually, having a satire site taken down over copyright is one of the best ways to extort large amounts of money from the copyright holder, because constitutional attorneys will seep from the floorboards and appear in your shower trying to be an attorney on that case. Satire is extremely protected speech.


Drugs are illegal too, yet people do them all the time.

Speeding? Basically a national pastime at this point.

Misrepresentation, common fraud, and misappropriation? Par for the course in most small businesses.

It's only a crime if someone gives enough of a shit to do something about it; otherwise, it's just life.


Satire is the loophole of copyright. If you satirize ANYTHING, you can use the owner's copyrighted material in the satire. One could safely and legally drive an entire nation's transportation industry through that loophole.


No, fair use does NOT allow you to retain the original copyright. That would be passing off a derived work as the original copyright holder's work, which could be very damaging. This is a violation of fair use, if it could even be considered that to begin with.


Fair use in the case of satire is not retaining the original copyright, it is referencing the copyright. It is a legal split hair, but it stands in court.


I think you misunderstand. The website in question has "Copyright (c) Amazon, Inc" in the bottom, when in fact the derived work (the site) is not created by Amazon, but by a third party. IANAL but my understanding is that this copyright notice being retained without any clarification of the owner of the derived work can be seen as endorsement, which is an infringement of copyright unless Amazon has expressly permitted such use (which is usually indicated as such, anyway).

It is also clearly not satire. That would not hold up in court, and there are many instances where they have tried that angle and failed.


That's more of a trademark issue, and would require a reasonable consumer to be likely to be deceived. Which they're not.


No, it's not a trademark issue. They copied the work verbatim (including code, which is not covered by trademark law, but by copyright law), modified it, and then put the original copyright notice in the legalese. This is copyright infringement.

And consumers are clearly deceived - hence why my original comment asking about it was written and has several upvotes.


The direct copyright is covered by the satire exceptions. If you want to argue that copying the legalese is different, it's going to be on those confusion grounds, which IIUC aren't a copyright concern.


AWS might be smart enough not to make that strategic blunder. They won't want to draw attention to the inaccuracies of their status page.


Perhaps, but the status page could exist within legal boundaries with a few trivial changes anyway. Why not just do that?


No one cares, including Amazon. This webpage isn't profiting off them. There is no valuable IP here being stolen.

Amazon has much bigger legal issues to focus on than some satire.


Amazon has taken down trivial things before. This is a dangerous assumption to make - "it's not important to them so I shouldn't worry about the law". Those are famous last words.


I was curious too. An HN user takes credit for it here: https://news.ycombinator.com/item?id=24499159

Apparently it does some simple transformations of the actual status page, which is why the Amazon copyright stuff is in there.


FWIW `lying.cloud` is registered with Namecheap. `amazon.com`/`aws.com`/`amazon.ca` are all registered with MarkMonitor. And I know that AWS uses Gandi behind the scenes for domain registration. Given that, I'd hazard a guess that it's not owned by Amazon. Definitely not a guarantee though.


Amazon's own status page sort of lies. So someone probably wget-ed the status page, kept the same HTML and CSS, and hooked it up to their own API to display correct info.


Corey Quinn (https://twitter.com/quinnypig) runs it.

He also has a decent newsletter and witty commentary, for all things AWS.


It's not official. The people making the page probably just copied everything, including the legalese.



What is this website? Is there an "about" or something? What is it doing differently from the official AWS status page?


Feel for devops peeps who are just trying to chill for Christmas


It seems that it's due to power loss.

[05:01 AM PST] We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.


Bitbucket is affected, pages randomly take forever to load or return 500


Yep, just botched a merge likely because of this.


Bitbucket just completed their migration to AWS too. Rough start.


Fields of green here https://status.aws.amazon.com/ Anyway I can access the web console with no issue (eu-west)


I think it's pretty widely accepted that AWS' own status pages are utterly useless.


You would think that, but there are always a few contrarian AWS evangelists in the comments going on about the "difficulty" of operating a status page, as though it were like trying to conjure a P=NP proof.

Like, how come Downdetector can do a superb job of detecting when AWS goes down and AWS can't? Because AWS doesn't want customers with SLAs asking their account managers for credits for the uptime they're paying for but not getting.

https://downdetector.co.uk/status/aws-amazon-web-services/


Yeah, it was just to confirm that this time was no different :)


In Russia they have a specific name for it:

https://en.wikipedia.org/wiki/Potemkin_village


The elite DevOps teams are always assigned to the status page


Changes to this page require very high-level management approvals (source: used to work at AWS).


Status page says there are issues. It’s not all green.


Now, yes. It took a lot longer for that page to acknowledge the problems than it took half the internet.


If AWS, GCP and Azure go down, we will be back in the stone ages, right?


The only stuff that will work will probably depend on things in AWS in some form.

That, or people never took the “if AWS goes down then lots of people will have a problem, so we’ll be fine” line seriously; there are few such cases.


I do wonder if the great resignation has anything to do with this. My team (no affiliation with Amazon) was cut in half from last year and we are struggling to keep up with all the work


Invision image uploads are down too because of this : https://status.invisionapp.com/


Who needs chaos monkey? Just host on AWS for a similar effect.


Question for the sysadmins here: is it really that outrageous for Amazon to have such issues, or are people too spoiled to appreciate the effort that goes into maintaining such a service?

Edit: Not supporting Amazon, I generally dislike the company. I just don't understand the extent to which the criticism is justified.


The issue is in three parts:

1. Did AMZN build an appropriate architecture?

2. Did AMZN properly represent that architecture in both documentation and sales efforts?

3. What the heck is going on with AMZN?

Let's say that they build an environment in which power is not fully redundant and tested at the rack level, but is fully redundant and tested across multiple availability zones. Did they then issue statements of reliability to their prospective and existing customers saying that a single availability zone does not have redundant power, and customers must duplicate functionality in at least 2 AZs to survive a SPOF?


So why are people not migrating out of us-east-1? Operating in ap-southeast, we weren't that affected by the us-east-1 down time, although our system is reasonably static and doesn't make lots of IAM calls (which seems to be a large SPOF from us-east-1).


Some “global” systems run in us-east-1; even if you're not hosted there, a service you depend on might be.

Notably: Cognito, R53 and the default web UI. (You can work around the web UI one, I'm told, by using a region-specific domain instead of just console.aws.amazon.com.)


Don't forget about CloudFront, which can only be configured via us-east-1.


latency. us-east-1 is positioned very nicely relative to many large businesses in North America and Europe. This gives you pretty good access to a very large percentage of the economies of the world with good latency... while not requiring you to architect your application around multiple regions...


Bitbucket is down as well because of this. https://bitbucket.status.atlassian.com/incidents/r8kyb5w606g...


My Elastic Beanstalk instances are completely unreachable. Seems at the very least ELB is down. Looking @ down detector it looks like this is taking a bunch of sites down with it. As usual AWS status page shows all green.


As an industry, can we please stop making products like vacuums that can't operate unless someone else's computer is working in a field in Virginia? There's literally no reason for it.


I wonder how many 9s AWS is going for. Can't be a lot of 9s anymore.


89.9999 % has a lot of 9s, dare I say military-grade.


Nine Fives is the new Five Nines!


Yay! Adult snowday!


Apropos of nothing, but a few Christmases ago the place I worked had a dedicated fibre line that some workmen doing gas line repairs sawed straight through, taking out everything; I was just a drone worker at the time and it was a beautiful thing.


Looks like the SEC's Edgar website is affected. This is the site the SEC uses to post the filings of public companies. Normally there are a hundred or more company filings in the morning starting at 6am ET. This morning there are two.

https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent


Hubspot seems to be down too [0].

[0] https://status.hubspot.com/


Thank goodness we host all IT services in the same cloud. Imagine the chaos we'd have if everything didn't fail at the same time.



I can't get to the console either, receiving a "Temporarily unavailable" notice without branding.


quay.io is also dead, as well as giphy and some parts of Slack

just the weekly internet apocalypse, happy holidays fellow SREs


As far as I understood a whole availability zone went down; today is also the day a lot of people understand why "multi-AZ" matters, so I don't think it's fair to say that services are down because the whole AWS is down.


Where are you located? "X is down" without a location is only moderately useful.

I'm having issues with Slack from central EU (Poland): can't upload images or send emoji reactions to posts; curiously, text works fine. Wondering if it's linked.


AWS Console runs in us-east-1 so that points to at least that region having issues IIRC. I am also having Slack issues in EU.


You should complain to Slack then. It's their problem to choose a reliable provider, and AWS seems to be having trouble holding on to that status.


Here is The Internet Report episode on the topic of recent AWS outages that covers outage and root causes: https://youtu.be/N68pQy8r1DI


2 of our servers are fucked right now. VOIP services down.

Only with AWS and GitHub do I seem to get panicked text messages on my phone first thing in the morning... Our workloads on Azure typically only have faults when everyone is in bed.



We'll never really know the answer, but I have to wonder what percentage of comments on this thread are from Amazon downplaying the severity & other cloud providers hyping it up.


You give HN too much credit.


I also had a problem loading YouTube at the same time (for 10-15 minutes). It looks like a coincidence, but who knows if Google uses some infrastructure from AWS.


I used to think it was silly to have your own hardware (like a NAS) in your house. What makes you think you can do it better than AWS?

Santa is bringing me a Synology in three days.


Why not both? I just got a Synology NAS and it makes cloud sync dead simple. Now the most important things are on my PC, mirrored on 2 drives in my NAS, and on AWS S3 (or any other cloud storage).


Oh yeah. My plan is to migrate everything to the NAS, then have that back up to Glacier and/or Rsync.net. By S3, do you mean Glacier?


I have some in glacier, some in Infrequent Access.


Assuming crates.io is AWS-backed? Getting a fun situation where direct dependencies of an application are downloading but the sub-dependencies aren't.


crates.io is directly hosted on GitHub, but I'm sure some dependencies use S3 or other AWS services for things.


The crates.io index is hosted on GitHub, but the application/API is hosted on Heroku (so in the us-east-1 AWS region) and the downloads on S3/CloudFront. And yes crates.io is currently impacted.


Yep, S3 possibly the villain here


I wonder if there's an S3-compatible service with similar pricing that can be used as a fallback? Are DigitalOcean's S3-compatible storage accounts backed by real S3?


afaik there's nothing tying it specifically to GH (where the metadata is), and then the actual code is just in an S3 bucket, so in theory it should be reasonably easy [ha!] to just host anywhere. In theory, I mean; that's a massive lump of stuff, and surely wherever it gets hosted is going to face exactly the same issues (though if it does become very widely used, then you'd think every major provider that controls infra could easily have a mirror).


Would Wasabi.com meet your requirements?

I’m not affiliated with them, and haven’t even really used them other than to explore a bit. They come highly recommended by my acquaintances, though.


Ah, back to normal now. Getting intermittent flickers on some of our apps but all seems solid-ish again


Yeah, and I can't publish a crate.


Of all the AWS outages, my team and I have dodged them all, except this one. 3 instances down and unavailable.

> Due to this degradation your instance could already be unreachable

>:(


FWIW I don't think that message has anything to do with this outage. I think it's just a coincidence that you got some degraded hosts. They didn't send out emails like that for this AZ outage (nor would I expect them to -- that email is for when host machines die).


Seems illogical that this is just a single data center in a single US region. We are having issues pulling images from public.ecr.aws from an EU region.


I don't know what's still true, but at one point us-east-1 seemed more critical than other regions because there were some things that had to be there. One thing that comes to mind is ACM certificates used with things like API Gateway (probably Cloudfront), they had to be in us-east-1 no matter where the rest of your infrastructure was.

So it's not shocking to me that something going down in us-east-1 could have impact on other regions.


Meta: I posted a "PyPI is down" link a few days ago, and the post got insta-flagged. Is there some rule about this sort of thing?


Not down as of 7:40 EST. us-east-1 hosted site (athene.com). Cognito, API Gateway, Lambda, S3, DynamoDB, RDS, CloudFront.


Our RDS instances have completely packed up. Hell knows what's going on. Here come the customer support tickets.


Better polish off your BCP docs. People will be asking for them quite a bit more in the new year.


My app running on AWS is currently down. Having intermittent problems with console as well.


I'm getting a plain "504 Gateway Time-out" page when trying to access anything past the console homepage in us-east-1.


also having console issues in us-east-1, bitbucket is randomly throwing bad gateways at me


Somebody call the IT department


Also running a big production app in east-1 and we're experiencing issues.


I'm also in east-1 and completely down.


Ok, enough AWS outages to say I'm tired of hearing about low end stuff being flaky.


"Don't use a self hosted monolithe, it's not reliable! You need a cloud FS with a load balancer under observability and your data in a db that scales horizontally, all orchestrated by kubs."

Meanwhile, I currently have a gig to work on a video service which features a never updated centos 6, an unsupported python 2 blob website, and a push to prod deployment procedure, running a single postgres db serving streaming for 4 millions users a month.

And it's got years of up time, cost 1/100th of AWS, and can be maintained by one dev.

Not saying "cloud is bad", but we got to stop screaming old techs are no good either.


Purely out of interest, I'd like to know more about your streaming architecture. I assume postgres just holds the meta data, and the actual video content is stored elsewhere? What strategies have you employed to scale the streaming part of your service? I imagine 4 million users a month is quite a significant amount of traffic!


1 - For the last 10 years, servers have been beasts. You have a lot of cores, plenty of HD and RAM. Servers are less expensive than devs. Scaling vertically can go VERY far.

2 - Caching is life. We have 3 layers of caching: Cloudflare, Varnish, and Redis. Most things don't need to be real time. A lot of things can be a month old and the user doesn't care. Users need immediate feedback to be happy, but not necessarily fresh data.

3 - If you compile nginx manually, you get to use a lot of plugins that can do stuff super fast, including serving videos. You can script stuff in Lua that will just skip the backend completely.

4 - Mind your encoding. We carefully chose how we encode videos. The ffmpeg parameters are pretty insane, but the space/quality ratio is amazing, especially on mobile. It takes a lot of time to experiment with those, nobody shares them :)

5 - We offload everything we can to cron tasks or task queues. Including, obviously, encoding, screenshotting, etc.

6 - Don't hold data you can't afford to lose, e.g. billing. This way you can have a relaxed attitude toward data. If we ever lose a day of business, users will be in a bad mood for a week, but that won't be the end of the world. We don't need a bulletproof system if bullets can't kill us.

7 - Give money to ffmpeg and OpenCV, because damn those things are fast. And good.

8 - Servers are hosted across 2 providers. This way, if one goes down, or decides to stop doing business with us Google-style, we have a second one. Happened recently with Leaseweb: they shut down a whole room without offering an alternative.

E.g.: votes.

They don't hit the backend on write. We pile them up from nginx into Redis, then once a day we aggregate and store them in Postgres, which the backends consume. We also store each vote in localStorage so the user feels like it's real time when they vote, but in reality it's updated once a day. Votes don't affect the money side of our business, so losing a day of them would not mean death. (A rough sketch of that kind of batching is below.)

P.S: yes, Postgres/Redis/Elasticsearch only hold metadata. Videos are stored on disk. There are no Docker images, no microservices; the FS is ext4. Which means with a lot of RAM, the OS FS cache will have the most popular videos already loaded and ready to be streamed. Everything is RAID 0, so if one disk gets corrupted, we lose the server. But we upload each video to several servers, so when a disk gets corrupted, we just replace the whole server. In fact, if anything goes wrong on a server, we replace it. It's not worth it to find the root cause, unless 2 servers die in the same way successively.
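In Python, the daily cron looks something like this (names and the vote format are simplified for illustration; the nginx/Lua side just RPUSHes "video_id:delta" strings onto a Redis list):

    # Daily cron: drain the Redis "votes" list and fold the counts into Postgres.
    # All names (list key, table, columns, DSN) are illustrative placeholders.
    import collections

    import psycopg2
    import redis

    r = redis.Redis(host="localhost", port=6379)
    pg = psycopg2.connect("dbname=videos")

    def aggregate_votes():
        counts = collections.Counter()
        while True:
            item = r.lpop("votes")   # entries written by nginx/Lua as "video_id:delta"
            if item is None:
                break
            video_id, delta = item.decode().split(":")
            counts[video_id] += int(delta)
        with pg, pg.cursor() as cur:   # one transaction, committed on success
            for video_id, delta in counts.items():
                cur.execute(
                    "UPDATE videos SET votes = votes + %s WHERE id = %s",
                    (delta, video_id),
                )

    if __name__ == "__main__":
        aggregate_votes()

Nothing clever: the write path never touches Postgres, and losing a day of that list is an acceptable failure mode for us.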


This was super interesting to read, thank you very much.

Regarding the ffmpeg parameters and formats in general: Do you use newer formats too, like AV1 and the like?


The main One Weird Trick that I remember from messing with ffmpeg is that for in-browser viewing, I had to generate a lot more mp4 reference frames than the default. This was because of a deficiency in the HTML5 video extensions in browsers a few years ago. I don't know if that is still an issue. The extra frames increased bandwidth consumption, but surprisingly, only slightly rather than big bloat.


No, we use only H264, because nobody has the courage to redo all the work we've done to optimize the encoding with a newer format :)


Thank you!


Heroku isn’t “low end,” it’s a PaaS built on top of AWS. So you’re really just hearing about another AWS outage lol


They're not saying Heroku is low end. They're saying, "I'm tired of hearing that it's irresponsible to run your own servers."

At least, that's what I understood.


Any place I've worked at that managed their own servers (to be fair, the last time I worked at a place like that was 2010) definitely had more protracted downtimes than AWS - it just felt not as bad because we were in control of the situation, but at the end of the day that didn't get us up any faster.

Another side benefit of being with AWS is when you do have an outage, a lot of other people have outages, and so you sort of blend in with the noise. It's not great to be down, but if you're down and also "big service X" who's also an AWS customer is down, it makes your downtime look less like a lack of competence and more like an unavoidable force of nature.


I guess it's extremely dependent on an org to org basis.

I worked at a company whose bread and butter was online services (an e-commerce SaaS platform, similar to NetSuite) and we had significantly fewer outages than AWS had.

But we had redundancies built in to most things, I'm not saying it was perfect but it worked.

The major difference might be that almost nobody is willing to spend 20% of what they spend on AWS/GCP to have a self-hosted solution.

The reason "cloud is so expensive" is because they're essentially telling you what the price will be and even if they only spend 40% of that on actual hardware and operations: it's more than most companies would invest in themselves.

This is absurd, of course, but it's absolutely true.


This comment doesn't say anything about heroku?


Right. I've had an excellent experience with Vultr for the last couple of years, for about 1/10th the cost of AWS. I use other small VPS providers as well. I run my own small business and I need to keep costs down to stay competitive. I used to use AWS more but the bill always creeps up to inappropriate levels. AWS billing is insulting: oh, you forgot to renew your reserved instance? That's going to be double this month. I still use CloudFront, Route 53, and a few of the smallest instances for mail servers and Asterisk though. It's foolish to go all in with AWS, or with anything really.


Nobody ever got fired for using AWS.


I wonder to what extent this actually becomes less of a problem the more people use AWS. At this point AWS being down just feels like "the internet is down", it's hard for customers to be too mad at any company being down when all their competitors are too.

Though I guess there's still probably just lost revenue that could be captured by having better uptime, even if your competitors are down.


This seems like an interesting pendulum swing where the few companies not reliant on AWS could capture significant enough revenue by maintaining uptime during a potential busy season outage.


Maybe they'll just find themselves DDoSed by the sudden influx of visitors? As a small example, I think HN was slightly slower when FB had their outage.


That depends on them being up the rest of the time. If they have an equal number of outages as AWS, it has the risk to make them look worse (since all attention is on them when they're out).


These two outages have been incredibly anomalous, I doubt you'd get much revenue betting that they'll be a common occurrence.


Agree. Now viable alternatives exist. The nextgen cloud providers will learn from the weaknesses of incumbents and innovate.


True, sad fact. I first thought it was a management problem, but lately I see it is the tech bros who push for fads in the hope of staying relevant and not assuming responsibility for choices.


(Accidentally down-voted, apologies! I would upvote to fix, but can't... Update: fixed)

Agreed. Arguably, not having used an existing cloud service is a red flag for any new hire. AWS being the primary, but experience using GCS or Azure is at least a viable skill, even if your business is AWS-based.

But the "fad-based development" meme is not going away any time soon. The incentives in the business are built around it (really! No one wants to work on a boring old relational database solution any more). In the old days it was 4th generation languages, RUP, XML and Function Point Analysis... today it's functional programming, SDKs, big-three cloud PaaS experience or (shudder) blockchain.

I think back to my much younger self, when I thought that technology was something to be mastered to solve real-world problems, and I laugh. Little did I know the real problem to be solved was figuring out how to solve those same old business problems with the technology of the season (Kubernetes, GraphQL or ML).


Omg, this needs to be on a plaque or something.

"Let's move our internal app with 50 users to k8s in the cloud." --true story


Amazing. And as long as "technological progress" sufficiently obscures the impact of such ridiculousness, such projects will continue to occur.

It's a real shame that the collective world of technology does not properly respect the simple solutions that work.

The dichotomy here is almost funny. Most technical people "admire" the simplicity, elegance and extensibility of the command line. But tell those same people that the best data store for the solution is a relational database and their noses crinkle up.


Yeah, after getting caught up in the hype for ten years I'm running back to proven tech that is fleshed out (Java, Swing (omg it just works), wanting to try Ruby; even PHP is looking good at this point).

Every dependency scrutinized and discarded if possible.

I would probably work for free if someone set up their own on-prem cloud in Tanzu, OpenShift, or Rancher and used old-school proven frameworks for development.

Working in AWS has been a really shitty experience at these large companies. All the nitpicky problems (of which there are thousands) get dumped on devs who are trying to deliver working software.


Today DO also went down. We briefly couldn't log in.


Just the control panel or were your instances down as well?


Just the control panel, we couldn't login


maybe except a team at google? ;)


I have a story from only a few years ago where the finance section, and a good portion of management, of Google had no idea how poor their GAE solution was for uptime, until they tried to do business critical work using software that was hosted on GAE.

Uptime improved rather dramatically after that.


yeap. it's sloowwly getting there.


If you rely solely on east 1 maybe?


AWS doesn't follow their own advice about hosting multi-regional.

When us-east-1 is sufficiently borked the management API and IAM services in all regions tend to go down with it.

Static infrastructures usually avoid the fallout, but anyone dependent on the API or otherwise dynamically created resources often get caught in the blast regardless of region


I didn't hear any reports of that happening in the most recent outage. The console was inoperable but you could work around using regional console host names.


If you're referring to Dec 7, it absolutely did. Metrics went down nearly across the board, which also means most auto-scaling setups were non-functional. Cloudfront metrics didn't properly recover until the next day

Logging in with root credentials was not possible in any region, and even logging in with IAM creds in other regions yielded an intermittently buggy console

and as is usual with us-east-1 outages management API calls were a complete crap shoot regardless of region


There are some services that do have hard US-EAST-1 dependencies. Cloudfront, because of certificates. Route53. The control API for IAM (adding/removing roles, etc). And there's also the notion of "global endpoints" like https://sts.amazonaws.com... it's not clear why that exists, because it fails when us-east-1 does. It would be better to only have regional endpoints if the "global" ones are region-specific in reality. The endpoint thing is documented, but it's still confusing to people.

The dependency chains can bite you too. During the us-east-1 outage, a Lambda run by cron-like schedules via EventBridge was itself in an okay state, but the EventBridge events that kick it off were stuck in a queue that was released when the problem was fixed. So if your Lambda wasn't idempotent, and you ran it in another region during the outage, you ended up with problems.
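For what it's worth, a minimal sketch of pinning a client to a regional STS endpoint instead of the global one (region and endpoint here are just examples; newer SDKs also have an sts_regional_endpoints setting that does much the same thing, if I remember correctly):

    import boto3

    # https://sts.amazonaws.com is served out of us-east-1; pointing the client
    # at a regional endpoint keeps credential calls working when us-east-1 is
    # having a bad day. Region and endpoint below are just examples.
    sts = boto3.client(
        "sts",
        region_name="eu-west-1",
        endpoint_url="https://sts.eu-west-1.amazonaws.com",
    )
    print(sts.get_caller_identity()["Arn"])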


It wasn't reliable. I heard of many more who weren't able to get in that way than who were, and was in the former category myself.

We didn't take any downtime, but if anything had gone wrong there would have been nothing we could do about it until IAM came back up.


What a way to start my day


Can we please stop saying, “AWS is down”?

AWS consists of over 200 services offered in 86 availability zones in 26 regions each with their own availability.

If one service in one availability zone being impaired equals a post about “AWS is down” we might as well auto-post that every day.


AWS doesn't follow their own advice about hosting multi-regional so every time us-east-1 has significant issues pretty much every AZ and region is affected.

Specifically large parts of the management API, and IAM service are seemingly centrally hosted in us-east-1.

If your infrastructure is static you'll largely avoid the fallout, but if you rely on API calls or dynamically created resources you can get caught in the blast regardless of region


Seems enough services in us-east-1 are down to cause most apps to fail. My simple app uses 10s of AWS services, at least some of which are out.


I may have seen more of these posts than you. The last one I saw where “AWS is down” was us-west-1.


Would be cool if this wasn't the region where AWS hosts their internals, making other regions unusable, right?


It's like my grandma saying "Honey, the internet is broken again." xD


Why isn't Heroku showing a status error despite being offline?


Because it's built on AWS and uses the AWS status page for its status info?


Console is sluggish for me, but S3 (us-east-1) seems to work fine.


I can't play Borderlands 3 this morning (Epic).

Wonder if it's connected?


This S3 how you gonna get you investment back from it


Isn't the point of the design of an availability zone having multiple data centers so that if a single data center in the availability zone fails, services aren't affected?


Imgur is suffering from this too, I think.


A problem with log4j/Log4Shell?


the cloud is great they said...


Rumble was up all this time.


Bitbucket is down as well


Stat That.


Me: hesitant at my last job about moving absolutely everything (including backups) to AWS, because if it goes down it's a problem. I'm a firm believer in some kind of physical/easily accessible backup.

Coworkers: "You're an f'n idiot. Amazon and Facebook don't go down, you're holding us back!" <- Quite literally their words.

Me: leaves, because that treatment was the final straw.

Amazon and Facebook both go down within a month of each other, and suddenly they needed backups.

Them: shocked Pikachu face


I'd be surprised if they needed backups for a few hours of downtime with (reportedly) complete recovery where no data was corrupted. There are industries where this would be required, and it's possible I guess, but neither of these downtime events were "data loss" events, just availability events for short-ish periods of time that wouldn't - for me - result in activating our DR plans.

I must admit that I do always try and maintain a separate data backup for true disaster recovery scenarios - but those are mainly focused around AWS locking me out of our AWS account (and hence we can't access our data or backups) or recovering from a crypto scam hack that also corrupts on-platform backups, for example.


I once had to argue that we still need backups even though S3 has redundancy. They laughed when I mentioned a possible lock-out by AWS (even due to a mistake or whatever). I asked what if we delete data from the app by mistake? They told me we need to be careful not to do that. I guess I am getting more and more tired of arrogant 25-year-old programmers with 1-2 years in the industry and no experience.


One thing you should absolutely not count on, but which might be a course of action for large clients, is to contact support and ask them to restore accidentally / maliciously deleted files.

I would never use this as part of a backup and restore plan, but I was lucky when a bunch of customer files were deleted due to a bug in a release. Something like 100k files were deleted from Google Storage without us having a backup. In a panic we contacted GCP. We were able to provide a list of all the file names from our logs. In the end, all but 6 files were recovered.

I think it took around 2-3 days to get all the files restored, which was still a big headache and impactful to people.


This is not a reliable mechanism btw. There will be times when they won't be able to restore the data for you. Their product has options to avoid this situation like object versioning.


S3 (and others) have version history that can be enabled.

If you have to take care of availability and redundancy and delete protection and backups yourself, then why pay the premium S3 is charging?

Either you don't trust the cloud, and you can run a NAS or equivalent (easily with S3 APIs today) much cheaper, or you trust them to keep your data safe and available.

No point in investing in S3 and then doing it again yourself.


> No point in investing in S3 and then doing it again yourself.

I mean that's just obviously wrong, though.

There is a point.

> Either you don't trust the cloud and you can run NAS or equivalent (with s3 APIs easily today) much cheaper or trust them to keep your data safe and available.

What if you trust the cloud 90%, and you trust yourself 90%, and you think it's likely that the failure cases between the two are likely to be independent? Then it seems like the smart decision would be to do both.

Your position is basically arguing that redundant systems are never necessary, because "either you trust A or you trust B, why do both?" If it's absolutely critical that you don't suffer a particular failure, then having redundant systems is very wise.


My point is: if your redundancy is better than AWS's, then why pay for them? If it's not, then why invest in your own?

You can argue that you protect against different threats than AWS does. So far I have not seen a meaningful argument for threats that on-prem protects against differently than the cloud such that you need both.

Say, for example, your solution is to put all your data backups on the moon; then it makes sense to do both, since AWS does not protect against planet-wide threats.

However, if you are both protecting against exactly the same risks, having provider redundancy only protects against events like AWS going down for days/months or going bankrupt.

All business decisions carry some risk; provider redundancy does not seem like a risk worth mitigating given what it would cost for most businesses I have seen.

Even Amazon.com and Google's apps host on their own clouds and don't use multi-cloud, after all. Their regular businesses are much bigger than their cloud businesses, yet they still risk those to stick to their own cloud/services only.


> My point is if your redundancy is better than AWS then why pay for them ? If it not they why invest in your own?

This is a really confusing question. Redundancy requires more than 1 option. It's not about it being better than AWS, it's that in order to have it you need something besides just AWS. AWS may provide redundant drives, but they don't provide a redundant AWS. AWS can protect against many things, but it cannot protect against AWS being unavailable.


> Even Amazon.com or Google apps host on their own cloud and not use multi cloud after all, their regular businesses are much bigger than their cloud biz

This is probably true with Google, but AWS contributes > 50% of Amazon's operating income. [1]

[1] https://www.techradar.com/news/aws-is-now-a-bigger-part-of-a...


Interesting, no wonder the AWS head became Amazon CEO.

Their retail/e-commerce side is less profitable than AWS, but the absolute revenue is still massive, and the risk of losing a chunk of that revenue (and income) due to tech issues is still an enormous risk for Amazon.


If you trust your airbag, why bother with the seatbelt?


True. With two independent servers at 90% each, that's a 0.1^2 = 1% chance both fail, so redundancy can add a lot of reliability.


Only if they are truly independent of each other.

You and AWS are using similar chips and similar hard disks, with similar failure rates.

If you both use the same hardware from, say, the same batch, both can have defects and fail at similar times. Or you both use the same file system, which, say, corrupts both your backups.

90% is not a magic number. You need to know AWS's supply chains and practices thoroughly, and keep yours different enough that you don't share the same risks AWS does, for your system to have an independent probability of failure.


True. One would want to continually decorrelate services or model the dependencies. Redundancy will help even with some dependency, but you raise an important point.


You assume failures are uncorrelated. Which, depending on what you think you are protecting yourself from, might or might not be true.

(Consider a buggy software release which incorrectly deletes a backup. Depending on the bug it’s very possible it will delete in both places.)


If one buggy software release can delete both copies, then you don't have actual redundancy from the point of view of that issue.


In most startups? You're mostly correct.

But you still have some risks here, yes, with a super low probability, but a company-killing impact.

In some industries - banking, finance, anything regulated, or really (I'd argue) anywhere where losing all of your data is company killing - you will need a disaster recovery strategy in place.

The risks requiring non-AWS backups are things like:

- A failed payment goes unnoticed and AWS locks you out of your AWS account, which also goes unnoticed, and the account and data are deleted

- A bad actor gains access to the root account by faxing Amazon a fake notarized letter, finding a leaked AWS key, or social engineering one of your DevOps team, and encrypts all of your data while removing your AWS-based backups

- An internal bad actor deletes all of your AWS data because they know they're about to be fired

...and so on.

There's so many scenarios that aren't technical which can result in a single vendor dependency for your entire business being unwise.

A storage array in a separate DC somewhere where your platform can send (and only send! not access or modify) backups of your business critical data ticks off those super low probability but company-killing impact risks.

This is why risk matrices have separate probability and impact sections. Miniscule probability but "the company directors go to jail" impact? Better believe I'm spending some time on that.


Just to add that S3 supports a compliance object lock that can't be overridden even by the root user. Also, AWS doesn't delete your account or data until 90 days after the account is closed.

Between these two protections, it's pretty hard to lose data from S3 if you really want to keep it. I would guess they are better protections than you could achieve in your own self-managed DC.

I'm guessing AWS has some clause in their contract that means they can refuse to deal with you or even refuse to return any of your data if they feel like it. Not sure if that's ever happened, but still worth considering.
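If it's useful, roughly what that looks like with boto3 (bucket name and retention period are placeholders; Object Lock has to be enabled at bucket creation, and a compliance-mode retention can't be shortened once set, so treat this as a sketch rather than a recommendation):

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")

    # Object Lock must be enabled when the bucket is created (placeholder name);
    # it also forces versioning on.
    s3.create_bucket(Bucket="example-compliance-backups",
                     ObjectLockEnabledForBucket=True)

    # Default retention in COMPLIANCE mode: nobody, root included, can delete or
    # overwrite a locked object version until the retention window expires.
    s3.put_object_lock_configuration(
        Bucket="example-compliance-backups",
        ObjectLockConfiguration={
            "ObjectLockEnabled": "Enabled",
            "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 90}},
        },
    )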


Yes, threat models are the obvious qualifier; if you have a business that requires a backup on the moon in case of an asteroid collision, then by all means go for it. [1]

For most companies, what AWS or Azure offers is more than adequate.

An internal bad actor with that level of privileged access can delete your local backups too; anything they can do to AWS, they can likely do more easily to your company's storage DC.

Bottom line: if customers want to pay for all this low-probability stuff that can supposedly only happen in the cloud and not on-prem, sure, go ahead. Half the things customers pay for they don't need or use anyway.

[1] assuming your business model allows the expense outlay you need for that threat model


Nope. 3-2-1 strategy: 3 copies, 2 different media, 1 offsite. Now try to delete files from the media in my safe. Only I have a key.

Sure, your threat model may vary. But relying on cloud-only for your backups is simply not enough. If you split access for your AWS backup and your DC backup between two different people, you've mitigated your threat model. If you only have 1 backup location, that's going to be very hard.


All of these are questions asked and solved 10 years ago by bean counters whose only job is risk mitigation.

Every cloud provider has compliance locks which even the root user cannot disable, plus version history, and you can set up your own copy workflow from one storage container to a second container, with no delete/update access to the second one, split between two different people, or whatever.

You don't need to do any of it offsite.


Not sure I agree about the usefulness of different media.

Having had to restore databases from tapes and removable drives for a compliance/legal incident, we had a failure rate of >50% on the tapes and about 33% for the removable drives.

I came away not trusting any backup that wasn’t on line.


We have AWS backups, "offsite" backups on another cloud provider, and air-gapped backups in a disconnected hard drive in a safe.

The extra expense outlay for the 2 additional backups is approximately $50/month, so it's not going to break the bank.


Egress from AWS is not cheap.

At $50/month scale a lot of things are possible. Most companies cannot store their data on a hard disk in a safe. If you can, then the cloud is a convenience, not a necessity, for you; i.e. you are perfectly fine running your own storage stack for the most part.

My company is not very big (100ish employees) and we pay $200k+ for AWS in storage alone, and AWS is not even our primary cloud. If we had to do what you do, it would probably be another $500k in bandwidth costs alone. Add running costs in another cloud, recurring bandwidth for transfers, and retrieval from Glacier for older data on top of that. [1]

Over 3 years that would easily be $1-1.5 million in net new expenses at our scale.

No sane business is going to sign off on +3x storage costs for a risk that cannot be easily modeled [2] and costs that cannot be priced into the product, just so one sysadmin can sleep better at night.

[1] Your "hard disk in a safe" third copy is not a sensible discussion point at reasonable scale.

[2] This would be: probability of data loss with AWS * business cost of losing that data > cost of a secondary system.

Or: probability of a data availability event (like now) * business cost of that > cost of an active secondary system.

For almost no business in the world would either inequality hold.

For example, even if the cost is $100B in lost revenue, with 6 nines of durability the expected loss would be only $100,000 ($100B * 0.000001); a secondary system is much costlier than that.
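A toy version of that check with the example numbers plugged in (all figures are illustrative, not real business data):

    # Expected loss = revenue at risk * probability of losing the data.
    revenue_at_risk = 100e9              # $100B, the example figure above
    p_loss = 1 - 0.999999                # "6 nines" of durability -> 1e-6
    expected_loss = revenue_at_risk * p_loss
    print(f"expected loss: ${expected_loss:,.0f}")   # -> expected loss: $100,000
    # A secondary backup system only pays for itself if it costs less than that.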


> My company is not very big(100ish employees)

I don't get how this is relevant at all, it's more about how much data your company stores than how many employees it has.

I've worked for a company with 5000 employees that stored less data (fewer data?) than my current employer that has less than 100.

> No sane business is going to sign off on +3x storage costs on a risk that cannot be easily modeled

Probably not, but for us the cost is about 0.1x our aws storage costs, so it's a no-brainer.


There are completely independent risks that you are dealing with here. If you are a small company there is a non-insignificant risk that your cloud account will be closed and it will be impossible to find out why or to fix it in a timely manner. There have been several cases that were only fixed after being escalated to the front page of Hacker News, and we haven't heard about the ones that didn't get enough upvotes to get our attention and were never fixed.

Also, what we saw on Dec 7th was that the complexity of Amazon's infrastructure introduces risks of downtime that simply cannot be fully mitigated by Amazon, or by any other single provider. More redundancy introduces more complexity at both the micro level and macro level.

It doesn't really cost that much to at least store replicated data in an independent cloud, particularly a low-cost one like Digital Ocean.


Back up on-site and store tertiary copies in a cloud. Storing all backups in AWS wouldn't meet a lot of compliance requirements. Even multiple AZs in AWS would not pass muster, as there are single points of failure (API, auth, etc.).


Whether you realize it or not, you believe in the Scapegoat Effect, and it's going to get you into a shitload of trouble some day.

Customers don't care if it's your fault or not, they only care that your stuff is broken. That safety blanket of having a vendor to blame for the problem might feel like it'll protect your job, but the fact is that there are many points in your career where there is one customer we can't afford to lose for financial or political reasons, and if your lack of pessimistic thinking loses us that customer, then you're boned. You might not be fired, but you'll be at the top of the list for a layoff round (and if the loss was financial, that'll happen).

In IT, we pay someone else to clean our offices and restock supplies because it's not part of our core business. It's fine to let that go. If I work at a hotel or a restaurant, though, 'we' have our own people that clean the buildings and equipment. Because a hotel is a clean, dry building that people rent in increments of 24 hours. Similarly, a restaurant has to build up a core competency in cleanliness or the health department will shut them down. If we violate that social contract, we take it in the teeth, and then people legislate away our opportunities to cut those corners.

For the life of me I can't figure out why IT companies are running to AWS. This is the exact same sort of facilities management problem that physical businesses deal with internally.

I have saved myself and my teams from a few architectural blunders by asking the head of IT or Operations what they think of my solution. Sometimes the answer starts with, "nobody would ever deploy a solution that looked like that". Better to get that feedback in private rather than in a post-mortem or via veto in a launch meeting. But I have had less and less access to that sort of domain knowledge over the last decade, between Cloud Services and centralized, faceless IT at some bigger companies. It's a huge loss of wisdom, and I don't know that the consequences are entirely outweighed by the advantages.


Erm.

In some orgs, recreating lost data, code, deployments and more is literally hundreds of thousands of hours of work.

In a smaller org, the devastation can be just as stark. Losing hundreds of hours of work can be a death knell.

Anyone advocating placing an entire org's future on one provider is literally, completely incompetent.

It's the equiv of a home user thinking all their baby pics will be safe on google or facebook. It is just plain dumb.


Having an additional AWS account that your S3 buckets back up to, with write-only permissions (no delete), and that isn't used by anyone day to day, seems like a good idea for this type of situation/concern.
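Something like this on the backup bucket, roughly (account ID, role name and bucket name are placeholders; root in the backup account can still rewrite the policy, so it's a sketch of the idea rather than a guarantee):

    import json

    import boto3

    s3 = boto3.client("s3")

    # The backup bucket lives in the rarely-touched account; the writer role in
    # the main account may only add new objects, and deletes are denied outright.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowWriteOnlyFromMainAccount",
                "Effect": "Allow",
                "Principal": {"AWS": "arn:aws:iam::111111111111:role/backup-writer"},
                "Action": ["s3:PutObject"],
                "Resource": "arn:aws:s3:::example-offsite-backups/*",
            },
            {
                "Sid": "DenyDeletes",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": "arn:aws:s3:::example-offsite-backups/*",
            },
        ],
    }
    s3.put_bucket_policy(Bucket="example-offsite-backups",
                         Policy=json.dumps(policy))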


I had this experience when I asked about s3 backup also (after a junior programmer deleted a directory in our s3 bucket...). The response from r/aws was "just don't let that happen" or ("use IAM roles")


FYI, at the latest re:Invent AWS announced a preview of AWS Backup for S3 (right now in us-west-2 only).

Relevant blog post, https://aws.amazon.com/blogs/aws/preview-aws-backup-adds-sup...


> I asked what if we delete data from app by mistake? They told me we need to be careful not to do that.

Ah, the Vinnie Boombatz treatment.


Maybe they are getting tired of arrogant older programmers assuming they cannot possibly be wrong. God forbid a 25 year old might actually have a good idea (and I am far removed from my 20s).

Maybe having S3 redundancy wasn't the most important thing to be tackled? Does your company really need that complexity? Are you so big and such an important service that you cannot possibly risk going down or losing data?


You really chose to die on “backups are for old people” as a hill?


I'm not sure how you got "backups are for old people" from my post. My point is that there are two sides to this. Perhaps the data being stored on S3 _was_ backup data and this engineer was proposing replicating the backup data to GCP. That's probably not the highest priority for most companies. Maybe the OP was right and the other engineers were wrong. Who knows.

In my experience, the kind of person that argues about "arrogant 25 year olds that know everything" is the kind of person that only sees their side of a discussion and refuses to understand the whole context. Maybe OP was in the right, maybe they weren't. But the fact that they are focusing on age and making ad hominem attacks is a red flag in my book.


I’ve most definitely been in numerous places where arrogant 25-year-olds with CS degrees, but not smart enough to make it to FAANG, think they know what they are talking about when they don't. Not every 25-year-old is an idiot, but many, especially in tech, think they are smarter than they are because they're paid these obscene amounts of money.


I’d love to know what someone works on when the risk of losing data is not worth one or two days of engineering work.


But that's just it; you can't even have that discussion if the response to "hey, should we be backing up beyond S3 redundancy?" is "No. Why would we? S3 is infallible"


Sure you can. As the experienced engineer in that setting it is a great opportunity to teach the less experienced engineers. For example, "I have seen data loss on S3 at my last job. If X, Y, or Z happen then we will lose data. Is this data we can lose? And actually, it is pretty easy to replicate - I think we could get this done in a day or two."

It's also possible the response was "That's an excellent point! I think we should put that on the backlog. Since this data is already a backup of our DB data, I think we should focus on getting the feature out rather than replicating to GCP."

Those are two plausible conversations. Instead, what we have is "these arrogant 25 year olds that have 1-2 years of experience and know it all." That's a red flag to me.


>"Maybe they are getting tired of arrogant older programmers..."

And this is of course a valid reason to ignore basic data preservation approaches.

Myself, I am an old fart and I realize that I am too independent / cautious. But I see way too many young programmers who just read the sales pitch and honestly believe that once data is on Amazon/Azure/Google it is automatically safe, their apps are automatically scalable, etc. etc.


Yes - the point of that line was to be ridiculous. Age has nothing to do with it. Anyone at any age can have good ideas and bad ideas. There are some really incredible older and highly experienced engineers out there. But there are others that think that experience means they are never wrong. Age has nothing to do with this - what is important is your past experience, your understanding of the problem and the context of the problem, and how you work with your team.

And again, my point isn't that you never need backups. My point is that it is entirely plausible that at that point in time backups from S3 weren't a priority.


Sounds like the kind of short-term thinking that leads to companies being completely wiped out by ransomware. Who needs backups anyway?


It's not a lot of complexity.

Add object versioning for your bucket (1 click) and mirror/sync your bucket to another bucket (a few more clicks).

Yes, your S3 costs will double, but usually they're peanuts compared to all the other costs, anyway.

Debating it takes longer than configuring it.
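The boto3 equivalent of those clicks is roughly this (bucket names and the replication role ARN are placeholders; the destination bucket needs versioning enabled too, and could sit in another region or another account):

    import boto3

    s3 = boto3.client("s3")

    # Versioning on the primary bucket, so overwrites and deletes are recoverable.
    s3.put_bucket_versioning(
        Bucket="example-primary",
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Mirror new objects into a second, already-versioned bucket.
    s3.put_bucket_replication(
        Bucket="example-primary",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::111111111111:role/s3-replication",
            "Rules": [
                {
                    "ID": "mirror-everything",
                    "Status": "Enabled",
                    "Priority": 1,
                    "Filter": {},
                    "DeleteMarkerReplication": {"Status": "Disabled"},
                    "Destination": {"Bucket": "arn:aws:s3:::example-mirror"},
                }
            ],
        },
    )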


As I understand it, Aeonflux was talking about redundant backups outside of S3, which are much more complex.


Cron-ran SFTP from s3:// to digitalOcean:// ?
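Or skip SFTP entirely, since DO Spaces speaks the S3 API; a rough cron-able sketch (bucket names, region and keys are placeholders, and it reads whole objects into memory, so it's only sensible for modest object sizes):

    import boto3

    src = boto3.client("s3")
    dst = boto3.client(
        "s3",
        region_name="nyc3",
        endpoint_url="https://nyc3.digitaloceanspaces.com",
        aws_access_key_id="SPACES_KEY",        # placeholder credentials
        aws_secret_access_key="SPACES_SECRET",
    )

    # Copy every object from the S3 bucket into the Spaces bucket.
    paginator = src.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="example-primary"):
        for obj in page.get("Contents", []):
            body = src.get_object(Bucket="example-primary", Key=obj["Key"])["Body"]
            dst.put_object(Bucket="example-offsite", Key=obj["Key"], Body=body.read())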


Would you put the one and only copy of your family photo album up on AWS, where AWS going down would mean losing it? Because your customers' data is more important than that


AWS going down means I've lost it or temporarily lost access to it? Those are two very different scenarios. Of course S3 could lose data - a quick Google search shows it has happened to at least one account. My guess is it is rare enough that it seems like a reasonable decision to not prioritize backing up your S3 data. I'm not saying "never ever back up S3 data", only that it seems reasonable to argue it's not the most important thing our team should be working on at this moment.

I have my family photos on a RAIDed NAS. It took me years to get that setup simply because there were higher priority things in my life. I never once thought "ahh I don't need backups of our data" I just had more important things to do.


Losing data usually means losing customers. Usually more customers than just the ones whose data you lost.


I suppose the caveat is you have to have customers to lose them :) We don't know what the data is or the size of the company.


Next time also mention that it might be a problem to get a consistent backup of microservices...


AWS has had at least one documented incident where a region had an S3 failure that was not recoverable. They lost about 2% of all data. That might not sound like much but if you have a lot of data, partial restoration of that data doesn't necessarily leave your system in a functional state. If it loses my compiled CSS files I might be able to redeploy my app to fix it. Then again if I'm a SaaS company and that file was generated in part from user input, it might be more difficult to reconstruct that data.


Which incident is this? I can’t find it online. The closest I can recall is when they lost some number of EBS volumes. We were affected by that, but ran snapshots (to s3) to recover the affected servers.


Sorry, when was this? Please provide a citation.


Today's gentle reminder that there are things other than network or service outages that can and do occur that might necessitate an outside backup.

What happens if AWS or [insert other megacloud] decides your account needs to be nuked from orbit due to a hack or some other confusion? We almost had this happen over the summer because of a problem with our bank's ability to process ACH payments. Very frustrating experience. Still isn't fully resolved.

What happens if an admin account is taken over and your account gets screwed up?

What happens if an admin loses his shit and blows up your account?

What happens if your software has a bug that destroys a bunch of your data or fubars your account?

There's a ton of cases where having at least a simple replica of your S3 buckets into a third-party cloud could prove highly valuable.


Would you be able to expand at all about the ACH/AWS connection, obviously without identifying details?

Was it just a miscommunication around AWS billing and them thinking you weren't paying? Or did AWS somehow put itself in the middle of, or react to, your use of ACH payment processing for *non-AWS* receivables or payables?

If the latter, that's a business risk I'd never even thought about. I'm not even sure how they'd know. But I'm mindful that things like the MATCH list [0] exist, and how easily a merchant can accidentally wind up on these lists from either human error or a small number of high-value chargebacks. If cloud providers are somehow paying attention to merchant services reputation, that would be very scary for many businesses!

[0] https://www.merchantmaverick.com/learning-terminated-merchan...


Like most of these things, it was a series of unfortunate events.

In our case {LargeCloud} acquired {SaaSVendor}. We were already using {LargeCloud}, with an existing billing arrangement. When {LargeCloud} got around to integrating the {SaaSVendor} into their billing system, it exposed multiple bugs in {LargeCloud}'s billing system, and ultimately limitations in our bank's internal systems--a well known establishment and it would blow your mind to learn how much manual crap they do.

Traditionally, we received favor from {SaaSVendor} through Invoices. But when {SaaSVendor} was subsumed by {LargeCloud}, we stopped receiving invoices. Our internal ops reached out to {LargeCloud} about this two days before we got our first "You will experience Dire Consequences" email from {LargeCloud}'s Robot Overlords. Our attempts to contact {LargeCloud} regarding this concerning message were always routed to a Robot Overlord who only spoke in tongues and could not solve our problems. Eventually, we were able to get the Robot Overlord to escalate us to a Robot Superlord that would only tell us to "follow the instructions in this handy dandy web page thing", except following the instructions always summoned a "Server 500" Demon, which {LargeCloud} claimed was impossible because their Robots are Divine and Holy.

Finally circling back through random Human Actors we were able to avert the countdown to destruction. Some Robot Necromancer was able to resurrect our billing account from the "Server 500" Demon, but we would now need to set up automatic ACH payments, as whatever fix was implemented could only persist with regular monthly succor upon the altars of the Federal Reserve Automated Clearing WaffleHouse. Invoices, payments arranged through Our Lady of Visa and The Master Card would no longer suffice.

We believed we had made the appropriate incantations before FratBoy 3000 at our local branch of the Federal Reserve Chapel. However, we eventually received another threat of Dire Consequences from {LargeCloud}, indicating that our prayers were not received. It took significant supplication in order to get FratBoy 3000 to confirm that our Federal Reserve Chapel had misrouted our prayers, deducting them from our account, but sending them to the wrong Demon, through no fault of our own.

The whole time this was going on, we kept getting threats of Dire Consequences. We were told by Human Actors to have great faith, that the {LargeCloud} Robot Overlords had been placated through their secret prostrations. FratBoy 3000 was replaced by our Federal Reserve Chaplain, who informed us that they had no robots, this was all the result of Human Actor failures, but that, forthwith, all of our prayers could be answered if we moved all of our faith into a New Account which itself required additional monthly supplication, but would ensure divine routing of our prayers would always be successful.

To this day, we continue to make our monthly pilgrimage to our local Federal Reserve Chapel, supplicating upon all necessary altars. The threats of Dire Consequences from {LargeCloud} have subsided. But we have cast ourselves out onto the trail, seeking refuge from a more receptive and responsive Federal Reserve Chapel.

Everybody focuses on "what if us-east-X goes down", but, literally, sometimes it's a combination of billing and payment issues that can keep you up at night.


I would make a friendly wager that AWS user IDs don't contain check digits, let alone bullet proof ones (simple check digits don't guard against transposition errors). And that somewhere, someone can manually enter an account to delete, and that one of us will eventually have an account numbered XXX1234 and some idiot with account XXX1243 will legitimately earn an account deletion, but we'll be the ones who wake up to bad news.


Think about it this way:

1) Can you make your on prem infrastructure go down less than Amazon's?

2) Is it worth it?

In my experience most people grossly underestimate how expensive it is to create reliable infrastructure and at the same time overestimate how important it is for their services to run uninterrupted.

--

EDIT: I am not arguing you shouldn't build your more reliable infrastructure. AWS is just a point on a spectrum of possible compromises between cost and reliability. It might not be right for you. If it is too expensive -- go for cheaper options with less reliability.

If it is too unreliable -- go build your own, but make sure you are not making a huge mistake, because you may not understand what it actually costs to build to AWS's level.

For example, personally, not having to focus on infra reliability makes it possible for me to focus on other things that are more important to my company. Do I care about outages? Of course I do, but I understand that doing this better than AWS does would cost me a huge amount of focus on something that is not the core goal of what we are doing. I would rather spend that time thinking about how to hire/retain better people and how to make my product better.

And adding all the complexity of running this infra to my company would make the entire organisation less flexible, which is also a cost.

So you can't look at the cost of running the infra like a bill of materials for parts and services.

And if there is an outage, it is good to know there is a huge organisation there trying to fix it, while my small organisation can focus on preparing for what to do when it comes back up.


On the other hand, perhaps the large cloud providers bring a level of complexity that outweighs their skills at keeping everything up. What I mean is, a basic redundancy and failover setup with two data centers is kind of straightforward. Sure you need a person on call 24/7 to oversee it, but it's conceptually not that complicated. And if you're running bare metal, you get a surprising level of performance per dollar and rack unit. On the other hand, the big clouds are immensely complex with multiple layers of software defined networking, millions of tenants, thousands of employees, acres of floor space, org charts, etc. If you're running your own infra as one competent sysadmin, you know nobody else in another department will push a networking code change that will break your shit in the middle of the night. Maybe it's not right for everyone, but it's not unreasonable to go on prem in 2021 despite the popular opinions otherwise. Source: my company runs on prem and routinely has 100% uptime years. Most unplanned downtime occurs early on a Sunday morning following a planned action during a maintenance window.


I was and continue to be surprised how reliable even old servers are. I run a small homelab (Debian VMs on Proxmox; a Docker host, a jumpbox, a NAS running ZFS, etc.) on seven-year-old hardware, and all of my problems are self-imposed. If I leave everything alone, it just works.

As a counterpoint, though, my last place had a large Java app, split between colo'd metal and AWS. Seemed like the colo'd stuff failed more (bad RAM mostly, a few CPUs, and an occasional PSU). Entirely anecdotal.


> Can you make your on prem infrastructure go down less than Amazon's?

Obviously depends on what you need, but for a small to medium web app that needs a load-balancer, a few app servers, a database and a cache, yes absolutely - all of these have been solved problems for over a decade and aren't rocket science to install & maintain.

> Is it worth it?

I'd argue that the "worth" would be less about immunity to occasional outages but the continuous savings when it comes to price per performance & not having to pay for bandwidth.

> overestimate how important it is for their services to run uninterrupted.

Agreed. However when running on-prem, should your service go down and you need it back up, you can do something about it. With the cloud, you have no choice but to wait.


I have run high availability (HA) systems on prem, and your statement vastly understates the difficulty and expense.

You need multiple physical links running to different ISPs, because builders working on properties further down the street could accidentally cut through your fibre. Or the ISP themselves could suffer an outage.

You need a backup generator, and to be a short distance from a petrol station so you can refuel quickly and regularly during longer power outages. You absolutely do not want to run out of diesel!

You need redundancy for every piece of hardware AND you need to test that failover actually works, because the last thing you need is a core switch failing and traffic not routing over the secondary core switch as expected.

You need multiple air-con units, powered off different mains inputs, so that if the electrics fail on one unit it doesn't take out the others. I guarantee you that if the air con fails, it will be on the hottest day of the year, and no amount of portable units will stop your servers from overheating.

You need a beefy UPS with multiple batteries. Ideally multiple UPSs, with each UPS powering a different rail in your racks, so that if one UPS fails your hardware is still powered from the other rail. And you need to regularly check the battery status and loads on the UPSs. Remember that the backup generator takes a second or two to kick in, so you need something to keep power to the servers and networking hardware uninterrupted. And since all your hardware is powered via the UPS, if that dies you still lose power even if the building is powered.

And you then need to duplicate all of the above in a second location, just in case the first location still goes down.

By the way, all of the possible failure points I’ve raised above HAVE failed on me when managing HA on prem.

The reason people move to the cloud for HA is because rolling your own is like rolling your own encryption: it’s hard, error prone, expensive, and even when you have the right people on the team there’s still a good chance you’ll fuck it up. AWS, for all its faults, does make this side of the job easier.


That's true for on-prem infrastructure, but it's all already handled for you if you rent servers from hosting providers such as OVH/Hetzner, or even rent colocation space in an existing DC, and it is still cheaper than the cloud equivalent (and, as we saw recently, actually more reliable as well).


But then you’re still reliant on those hosting providers not fscking up; just like with cloud providers. Literally the same complaint the GP was making about AWS applies for OVH et al too.

In fact I used to run some hobby projects in OVH (as an aside, I really liked their services) so I’m aware that they have their own failures too.


Old-school hosting providers have a lot fewer moving parts than cloud providers. They have their outages, but they're usually less frequent.


Are they though? Let’s look at what the recent AWS outages have been: a single region (but AWS makes multi-region easy). The biggest impact to most people is the AWS console, something that one seldom actually needs given AWS is API driven. If the same type of outage happened on OVH then you’d lose KVM to your physical servers. But you seldom need those either.

The Azure outage was just the AD service, but you can roll your own there if you want.

Plus if you want to talk about SaaS then OVH et al have their own SaaS too. In fact the difference between OVH and AWS is more about scale than it is about reliability (with AWS you can buy hardware and rack it in AWS just like with OVH too).

Or maybe by "old-school" you mean the few independent hosts that don't offer SaaS. However, they're usually pretty small fry and their outages are less likely to be reported. Whereas any AWS service going down is massive news.

I’m not a cloud-fanboy by any means (I actually find AWS the least enjoyable to manage from a purely superficial perspective) but I’ve worked across a number of different hosting providers as well as building out HA systems on prem and the anti-cloud sentiment here really misses the pragmatic reality of things.


Most people using AWS aren't using multi-region, as evidenced by the wide array of problems on the internet when a region goes down.

I would also argue many aren't even using multiple availability zones, as evidenced by the wide array of problems on the internet when a single AZ goes down.

I think you're vastly over-estimating how most companies are using AWS, and are substituting your own requirements for theirs.

Which is very common in tech. It's part of why people on HN shit on cloud, microservices, and other techniques the large mega-corps use. People write posts with lots of assumptions and few details, then people who don't know any better just carbon-copy it because hey, it's what Google does. Meanwhile their lambda microservice system serving a blazing 60 requests per minute has more downtime than if I just hosted it on my laptop over my dial-up internet connection.


I fully believe some people are doing AWS wrong. But you cannot compare the worst offenders on AWS against the best defenders of deploying on prem; it's just not a fair like-for-like comparison of the worst against the best.

Hence why I compare doing HA in AWS correctly vs doing HA on prem correctly.


I've had way more networking and availability failures from Hetzner this year alone than I've ever seen from AWS. They regularly replace their networking switches without any redundancy, leaving entire DCs offline for hours. They're okay for hobby projects, but I would never host a business-critical site with them.


Cannot confirm, do you have details?

Yes, Hetzner upgrades DCs (datacenter buildings), but they are the equivalent to AWS AZs (Availability Zones). When they upgrade a DC, they notify way in advance, and if you set up your services to span multiple DCs as is recommended, it does not affect you.

We run high-availability Ceph, Postgres, and Consul, across 3 Hetzner DCs, and have not had a Hetzner-induced service downtime in the 5 years that we do so.


That's fair enough, I was comparing single-AZ AWS outages to single-DC Hetzner outages, since that seems to be what people are focusing on. For multi-DC deployments, I think laumars' sibling response to mine makes a much better argument: ultimately, you're still choosing who to pay and who to trust, and if things go down, there's nothing you can do to fix it. "Low-tech" cloud providers like Hetzner, colo providers, Amazon, PaaS: in a physical downtime event like this one, they're all the same.


@nostrebored Well, Hetzner never went down for the 7 years I managed a HA setup spanning three of their data centers. One of the DCs was unavailable for a few hours during a planned moving op, but we had no outages. None.


That’s not what you’ve seen recently. When Hetzner goes down nobody cares, because nobody with important workloads and brain cells is running them there.

Colo space assumes that the colo is operating more efficiently than AWS/Azure/GCP when in reality you’re comparing apples and oranges.


It is much easier than you think. There are well-defined standards, trained tradespeople, and a whole host of companies who make great products and provide after-sales service. Every major financial services, telecom, and high-precision manufacturing company runs its infra this way. It is definitely less niche than rolling your own encryption.


    financial services, telecom and high-precision manufacturing companies
One of these things is not like the other, one of these things is not the same...

What use does a CNC shop have for an extensive on-prem multi-DC with failover and high availability? It'd be like buying your own snowplows to make sure that the road is clear so your employees can get to work. Maybe necessary if you live in a place with very bad snowplows and no existing infrastructure, but in most places, just a waste of money.


My analogy wasn’t saying it’s niche. It was comparing the difficulty. And yes, there are trained people (I’m one of them ;) ), but that doesn’t make it easy, cheap, or even less error-prone than using cloud services. Which was my point.

Also, the reason those companies usually run their own infra is historically down to legislation more than preference. At least that’s been the case with almost all of the companies I’ve built on-prem HA systems for.


> You need multiple physical links in running to different ISPs because builders working on properties further down the street could accidentally cut through your fibre.

At my last job we provided redundant paths (including entry to your building) as an add-on service. So you might not need two ISPs if you're only worried about fiber cuts. You could still be worried about things like "we think all Juniper routers in the world will die at the exact same instant", in which case you need to make sure you pick an ISP that uses Cisco equipment. And of course, it's possible that your ISP pushes a bad route and breaks the entirety of their link to the rest of the Internet.


> You need a back up generator and to be a short distance away from a petrol station so you can refuel quickly and regularly when suffering from longer durations of power outages.

I don't see why the petrol station needs to be a short distance away. Unless the plan is to walk to the petrol station and back (which should not be the plan[1]), anyplace within reasonable driving distance should do.

[1] long duration electrical outages will often take out everything a short distance away, and the petrol stations usually have electric pumps.


Because there are laws on what you can and cannot fill with fuel. So you may find you have to make smaller but more frequent visits.

Also, buying fuel from a petrol station is going to be more expensive than having a commercial tanker refill your tank. So ideally you wouldn’t be making large top-ups from the local petrol station except during exceptional outages.

As for wider power outages affecting the fuel pumps, I suspect they might have their own generators too. But even if they don’t, outages can still be localised (e.g. road works accidentally cutting through the mains for that street; I’ve had that happen before too). So there’s still a benefit to having a petrol station nearby.

To be clear, I’m not suggesting those petrol stations should be 5 minutes walking distance. Just close enough to drive there and back in under half an hour.


A typical multi-MW power-hungry high-tech facility (datacenters, manufacturing, hospitals, etc.) will have large underground fuel storage tanks big enough to run the full load on generators for a couple of days, and they are continuously kept topped up by fuel tanker trucks through contracts with bulk fuel distributors. They usually have an SLA of a 40 kL tanker on 4 hours' notice. In case of advance warning of heavy rains/floods or other natural disasters that can disrupt road networks, they can have more fuel trucks stationed close by on standby. Depending on your contract, you may have priority over other customers in the area. These are fairly standard practices.


Indeed, but that wasn’t the type of facility the GP was talking about when they said running web services was a solved problem.

If you do move to an established data centre then you’re back to my earlier point: you’re still dependent on their services instead of having ownership to fix all the problems yourself (which was the original argument the GP made in favour of switching away from the cloud).


> I don't see why the petrol station needs to be a short distance away

some natural disasters can render driving trickier than walking. extremely large snow storms, for instance. you can still walk a block, but you might be hard pressed to drive 5 miles.

(i don't have a bone in this particular cautiousness-fight; personally i'd just suggest folks producing DR plans cover the relevant natural disasters for the area they live in, while balancing management desires, and a realistic assessment of their own willingness to come to work to execute a DR plan during a natural disaster.)


You really don't need almost any of this stuff. If you have small on-prem needs just grab a couple fiber links, try for diversity on paths for them (good luck), add some power backup if it fits your needs, and be done.

If you are going to the level of the above, you go with co-location in purpose built centers at a wholesale level. The "layer1" is all done to the specs you state and you don't have to worry about it.

On-prem rarely actually means physically on-prem at any scale beyond a small IT office room. It means co-locating in purpose built datacenters.

I'm sure examples exist, but the days of large corporate datacenters are pretty much long over - just inertia keeping the old ones going before they move to somewhere like Equinix or DRT. With the wholesalers you can basically design things to spec, and they build out 10ksqft 2MW critical load room for you a few months later.

A few organizations will find it worthwhile to continue to build at this scale (e.g. Visa, the government) but it's exceptionally small.


> You really don't need almost any of this stuff. If you have small on-prem needs just grab a couple fiber links, try for diversity on paths for them (good luck), add some power backup if it fits your needs, and be done.

Then you’re not running HA and thus the argument about cloud downtime being “worse” than on prem is moot.

Obviously if your SLA is basically “we will do our best” then there are all sorts of short cuts one can take. ;)


My (late) point is that when people speak about "on-prem" these days, they are not talking about building corporate datacenters on campus.

30 years ago when you talked on-prem that's what this meant. It's now shifted to on-prem meaning your own hardware in massive shared facilities that handle all that "hard stuff" like redundant power and cooling for you.

Bespoke datacenter builds for true-on prem certainly exist, but it's not what that term typically means any longer - at least in my line of business. When I'm selling racks of colo now, my customers are calling that their on-prem facilities.

In fact a large part of my previous business was dismantling true "on-prem" facilities to move to such large shared wholesalers.


> You need a back up generator and to be a short distance away from a petrol station

My building has a natural gas backup generator.


Does it have its own gas well? My sister has a home backup generator, they lost power during some cold snap and some pumping component failed and her neighborhood lost gas too. The only house in the neighborhood that had heat/power had a big propane tank because it was built before the neighborhood got gas.

I’ve never seen a data center with natural gas backup power. But I don't know if that's because of reliability or if it's too expensive for a big natural gas hookup that's used rarely. Though I have heard of the opposite -- using natural gas turbines as primary power and utility power as backup.


The human capital side would disagree with that, I think. You're assuming the organization which owns this small/medium web app already has the personnel on staff to handle such a thing.

If you're outsourcing that, you'd likely have to pay a boatload just for someone to be available for help, let alone the actual tasks themselves. Like you said, if you're on-prem and something goes down, you can do something. But you've gotta have the personnel to actually do something.

That said, I think you're spot-on as long as you have the skillset already.


You still need to pay someone to manage AWS infrastructure. It’s possible to save money using AWS, but things often get more expensive.


Of the SMBs I’ve worked with, about 5% had a dedicated AWS engineer.


Maybe in non-tech businesses, which I’m not very familiar with. But most startups absolutely have a dedicated DevOps engineer.


> Human capital side would disagree with that I think

I hear this argument a lot, but every startup I've been involved with had a full-time DevOps engineer wrangling Terraform & YAML files - that same engineer can be assigned to manage the bare-metal infrastructure.


> I hear this argument a lot, but every startup I've been involved with had a full-time DevOps engineer wrangling Terraform & YAML files - that same engineer can be assigned to manage the bare-metal infrastructure.

Bare metal infrastructure requires a lot more management at any given scale. I mean, you can run stuff that lets you do part of the management the same as cloud resources, but you also have to then manage that software and manage the hardware.


Define "a lot".

We colocate about 20 servers and on the average month, no one spends any time managing them. At all.


I think if you put a bit of effort into classifying importance, you can likely justify backing up certain critical systems in more than one way. Let "the cloud" handle everyone's desktop backups and all the ancillary systems you don't really need immediately to do business, but certain important systems should perhaps be backed up both to the cloud and locally, like Windows Domain Controllers and other things you can't do anything without.

Backup is cheap when you're focused on what you're backing up.

In this case, the game isn't "going down less than Amazon", it's about going down uncorrelated to Amazon. Though that's getting harder!

"In more than one way" doesn't have to be local, but it may be across multiple cloud services. Still, "local" is nice in that it doesn't require the Internet. ("The Internet" doesn't tend to go down, but the portion you are on certainly can.) Of course, as workers disperse, "local" means less and less nowadays.


> In this case, the game isn't "going down less than Amazon", it's about going down uncorrelated to Amazon.

It's possible to go down in a mostly uncorrelated way to Amazon by just being down all the time.

Obviously this is implicit in your comment, but I'll say it anyway: your backups need to actually work when you need them. You need to test them (really test them) to make sure they're not secretly non-functional in some subtle way when Amazon is really down.
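
A rough sketch of what a periodic restore test can look like (the backup path, table name, and gunzip step are placeholders for whatever your backup job actually produces): restore into a throwaway location and assert something about the data, rather than just checking that the file exists.

    # Hypothetical restore drill: paths, table names, and the compression step
    # are placeholders, not anyone's real backup layout.
    import sqlite3
    import subprocess
    import tempfile
    from pathlib import Path

    BACKUP = Path("/backups/latest/app.sqlite3.gz")  # wherever your backup job writes

    def restore_drill() -> None:
        with tempfile.TemporaryDirectory() as tmp:
            restored = Path(tmp) / "restored.sqlite3"
            # 1. Restore into a throwaway location, never on top of production.
            with restored.open("wb") as out:
                subprocess.run(["gunzip", "-c", str(BACKUP)], stdout=out, check=True)
            # 2. Assert the restored data is actually usable, not merely present.
            con = sqlite3.connect(str(restored))
            (rows,) = con.execute("SELECT count(*) FROM orders").fetchone()
            con.close()
            assert rows > 0, "backup restored but contains no orders -- investigate"
            print(f"restore OK: {rows} rows")

    if __name__ == "__main__":
        restore_drill()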


It really depends on how reliable you need to be. Don’t forget you get downtime from both AWS and your own issues, so even 4 9’s is off the table with pure AWS. If you need to be more reliable than AWS, you need to run a hybrid inside and outside of AWS, which means most of the advantages of running on AWS go away.


Very untrue. Many businesses with 4 9 SLAs are all in on AWS. It requires active/active setups though!


Many businesses claim 4 9 SLAs on AWS, but that doesn’t mean they actually provide it. It’s simply a question of what the penalties for failing to reach their SLA are.


> Can you make your on prem infrastructure go down less than Amazon's?

Over the last two years, my track record has destroyed AWS. I've got a single Mac Mini with two VMs on it, plugged in to a UPS with enough power to keep it running for about three hours. It's never had a second of unplanned downtime.

About 15 years ago I got sick of maintaining my own stuff. I stopped building Linux desktops and bought an Apple laptop. I moved my email, calendars, contacts, chat, photos, etc, to Google. But lately I've swung 180 degrees and have been undoing all those decisions. It's not as much of a PITA as I remember. Maybe I'm better at it now? Or maybe it will become a PITA and I'll swing right back.

EDIT: I realize you're talking in a commercial sense and I'm talking about a homelab sense. Still, take my anecdote for what it's worth. :D


Not my company, but I work with another company that does (nearly?) all of their infrastructure on premises. They have pretty great uptime, in large part because they're not dependent on the 3-4 global state mechanisms that consistently cause outages with cloud providers (DNS, BGP, AWS's role management/control plane, &c.).

I think you're right about what we over- & under-estimate, but that we also under-estimate the inflection point for when it makes sense to begin relying on major cloud services. Put another way: we over-estimate our requirements, causing us to pessimistically reach for services that have problems that we'd otherwise never have.


Also, you can just take two different amazon regions and hope they don't both go down at the same time.

For extra safety, and extra work, you could even take Azure as a backup if you're not locked in with AWS.
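
For what it's worth, a hedged sketch of the two-region idea using Route 53 DNS failover (the hosted zone ID, domain, and load balancer hostnames below are invented); the pushback in the reply below about Route 53 itself being run out of us-east-1 still applies.

    # Hedged sketch: active/passive DNS failover across two regions with Route 53.
    # The hosted zone ID, domain, and load balancer hostnames are made up.
    import boto3

    route53 = boto3.client("route53")
    ZONE_ID = "Z0000000EXAMPLE"   # hypothetical hosted zone
    NAME = "app.example.com."

    # Health check against the primary region's endpoint.
    hc_id = route53.create_health_check(
        CallerReference="app-primary-hc-1",
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": "lb.us-west-2.example.com",
            "Port": 443,
            "ResourcePath": "/healthz",
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )["HealthCheck"]["Id"]

    def failover_record(set_id, role, target, health_check_id=None):
        record = {
            "Name": NAME,
            "Type": "CNAME",
            "TTL": 60,
            "SetIdentifier": set_id,
            "Failover": role,   # "PRIMARY" or "SECONDARY"
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id:
            record["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record}

    route53.change_resource_record_sets(
        HostedZoneId=ZONE_ID,
        ChangeBatch={"Changes": [
            failover_record("primary", "PRIMARY", "lb.us-west-2.example.com", hc_id),
            failover_record("secondary", "SECONDARY", "lb.eu-west-1.example.com"),
        ]},
    )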


forgive me repeating myself: AWS regions are not truly independent of each other.

Global services such as route53, Cognito, the default cloud console and Cloudfront are managed out of US-East-1.

If us-east-1 is unavailable, as is commonly the case, and you depend on those systems, you are also down.

it does not matter if you're in timbuktu-1, you are dead in the water.

it is a myth that Amazon regions are truly independent.

please stop blaming the victim, because you can do everything right and still fail if you are not aware of this; and you are perpetuating that unawareness.


Of course that depends on what services you use and yes, even then there is some remaining correlation just because it is the same host.

> are not truly independent of each other

Indeed. They are even on the same planet!

> please stop blaming the victim

Excuse me?


>> are not truly independent of each other

> Indeed. They are even on the same planet!

Clever bastard, aren't you.

>> please stop blaming the victim

> Excuse me?

"If you're affected by us-east-1 outages then you're not hosting in other regions and you're doing it wrong".

Except: You can be affected by this outage if you did everything right. You're putting blame on people being down for not being hosted in different regions when it would not help them. You've effectively shifted blame away from Amazon and onto the person who cannot control their uptime by doing what you said.


> "If you're affected by us-east-1 outages then you're not hosting in other regions and you're doing it wrong".

You are attributing a quote to me which I never expressed, nor was that expressed elsewhere in this thread. You are even using quotation marks....

I certainly didn't mean to blame anyone. You appear to see this AWS issue as one of victims and victimizers. I was just trying to point out a degree of agency that people may have in some situations.


I was not quoting you, I was echoing your sentiment back to you with different words to see if you disagreed with it.

I was just re-wording the sentiment.

Let me quote you properly.

> Also, you can just take two different amazon regions and hope they don't both go down at the same time.

Do you see how replacing that in my comments does not change the sentiment?


1) Can you make your on prem infrastructure go down less than Amazon's?

It's now hard to say how frequently Amazon's infrastructure goes down. The incident rate seems to have increased lately.


My on prem infrastructure goes down drastically less than Amazon's.

...My home Internet is even scoring better than Amazon right now, in fact. Yours probably is too.


I have a bolt lying on my desk.

It hasn't failed since 1970 when it was produced.

It must have been built better than the Space Shuttle, then.


Sysadmin 102: Simple Systems Fail Less Often.


3) Could you hire talent that can build the thing?

In my experience problem number 3 is the hardest to solve.


You’re missing a huge factor: agency.


My last startup migrated from Verizon Terremark after the healthcare.gov fiasco several years ago. We also suffered from that massive outage and that was the final straw in migrating to AWS.

At AWS, we built a few layers of redundant infrastructure, with multi-AZ availability within a region and then global availability across multiple regions. All this was done at roughly half the cost of the traditional hosting, even when including the additional person-hours required to maintain it on our end.

Keeping our infra simple helped that work, and it's literally been years since we had an outage caused by an AWS issue, even though there have been several large AWS events.


Every time one of these conversations happen I end up thinking to myself that Oxide Computing needs three more competitors and a big pile of money.

AWS maintains a fiction of turnkey infrastructure, and the reality of building your own is so starkly different that I haven't seen an IT group for some time that could successfully push back on these sorts of discussions.

Building your own datacenter is still too much like maintaining a muscle car, fiddly bits and grease under your fingernails all the time, meanwhile the world has moved on, and we now have several options in soccer mom EVs that can challenge a classic Corvette in the quarter mile, and obliterate its 0-60-0 time. There is no Hyundai for the operations people, and there should be.

I don't know the physics of shipping such a thing, but I think we really do need to be able to buy a populated and pre-wired rack and slot it into the data center. Literally slot it in. If you've ever been curious about maritime shipping, you know that they have a system for securing containers to cranes, trailers, each other, and I don't see a reason you couldn't steal that same design for mounting a server rack to the floor. Other than the pins would need to be removable (eg, a bolt that screws into a threaded hole in the floor) so you don't trip on them.

In a word, we need to make physical servers fungible. There are any number of things that we need to do to get there, but I think we can. Honestly I'm surprised we haven't heard more of this sort of talk from Dell, especially after they bought VMWare. This just seems like a huge failure of imagination. Or maybe it's simply a revolution lacking a poster child. At this rate that 'child' has already been born, and we are just waiting to see who it is.


I don't think putting the hardware into the rack is really the sticking point; what people like about the cloud is that it abstracts all kinds of details away for them and provides a cohesive system to manage it. AWS, Azure and Google are actually selling something like what you are talking about now [1], where for whatever legal/legacy/performance reason you need it on-prem but still want to pay AWS 5x the cost just to give you the same management interface, and they have some kind of pod they slap into your data-center.

What does it tell you that there is a market for this, where essentially what you are buying from them is a management and control plane, when other companies like BMC have been selling that as a standalone product for decades (and for the most part failing to live up to their customer's actual expectations)?

[1] https://www.bizety.com/2020/06/28/aws-outposts-google-anthos...

edit: I actually think a big pull of the cloud is also about shutting down archaic internal IT organizations that have been slowing people down so that it takes weeks and weeks to launch a simple new webservice. Better to give your programmers a cloud account and let them get shit done.


That's still running their stuff in your data center.

What it tells me is that someone new needs to step in.


Indeed, if you only deploy resources in us-east-1, or any other single region, you're risking the occasional downtime.

I'd wager that will still give you more uptime than a physically-hosted solution for the same cost.


Honestly, I have an app in production that isn't completely hardened against single zone outages. There was pressure to turn off some redundancy in our caching infra, and not every backend service we call is free of tenant affinity, so we could well lose at least 1/3rd of our customers in a single AZ failure in the wrong region, or have huge latency issues for all of our tenants based on high cache miss rates.

Having written this, I'm going to ping our SME on the cache replication and remind him that since the last time he benchmarked it, we've upgraded to a newer generation of EC2 instances that has lower latency, and could he please run those numbers again.


As I discovered many years ago when our infra was only in US-EAST-1, failures were also easier to explain since many, many other companies would be offline as well. It made it more of an "Internet problem" than our own company's problem. For whatever reason, customers were far more likely to accept those kinds of outages.


You could have just shown them historical data of both companies being unavailable for extended amounts of time. What happened in the past few months is not new.


"just", as if you never had to argument against aws fanboys...


As a pragmatic AWS fan, +1 this. Disposable distributed hybrid multi-cloud architecture FTW.


It reminds me of the old adage: "Two is one, one is none. Have a backup. Always."


AWS or Google or any other reputable cloud provider is still a far better option than your local backup. The only way I see you losing your data is your account getting locked.


You’re not wrong, but there are ways to do backups properly in AWS, and I’m not aware of there ever being an incident where AWS has lost data.

It’s not a bad idea to store backups offline, but costs might make that an expensive proposition.


S3 isn't perfect. Read the fine print.

I've had buckets and objects disappear into the ether.

It is exceedingly rare, but it's not impossible.

Offline/alt-cloud backups are probably a lot cheaper than you think, and will win you points during any audit.
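
A rough sketch of the alt-cloud half of that (bucket names, the endpoint URL, and the "altcloud" credential profile are placeholders): mirror objects to a second, S3-compatible provider under separate credentials, so one account or provider problem doesn't take both copies with it.

    # Rough sketch: mirror an S3 bucket to a second, S3-compatible provider.
    # Bucket names, the endpoint URL, and the "altcloud" profile are placeholders;
    # adjust for whichever provider you actually use.
    import boto3

    SRC_BUCKET = "prod-backups"
    DST_BUCKET = "prod-backups-mirror"

    src = boto3.client("s3")  # primary AWS credentials from the default profile
    dst = boto3.Session(profile_name="altcloud").client(
        "s3", endpoint_url="https://s3.alt-provider.example"  # hypothetical endpoint
    )

    paginator = src.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SRC_BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            body = src.get_object(Bucket=SRC_BUCKET, Key=key)["Body"]
            # Stream each object across; fine for modest backup sets,
            # use multipart uploads for very large objects.
            dst.upload_fileobj(body, DST_BUCKET, key)
            print(f"mirrored {key}")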


> Offline/alt-cloud backups are probably a lot cheaper than you think, and will win you points during any audit.

With the caveat that you're going to have to implement all your access controls, monitoring and compliance mechanisms on those alternate backups. No point winning points during an audit for having backups outside AWS if you lose even more points for "backups weren't properly secured against unauthorized access".

And you're regularly restoring from those alternate backups as well to check their integrity, right?


Well, obviously, it goes without saying.

But none of that changes the fact that you shouldn't put all your eggs in one basket.


Send your former colleagues a group email asking how it is.


Seems like a multi-cloud solution might be the way to go.


I doubt it. The complexity of multi-cloud will also give you downtime.

Most of the folks impacted by cloud outages do not have highly available systems in place. Perhaps, for their business, the cost doesn't justify the outcome.

If you need high uptime for instances, build your system to be highly available and leverage the fault domain constructs your provider offers (placement groups, availability zones, regions, load balancing, DNS routing, autoscaling groups, service discovery, etc.). For instances, double down and use spot instances and maximum lifetimes in your groups so that you're continuously validating that your application can recover from instance interruptions.

If you're heavy on applications that leverage cloud APIs, as is often the case with lambdas, then strongly consider multi-region active/active, as API outages tend to cross AZs and impact the entire region.
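
As a rough illustration of the instance-level half of that advice (the launch template ID, subnet IDs, and sizing below are invented): one Auto Scaling group spread across three AZs, mixing spot with on-demand and capping instance lifetime so replacement is exercised continuously.

    # Rough boto3 sketch of the instance-level advice above: spread one ASG across
    # several AZs, mix spot with on-demand, and cap instance lifetime so the app
    # is regularly forced to survive replacement. IDs and sizes are invented.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-asg",
        MinSize=3,
        MaxSize=12,
        DesiredCapacity=6,
        # Subnets in three different AZs (hypothetical IDs).
        VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
        # Recycle instances at least weekly so recovery paths stay exercised.
        MaxInstanceLifetime=604800,
        MixedInstancesPolicy={
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": "lt-0123456789abcdef0",  # hypothetical
                    "Version": "$Latest",
                },
                "Overrides": [
                    {"InstanceType": "m6i.large"},
                    {"InstanceType": "m5.large"},
                ],
            },
            "InstancesDistribution": {
                "OnDemandBaseCapacity": 2,
                "OnDemandPercentageAboveBaseCapacity": 25,
                "SpotAllocationStrategy": "capacity-optimized",
            },
        },
    )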


Agreed, it is hard for the reasons you specified.

To do it, I would first avoid any cloud features that cannot easily be set up in another cloud. So no lambdas. Just k8s clusters, and maybe DBs if they can be set up to back up between clouds. I was able to migrate from AWS k8s to DO k8s very easily... just pointed my k8s configs at the new cluster (plus configuring the DO load balancers).

In my case, I still need the dynamic DNS (haven't looked into it yet), auto-scaling is already set up with k8s, and the DB backups between DBs are the next project.


All while making sure that these cloud solutions are not inter-dependent and that there are redundant paths to access these services.


Have you contacted them to see how things are going?

Maybe a cheery note asking how the team is doing, sent right in the middle of an outage.

Passive aggressive? As hell. Cathartic? Damn skippy.


Backup to rsync.net


Did u file a complaint on the use of swear words?


You’re still in the wrong; don’t be so smug. These few downtimes are no big deal in the grand scheme of things, and your proposed solution would have been more work and headaches for little to no realizable gain, not to mention the cybersecurity ramifications. Quite frankly, they are probably glad that you’re gone and not around to gloat about every trivial bit of downtime.


They're not gloating and also not smug. There's not even a 'hehe' in the post.


At least HN works.


Switch to Azure


How much more frequent do these outages need to become before they start triggering SLA limits?


Damn you all eggs in one basket.



