> Goddamn AWS seems to be down again 4th time this month alone
I feel obligated to point out that this is a very editorialized title, which is against site guidelines. Not... that I don't sympathize, just that it's perhaps a bit out of line even for providing context.
It may be editorialized but it's also exactly the words I would expect to come out of the mouths of some of my most experienced, senior, and clueful colleagues when asked to describe the present reliability of AWS US-EAST-1
I think it would have been fine if it had Tell HN in front of it. Without it the title would be against the rules and guidelines. One could argue the word Goddamn is provocative. Which could be appropriate if it was really down. But it is not. I think @dang needs to edit the title.
I'm honestly sick of seeing these posts. Every single post I've seen on HN, I have had no down time in EC2, S3, Workspaces, FSx for Windows, Directory Service, Console, etc.
Over 200 services, 86 availability zones and 26 regions. We might as well post every 5 minutes if we post every time something on AWS goes down. And yes, AWS is more dependent on us-east-1, but one of the outage posts was for a minor outage in us-west-1.
> Every single post I've seen on HN, I have had no down time
and I'm averaging six nines uptime over the past year on a $5 KVM virtual machine hosted at a sketchy hosting company, but that doesn't mean something catastrophic can't or won't happen unexpectedly
My own external-to-the-VM monitoring tools that poll whether it answers pings and ssh, plus its own reported system uptime.
Note that the six nines counts only unplanned downtime; actual availability has been lower than that because I reboot it for a newer kernel and Debian stable system updates maybe every 4-5 months.
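If you're curious what that kind of polling looks like, here's a minimal sketch of an external availability checker; the hostname, interval, and timeouts below are placeholders, and it assumes a Linux box with iputils ping on the monitoring side:

    #!/usr/bin/env python3
    # Minimal external availability poller (a sketch, not the commenter's actual tooling).
    # "vm.example.com", the 60s interval, and the 5s timeouts are all placeholder choices.
    import socket, subprocess, time
    from datetime import datetime, timezone

    HOST = "vm.example.com"
    INTERVAL = 60  # seconds between probe rounds

    def ping_ok(host: str) -> bool:
        # One ICMP echo request with a 5-second deadline (Linux iputils flags).
        return subprocess.run(
            ["ping", "-c", "1", "-W", "5", host],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode == 0

    def ssh_ok(host: str, port: int = 22) -> bool:
        # A plain TCP connect to port 22 is enough to see that sshd is answering.
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            return False

    while True:
        stamp = datetime.now(timezone.utc).isoformat()
        print(f"{stamp} ping={ping_ok(HOST)} ssh={ssh_ok(HOST)}", flush=True)
        time.sleep(INTERVAL)

Run from somewhere outside the VM's network, the resulting log lines give you roughly the ping/ssh availability history described above.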
> one of the outage posts was for a minor outage in us-west-1.
It was both us-west-1 and us-west-2, for 45 minutes or so. Was interesting for Auth0 customers, as they use us-west-2 as their primary, and us-west-1 is the failover location.
Well, there really shouldn't be a service -- even across all regions and products -- going down every five minutes. Especially because it would mean a service that needed to be in multiple regions around the world would almost always be having some type of issue.
Hmm, am I wrong in this? I admittedly don't work on things at worldwide or nationwide scale, so if my statement isn't accurate I'd like to learn from it. Thanks for any additional insight & explanation.
You aren’t necessarily wrong, but there’s a difference between a service at AWS scale experiencing frequent problems (inevitable if you think about it) and what actual impact those problems have. Overall the vast majority of AWS users don’t experience frequent problems.
By analogy cars crash every few minutes in my city (Las Vegas), but I can’t conclude that indicates some inherent (or easily fixable) problem with the road infrastructure or the design of cars, or the skill of drivers. The frequent crashes rarely affect me.
Thanks, that makes sense. I always like to know why I'm wrong, but asking for that explanation can be a fine line to walk without looking like you're complaining about downvotes. I couldn't care less about the votes, but I do want to learn. Much appreciated.
No, but headlines are supposed to be neutral and not flame/click bait. Even actual headlines that go in that direction are frequently modified. Maybe a fix from @dang?
> Please don't do things to make titles stand out, like using uppercase or exclamation points, or saying how great an article is. It's implicit in submitting something that you think it's important.
> [several other points, which don't apply here]
> Otherwise please use the original title, unless it is misleading or linkbait; don't editorialize.
And while I am sympathetic to adding meaningful information ("4th time this month" is technically against the guidelines but I wouldn't object), the rest adds nothing other than telling us how freakynit feels about it, which bluntly isn't useful.
Directors and/or VPs of the various service organizations. It’s not a matter of it being a technical problem. It’s a business problem. The reported status has business consequences.
There is no AWS outage, just routing issues with shit Asian ISPs affecting connectivity to many providers. Amazon has nothing whatsoever to do with this.
Imagine Comcast or Verizon fucking up their configs, you might not be able to access AWS stuff but that doesn’t mean that AWS is down.
But hey, the real story wouldn't make the front page, so let's just stick to the fabricated narrative.
> 10:41 AM PST Between 8:59 AM and 9:32 AM PST and between 9:40 AM and 10:16 AM PST we observed Internet connectivity issues with a network provider outside of our network in the AP-SOUTH-1 Region. This impacted Internet connectivity from some customer networks to the AP-SOUTH-1 Region. Connectivity between EC2 instances and other AWS services within the Region was not impacted by this event. The issue has been resolved and we continue to work with the external provider to ensure it does not reoccur.
These threads always remind me of Umberto Eco's essay Sports Chatter. He describes the degrees of participation in sport:
1. Playing the game yourself.
2. Watching other people play.
3. Talking about people playing the game.
4. Listening to someone else talk about people playing the game (i.e. sports commentators).
5. Talking about what sports commentators have to say.
In sports and in threads like this we're mostly at 3, 4, or 5. Even people who work at AWS posting here may only be "watching," i.e. working with second-hand information. While it's entertaining to nerds to speculate about root causes and possible solutions, and the dire predicted fallout from problems (especially when they happen to companies they dislike), it's just chatter and doesn't actually address any real issue, or inform any decisions that mean anything.
"My standard Service Level Agreement for on-call is an average incident response time of < 24 hours (including weekends). Leave requirements may see periods of elevated response times."
In other words, I'll start to work on your problem at 0800 local time the next work day.
Yes, they'll understand if the impact is worth paying someone(s) for 24/7/365 support, and so far they've decided the answer is no. In my company's industry, it probably is correct - our customers don't work Christmas Eve/Day either. At my previous company they did - and so oncall and its compensation was specified in the contract and there was a formal rota.
It's funny, SREs and SDEs generally get paid around the same in most places. I know some companies compensate oncall time differently, but a lot don't.
SREs work a loooot more hours than SDEs, and I feel like that's partly on SREs for just accepting that workload.
In my current position, if I get woken up at night even once, I take that as 8 hours worked and take a whole day off. The other side of offering "unlimited time off" :). This puts a real cost on technical debt, and forces us to actually address it rather than letting human capital absorb it.
Curiously this made my transition from SRE to SDE somewhat interesting. After I transitioned I kept working the same hours I would as an SRE and would be surprised at the lack of output from my coworkers. It wasn't until later that I realized it was simply the difference in hours worked.
As an SRE, your only option for climbing out of the mess is to work harder/smarter/extra, and you are always lowest on the totem pole when messes happen. You either learn to keep up with the crazy schedules, or get the hell out of dodge.
I actually transitioned from SDE to SRE, and you can imagine how awful it was for me doing the opposite.
SREs are the hardest working engineers I know, but unfortunately they're not given a lot of autonomy to fix problems. Or one SRE has strong opinions about stuff and blocks fixing things.
My conclusion has been that everyone needs to be devops. SDE needs to also own the ops of the thing they've developed.
> My conclusion has been that everyone needs to be devops
I used to think this way.
But time and time again I found that organizations treated Devops and SRE as either "developer help desk" or "hiring some person to do terraform stuff and calling them the 'DevOps Team'" or "we literally failed to hire actual technical specialists for the role we needed, so we're just going to dump these tasks on the Devops 'team'"
I no longer think the way I used to about Devops, I also no longer have any real interest in being part of 'Devops transformation' efforts. Go figure.
> My conclusion has been that everyone needs to be devops
DevOps wasn't conceived as a new and additional role or org, but as a model, and a model whose main point was eliminating inefficiencies resulting from role/org/knowledge/responsibility separations between dev and ops.
Creating a new role, often in its own sub organization, as a distinct knowledge and responsibility silo for “DevOps” is like management instituting a rigid top-down development model without dev team ownership and control or customer participation and calling it “Agile”:
Completely missing the point and also exactly how most places do it.
Also, quite literally, people (both management and other engineers) often don't fully think through the consequences of systems that have neither enough automatic failover nor enough staffing by non-distracted humans, and build them assuming it will be fine. It's probably better to suffer an unexpected outage now than in the future when your company is bigger.
I apologize for the off-topic question, but are you British? I have only seen this word in British literature and now I'm wondering if Americans or Canadians or Australians ever use it.
I guess they use something like "duty" or "shift plan".
Rota is indeed very British. It clearly comes from the Latin word for "wheel", but in Italy it is not used (nor in France, afaik) - we have the Sacra Rota (which is the Catholic ecclesiastical tribunal, typically invoked for marriage dissolutions) and the word rotazione which is "rotation" and can indeed be used to indicate shift planning. Now I wonder if "rota" in that context comes as a shortening of rotation...
Man, there's always someone ready to suck up to the boss, even when it's not their own. I really struggle to understand what this moralizing is meant to achieve.
They can go and respond if something is that important.
Oh, wait, what's that? They don't know how to? Interesting, they seem to be getting paid a shit ton more than the engineers keeping everything running.
By and large, managers are only paid 10% extra. That 10% is applied at the band level, so it's incredibly common for the engineers working for a manager to make more than they do.
I'm paid to keep the lights on as well as build new lights and light features. Maybe you aren't, but if you're on an on-call rotation my guess is you probably are and are expected to respond to incidents if they arise regardless of what day it is.
Please, antiwork is too recent for my tastes, I've been shitting on my bosses for a decade now. I've been on call too. And nothing is worth dropping the one time your family might actually be together during the year to fix a damn server going down. My bosses make a dollar and give me a cent on that while expecting me to be on call? Fuck them.
It takes years to build trust with your customers, and very little to disrupt it. I think these recent events will have people rethink their cloud strategy. This could be a good opportunity for Google to take on Amazon.
I'm not usually the person to complain about UI performance, but the Google Cloud admin is the slowest trash I've had to use in a while; even Azure is better. So here's me hoping it doesn't pick up steam.
The company I work for switched from Azure to GCP at the start of the year and personally I don't agree with this statement. Azure was a royal pain in the ass compared to GCP, which I find to be rather smooth and responsive.
Maybe you're going off of past experiences and it's since gotten better?
It has been a while since I used the GCP admin, but I am intimately familiar with how awful the Azure UI is... is it really better than GCP? I have never liked how Azure seems to be designed as if you're in a virtual machine or something. The horizontal scrolling is a mess.
All you have to do is go back to the outage/event timeline for GCP to know exactly how ill-informed of a decision that would be. (Not to mention all the other product offering differences)
I would be curious to hear what any current or former AWS employees think the internal consequences become for this.
From the outside, it feels like they are going to have to do something different to get some customer confidence back. Some sort of "mea culpa" with an explanation of what they are going to change.
Every place I've worked has celebrated fixing things that have broken, and forces engineers to fight for prevention. It's not exactly celebrating things breaking, but it sure looks a lot like it.
That's only true if lessons are learned. If an outage happens and you do nothing to fix it (or worse, do something entirely too specific without fixing the real problem and thinking you've done it) the exact same problem will happen again and you'll get another opportunity to not learn the lesson.
Can't speak for other teams, but the dynamics in mine are pretty great. We don't overwork and leadership puts explicit emphasis on work/life quality.
From what I've heard in other teams (including some that own the big services you're familiar with), it can be a shitshow. Oncalls get paged 10*N times a week (typically 24/7 2-week rotation with two active oncalls, but depends), teams are leaking talent, desperately trying to hire good engineers while keeping the lights on.
Apologies are not actionable. Service providers promise some number of nines of uptime, not zero problems. It should be obvious that perfection is not achievable, and that at scale small problems can get magnified.
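For a back-of-the-envelope sense of what "some number of nines" actually allows, here's a quick sketch (assuming a plain 365-day year, which is an approximation):

    # Rough downtime budget implied by common SLA "nines", per 365-day year.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    for nines in range(2, 7):
        availability = 1 - 10 ** -nines              # e.g. 3 nines -> 0.999
        downtime_minutes = SECONDS_PER_YEAR * (1 - availability) / 60
        print(f"{nines} nines ({availability:.6f}): ~{downtime_minutes:.1f} min of downtime/year")

That works out to roughly 8.8 hours a year at three nines, down to about half a minute a year at six nines; even a very aggressive SLA leaves room for some visible incidents.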
Customer confidence is only an issue if there are alternative providers that don’t have problems. That isn’t the case with cloud hosting.
I'm unclear on what you're responding to. I wasn't saying an apology was the right remedy, nor did I suggest they should have zero issues.
Customer confidence is an issue because they've had 4 notable issues in a very short timeframe.
Edit: I think you're reading something into my comments that's more than what I've said. I am curious what the AWS response to 4 major incidents in a month's time will be. That's it. At least in my circles, it is an issue for customers. I respect that it appears not to be in your circles.
Ahh. By "response", I did not mean the incident summary. I meant the overarching company response, if any. Like policies around change control, capacity additions, and so on.
The issues affected only a fraction of customers. Long threads on HN speculating about the causes and armchair quarterbacking have little to do with customer confidence.
All of the big cloud providers have problems, so unless there's a service with similar offerings and reach with demonstrably better reliability customer confidence is not going to matter -- people will use the least bad of the available choices.
I have multiple customers hosting on AWS, and a couple on GCP, and one on Azure. Azure by far has the most problems in my limited experience. I have servers on AWS (US East Ohio and US West Oregon) with 600+ days of uptime. None of my customers except the one on Azure have ever contacted me about an outage, in two or three years. At the same time I can read long threads like this on HN full of doom and gloom and predictions about the decline of AWS. Which customers are we talking about? I know I'm not the customer for AWS -- I don't pay for hosting.
Having moved more than a dozen companies from self-hosting and co-located hosting to AWS (or GCP, Azure if that's what they want) I can say that they are 100% happier. Relatively infrequent outages that AWS has the resources and incentive to jump all over and fix are preferable to trying to get me or some other pissed off engineer to drive in and try to figure things out.
If poor quality, poor reliability, and frequent outages drove customers away, then Tesla would have withered up years ago. There are more factors in play here than what a small number of committed tech geeks (like me) think about how AWS could be doing things better.
The responses are basically corporate form letters: We're sorry, here's the (undecipherable) root cause, here's what we (wish/hope/pray for) will happen going forward. Those usually include some unverifiable numbers that downplay the severity, duration, and number of affected sites.
HN has lots of these apologies that seem insincere: Here's what happened, here's how we fixed it, here's what we will do to prevent this happening again. It's a PR exercise, not something that necessarily improves my confidence.
Any complex technology at AWS scale is going to have outages and mysterious problems and glitches, all the time. That's inherent to both technology and human organizations, and the HN crowd should understand that better than lay people. These threads mainly serve as launch points for endless armchair diagnostics and proposed solutions from people who have no skin in the game.
Complete speculation; but I wonder if how hard Amazon works their employees is the root cause of this. Layer on COVID as an additional stressor, give it a few years, and this could be the result. Curious to hear from those at Amazon: what's it been like during COVID?
They're losing engineers left and right, especially in H2 2021. It's probably all related. Their recruiters' emails are increasingly desperate. If I was considering a job at Amazon now, I would be very interested in a good answer for how they plan to address their attrition problem (and paying new hires well isn't a viable answer alone).
All they need to do to hire me is lay off the draconian IP assignment clauses or give me an explicit exception for side projects. That's what I tell the desperate recruiters, and then they go away until the next one comes along.
There was an article on The Register[0] about something especially nutty Amazon Games Services required of employees, then later backed off on. (Cheers to them for caving to pressure.) Apart from that I've got no good references. I believe AWS typically put the "usual" overbearing clause in their contracts, where any IP you create before or during your employment becomes theirs. Sadly, so do many other employers. My previous employer did, and it was a sticking point for me and one reason for leaving.
They have been getting desperate, much more so than my quarterly Facebook/Google email. I've also had multiple Amazon recruiters reaching out, whereas I usually have a single Facebook/Google one. But apparently they aren't desperate enough to allow remote work after Covid, so I keep reminding them I'm not interested in non-remote work.
Can you share names? Asking for a friend ... Although if they are small companies, I would be wary of anecdata... The narrative sounds good but small companies are more likely than large companies to exhibit extreme behavior in any metric, and it's likely that there's some confirmation bias in reporting what works / doesn't work
That's true, and unless you trust someone on the inside you can't rely on a company reporting accurately about themselves… plus the tide at a small company can turn in an instant, just look at Basecamp.
If we're going into pure speculation mode, we might also entertain the idea that the outages are staged: the temporary bad press and the SLA credits paid out are more than offset by nudging existing customers to scale out to multiple AWS regions to ride out these infrequent single-region downtime events.
For example, if you were hosting only in US West but now duplicate your infrastructure in the EU region to reduce downtime, then I'm pretty sure that if you were paying $6,000 a month before, you would now be paying roughly $12,000 a month (+/- any regional cost differences).
Realistically I don't believe this is the case but it's not impossible.
The rule of thumb with efficiency is that it makes things worse when something breaks. Applied to organizations: if you have an incredibly efficient org, with a higher plates-to-plate-spinner ratio than ever, then when a catastrophe hits, the org may not have enough people to get the plates spinning again as fast as a less efficient, larger crew with more reserve would.
AWS was always stressful. Great people for the most part. Culture had strong pros and strong cons. Same for treatment of employees; mixed bag.
During COVID everyone who didn’t need to be in the office worked from home. That was sometimes more stressful as you had lots more time in video meetings and no hallway interactions to make quick decisions on easy stuff.
For the problem reports today I am in wait and see mode. It’s not clear to me yet that AWS is the problem; could be a six-pack-and-a-backhoe kind of problem with a network provider.
Attrition has been super high during Covid and they raised pay a lot for new offers to try to attract talent. (Source: I got an offer for S3 in mid-Nov; this is what the hiring manager told me.)
…or there are broader Internet issues that are causing problems for people. Do we actually know AWS is having issues, or are people just speculating?
Not seeing any issues here, but am seeing people reporting broader internet issues at the moment. Post title seems a bit quick on the trigger to point blame.
Why isn't there any investigative journalism looking into wtf is happening at AWS? What's the point of all these tech journalists writing all those puff pieces for access if they're not going to use it to pierce the NDA shield at times like this?
Amazon has been under the spotlight for labor practices, safety, wages. I’m sure the journalists can still order stuff from Amazon.
As a programmer I can probably guess how AWS experiences problems: programming and networking are hard problems, they get exponentially harder at scale, and there’s no known way to prevent every problem.
If AWS or some journalists gave us details we would just see 200 posts about how they could have done it better with Rust, nothing actually useful for anyone.
You can't figure these kinds of outages out from the outside, and Amazon aren't going to explain it to you beyond their official explanations. Plus - 'access' means doing what the subject tells you, certainly not doing anything investigative.
This is getting to be a massive joke. Any reputation AWS has for being robust has been wiped out over the past month or so. I feel bad for the various engineers at AWS and other companies who are having to work on their day off.
I've always heard about technical debt at AWS, with thousand line functions everyone is terrified to touch, demotivated employees doing whatever it takes to close tickets without dealing with the root causes, and so on.
I presume that this rash of downtime is just the rickety structure inevitably creaking and breaking. It's probably too late now to fix things — dealing with legacy code requires patience, discipline, and understanding which the management of Amazon doesn't have. They'll just yell at people louder, and hold people "accountable" by punishing anyone where anything breaks.
I worked there and can vouch for what you said. AWS is built like a house of cards. They just painted it orange and white to make it look cohesive. What's funny though is that the CSS isn't even sourced from one place. They tell front-end engineers to use a dropper tool and to just copy and paste style from other places. DRY is not in Amazon's vocabulary.
Gotta love their leadership principles:
> Frugality (we'll give you a shitty laptop, shitty chair, and a shitty desk.)
> Be right a lot (are you too stupid to forecast the future?!)
So this is the 4th time an uninteresting update like this has come to the front page... Why? What can be gathered from it? Get the AWS post-mortem on here; that would pique my interest. HN is not a status page. I don't instinctively check HN when my service is down, because why would I?
As a counterpoint, HN is one of the first places I check if a big service seems to be down, because it usually shows up here before their status pages.
I have been wondering if Charlie Bell leaving AWS has anything to do with these recent outages. His ops meetings were fucking brutal. I’m curious how they are since he left.
A lot of senior AWS employees have left. What the root causes of these outages are is a well-guarded secret. I work at Amazon and no one I know knows anything in detail about what went down and why.
Anyone deploying to production Christmas eve, or even the week of Christmas, should have to get direct approval from the CTO of the company to do so. Just a terrible, terrible idea.
According to ThousandEyes, it looks like it was AS16509 that was affected. The outage lasted for 28 min; however, it didn't seem to affect many applications for too long.
I think it's because there was a backup server that kicked in for AS16509; however, it also went down but only for 8 min.
According to ThousandEyes, the bigger outage though was TATA Communications (AS6453), which was down or at least having issues of some sort for 1 hr 11 min.
I don't know about "4th time this month alone" which seems a bit editorialised, but our company is off AWS for now, and most likely for good.
We're not in the cloud or tech business per se, and as such our customers are not really understanding of technical issues which unfortunately means they are blaming us, and our own reputation is on the line because of AWS' shortcomings.
We did consider Google but for now OVH is the only major provider which is both reliable and secure (w.r.t. court-and-gag letters from government and intelligence agencies) as far as we're concerned. We still use Hetzner and Scaleway for some older stuff and also because of Scaleway's low prices, but it's likely we're moving everything to OVH in the future.
Wonder how much of this is caused by engineers rushing to fix log4j issues at AWS so they can enjoy the holidays? Their stack is heavily Java-based, isn't it?
But you know it wouldn't if it had to rely on your home connection AND Amazon. It's like all those major websites that rely on 10 domains to serve one webpage...