> Goddamn AWS seems to be down again 4th time this month alone
I feel obligated to point out that this is a very editorialized title, which is against site guidelines. Not... that I don't sympathize, just that it's perhaps a bit out of line even for providing context.
It may be editorialized but it's also exactly the words I would expect to come out of the mouths of some of my most experienced, senior, and clueful colleagues when asked to describe the present reliability of AWS US-EAST-1
I think it would have been fine if it had Tell HN in front of it. Without it the title would be against the rules and guidelines. One could argue the word Goddamn is provocative. Which could be appropriate if it was really down. But it is not. I think @dang needs to edit the title.
I'm honestly sick of seeing these posts. Every single post I've seen on HN, I have had no down time in EC2, S3, Workspaces, FSx for Windows, Directory Service, Console, etc.
Over 200 services, 86 availability zones and 26 regions. We might as well post every 5 minutes if we post every time something on AWS goes down. And yes, AWS is more dependent on us-east-1, but one of the outage posts was for a minor outage in us-west-1.
> Every single post I've seen on HN, I have had no down time
and I'm averaging six nines uptime over the past year on a $5 KVM virtual machine hosted at a sketchy hosting company, but that doesn't mean something catastrophic can't or won't happen unexpectedly
My own external-to-the-VM monitoring tools that poll whether it answers pings and ssh, plus its own reported system uptime.
Note that the six nines counts only unplanned downtime; actual availability has been lower than that because I reboot it for a newer kernel and Debian stable system updates maybe every 4-5 months.
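If you're curious what that kind of polling looks like, here's a minimal sketch of an external availability checker; the hostname, interval, and timeouts below are placeholders, and it assumes a Linux box with iputils ping on the monitoring side:

    #!/usr/bin/env python3
    # Minimal external availability poller (a sketch, not the commenter's actual tooling).
    # "vm.example.com", the 60s interval, and the 5s timeouts are all placeholder choices.
    import socket, subprocess, time
    from datetime import datetime, timezone

    HOST = "vm.example.com"
    INTERVAL = 60  # seconds between probe rounds

    def ping_ok(host: str) -> bool:
        # One ICMP echo request with a 5-second deadline (Linux iputils flags).
        return subprocess.run(
            ["ping", "-c", "1", "-W", "5", host],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode == 0

    def ssh_ok(host: str, port: int = 22) -> bool:
        # A plain TCP connect to port 22 is enough to see that sshd is answering.
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            return False

    while True:
        stamp = datetime.now(timezone.utc).isoformat()
        print(f"{stamp} ping={ping_ok(HOST)} ssh={ssh_ok(HOST)}", flush=True)
        time.sleep(INTERVAL)

Run from somewhere outside the VM's network, the resulting log lines give you roughly the ping/ssh availability history described above.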
> one of the outage posts was for a minor outage in us-west-1.
It was both us-west-1 and us-west-2, for 45 minutes or so. Was interesting for Auth0 customers, as they use us-west-2 as their primary, and us-west-1 is the failover location.
Well, there really shouldn't be a service -- even across all regions and products -- going down every five minutes. Especially because it would mean a service that needed to be in multiple regions around the world would almost always be having some type of issue.
Hmm, am I wrong in this? I admittedly don't work on things at worldwide or nationwide scale, so if my statement isn't accurate I'd like to learn from it. Thanks for any additional insight & explanation.
You aren’t necessarily wrong, but there’s a difference between a service at AWS scale experiencing frequent problems (inevitable if you think about it) and what actual impact those problems have. Overall the vast majority of AWS users don’t experience frequent problems.
By analogy cars crash every few minutes in my city (Las Vegas), but I can’t conclude that indicates some inherent (or easily fixable) problem with the road infrastructure or the design of cars, or the skill of drivers. The frequent crashes rarely affect me.
Thanks, that makes sense. I always like to know why I'm wrong, but asking for that explanation can be a fine line to walk without looking like you're complaining about downvotes. I couldn't care less about the votes, but I do want to learn. Much appreciated.
No, but headlines are supposed to be neutral and not flame/click bait. Even actual headlines that go in that direction are frequently modified. Maybe a fix from @dang?
> Please don't do things to make titles stand out, like using uppercase or exclamation points, or saying how great an article is. It's implicit in submitting something that you think it's important.
> [several other points, which don't apply here]
> Otherwise please use the original title, unless it is misleading or linkbait; don't editorialize.
And while I am sympathetic to adding meaningful information ("4th time this month" is technically against the guidelines but I wouldn't object), the rest adds nothing other than telling us how freakynit feels about it, which bluntly isn't useful.
Directors and/or VPs of the various service organizations. It’s not a matter of it being a technical problem. It’s a business problem. The reported status has business consequences.
There is no AWS outage, just routing issues with shit Asian ISPs affecting connectivity to many providers. Amazon has nothing whatsoever to do with this.
Imagine Comcast or Verizon fucking up their configs, you might not be able to access AWS stuff but that doesn’t mean that AWS is down.
But hey, the real story wouldn't make the front page, so let's just stick to the fabricated narrative.
> 10:41 AM PST Between 8:59 AM and 9:32 AM PST and between 9:40 AM and 10:16 AM PST we observed Internet connectivity issues with a network provider outside of our network in the AP-SOUTH-1 Region. This impacted Internet connectivity from some customer networks to the AP-SOUTH-1 Region. Connectivity between EC2 instances and other AWS services within the Region was not impacted by this event. The issue has been resolved and we continue to work with the external provider to ensure it does not reoccur.
These threads always remind me of Umberto Eco's essay Sports Chatter. He describes the degrees of participation in sport:
1. Playing the game yourself.
2. Watching other people play.
3. Talking about people playing the game.
4. Listening to someone else talk about people playing the game (i.e. sports commentators).
5. Talking about what sports commentators have to say.
In sports and in threads like this we're mostly at 3, 4, or 5. Even people who work at AWS posting here may only be "watching," i.e. working with second-hand information. While it's entertaining to nerds to speculate about root causes and possible solutions, and the dire predicted fallout from problems (especially when they happen to companies they dislike), it's just chatter and doesn't actually address any real issue, or inform any decisions that mean anything.
"My standard Service Level Agreement for on-call is an average incident response time of < 24 hours (including weekends). Leave requirements may see periods of elevated response times."
In other words, I'll start to work on your problem at 0800 local time the next work day.
Yes, they'll understand if the impact is worth paying someone(s) for 24/7/365 support, and so far they've decided the answer is no. In my company's industry, it probably is correct - our customers don't work Christmas Eve/Day either. At my previous company they did - and so oncall and its compensation was specified in the contract and there was a formal rota.
It's funny, SREs and SDEs generally get paid around the same in most places. I know some companies compensate oncall time differently, but a lot don't.
SREs work a loooot more hours than SDEs, and I feel like that's partly on SREs for just accepting that workload.
In my current position, if I get woken up at night even once, I take that as 8 hours worked and take a whole day off. The other side of offering "unlimited time off" :). This puts a real cost on technical debt, and forces us to actually address it rather than letting human capital absorb it.
Curiously this made my transition from SRE to SDE somewhat interesting. After I transitioned I kept working the same hours I would as an SRE and would be surprised at the lack of output from my coworkers. It wasn't until later that I realized it was simply the difference in hours worked.
As an SRE, your only option for climbing out of the mess is to work harder/smarter/extra, and you are always lowest on the totem pole when messes happen. You either learn to keep up with the crazy schedules, or get the hell out of dodge.
I actually transitioned from SDE to SRE, and you can imagine how awful it was for me doing the opposite.
SREs are the hardest working engineers I know, but unfortunately they're not given a lot of autonomy to fix problems. Or one SRE has strong opinions about stuff and blocks fixing things.
My conclusion has been that everyone needs to be devops. SDE needs to also own the ops of the thing they've developed.
> My conclusion has been that everyone needs to be devops
I used to think this way.
But time and time again I found that organizations treated Devops and SRE as either "developer help desk" or "hiring some person to do terraform stuff and calling them the 'DevOps Team'" or "we literally failed to hire actual technical specialists for the role we needed, so we're just going to dump these tasks on the Devops 'team'"
I no longer think the way I used to about Devops, I also no longer have any real interest in being part of 'Devops transformation' efforts. Go figure.
> My conclusion has been that everyone needs to be devops
DevOps wasn't conceived as a new and additional role or org, but as a model, and a model whose main point was eliminating inefficiencies resulting from role/org/knowledge/responsibility separations between dev and ops.
Creating a new role, often in its own sub organization, as a distinct knowledge and responsibility silo for “DevOps” is like management instituting a rigid top-down development model without dev team ownership and control or customer participation and calling it “Agile”:
Completely missing the point and also exactly how most places do it.
Also, quite literally, people (both management and other engineers) often don't fully think through the consequences of systems that have neither enough automatic failover nor enough staffing by non-distracted humans, and build them assuming it will be fine. It's probably better to suffer an unexpected outage now than in the future when your company is bigger.
I apologize for the off-topic question, but are you British? I have only seen this word in British literature and now I'm wondering if Americans or Canadians or Australians ever use it.
I guess they use something like "duty" or "shift plan".
Rota is indeed very British. It clearly comes from the Latin word for "wheel", but in Italy it is not used (nor in France, afaik) - we have the Sacra Rota (which is the Catholic ecclesiastical tribunal, typically invoked for marriage dissolutions) and the word rotazione which is "rotation" and can indeed be used to indicate shift planning. Now I wonder if "rota" in that context comes as a shortening of rotation...
Man, there's always someone ready to suck up to the boss, even when it's not their own. I really struggle to understand what this moralizing is meant to achieve.
They can go and respond if something is that important.
Oh, wait, what's that? They don't know how to? Interesting, they seem to be getting paid a shit ton more than the engineers keeping everything running.
By and large, managers are only paid 10% extra. That 10% is applied at the band level, so it's incredibly common for the engineers working for a manager to make more than they do.
I'm paid to keep the lights on as well as build new lights and light features. Maybe you aren't, but if you're on an on-call rotation my guess is you probably are and are expected to respond to incidents if they arise regardless of what day it is.
Please, antiwork is too recent for my tastes, I've been shitting on my bosses for a decade now. I've been on call too. And nothing is worth dropping the one time your family might actually be together during the year to fix a damn server going down. My bosses make a dollar and give me a cent on that while expecting me to be on call? Fuck them.
It takes years to build trust with your customers, and very little to disrupt it. I think these recent events will have people rethink their cloud strategy. This could be a good opportunity for Google to take on Amazon.
I'm not usually the person to complain about UI performance, but the Google Cloud admin is the slowest trash I've had to use in a while; even Azure is better. So here's me hoping it doesn't pick up steam.
The company I work for switched from Azure to GCP at the start of the year and personally I don't agree with this statement. Azure was a royal pain in the ass compared to GCP, which I find to be rather smooth and responsive.
Maybe you're going off of past experiences and it's since gotten better?
It has been a while since I used the GCP admin, but I am intimately familiar with how awful the Azure UI is... is it really better than GCP? I have never liked how Azure seems to be designed as if you're in a virtual machine or something. The horizontal scrolling is a mess.
All you have to do is go back to the outage/event timeline for GCP to know exactly how ill-informed of a decision that would be. (Not to mention all the other product offering differences)
I would be curious to hear what any current or former AWS employees think the internal consequences become for this.
From the outside, it feels like they are going to have to do something different to get some customer confidence back. Some sort of "mea culpa" with an explanation of what they are going to change.
Every place I've worked has celebrated fixing things that have broken, and forces engineers to fight for prevention. It's not exactly celebrating things breaking, but it sure looks a lot like it.
That's only true if lessons are learned. If an outage happens and you do nothing to fix it (or worse, do something entirely too specific without fixing the real problem and thinking you've done it) the exact same problem will happen again and you'll get another opportunity to not learn the lesson.
Can't speak for other teams, but the dynamics in mine are pretty great. We don't overwork and leadership puts explicit emphasis on work/life quality.
From what I've heard in other teams (including some that own the big services you're familiar with), it can be a shitshow. Oncalls get paged 10*N times a week (typically 24/7 2-week rotation with two active oncalls, but depends), teams are leaking talent, desperately trying to hire good engineers while keeping the lights on.
Apologies are not actionable. Service providers promise some number of nines of uptime, not zero problems. It should be obvious that perfection is not achievable, and that at scale small problems can get magnified.
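For a back-of-the-envelope sense of what "some number of nines" actually allows, here's a quick sketch (assuming a plain 365-day year, which is an approximation):

    # Rough downtime budget implied by common SLA "nines", per 365-day year.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    for nines in range(2, 7):
        availability = 1 - 10 ** -nines              # e.g. 3 nines -> 0.999
        downtime_minutes = SECONDS_PER_YEAR * (1 - availability) / 60
        print(f"{nines} nines ({availability:.6f}): ~{downtime_minutes:.1f} min of downtime/year")

That works out to roughly 8.8 hours a year at three nines, down to about half a minute a year at six nines; even a very aggressive SLA leaves room for some visible incidents.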
Customer confidence is only an issue if there are alternative providers that don’t have problems. That isn’t the case with cloud hosting.
I'm unclear on what you're responding to. I wasn't saying an apology was the right remedy, nor did I suggest they should have zero issues.
Customer confidence is an issue because they've had 4 notable issues in a very short timeframe.
Edit: I think you're reading something into my comments that's more than what I've said. I am curious what the AWS response to 4 major incidents in a month's time will be. That's it. At least in my circles, it is an issue for customers. I respect that it appears not to be in your circles.
Ahh. By "response", I did not mean the incident summary. I meant the overarching company response, if any. Like policies around change control, capacity additions, and so on.
The issues affected only a fraction of customers. Long threads on HN speculating about the causes and armchair quarterbacking have little to do with customer confidence.
All of the big cloud providers have problems, so unless there's a service with similar offerings and reach with demonstrably better reliability customer confidence is not going to matter -- people will use the least bad of the available choices.
I have multiple customers hosting on AWS, and a couple on GCP, and one on Azure. Azure by far has the most problems in my limited experience. I have servers on AWS (US East Ohio and US West Oregon) with 600+ days of uptime. None of my customers except the one on Azure have ever contacted me about an outage, in two or three years. At the same time I can read long threads like this on HN full of doom and gloom and predictions about the decline of AWS. Which customers are we talking about? I know I'm not the customer for AWS -- I don't pay for hosting.
Having moved more than a dozen companies from self-hosting and co-located hosting to AWS (or GCP, Azure if that's what they want) I can say that they are 100% happier. Relatively infrequent outages that AWS has the resources and incentive to jump all over and fix are preferable to trying to get me or some other pissed off engineer to drive in and try to figure things out.
If poor quality, poor reliability, and frequent outages drove customers away, then Tesla would have withered up years ago. There are more factors in play here than what a small number of committed tech geeks (like me) think about how AWS could be doing things better.
The responses are basically corporate form letters: We're sorry, here's the (undecipherable) root cause, here's what we (wish/hope/pray for) will happen going forward. Those usually include some unverifiable numbers that downplay the severity, duration, and number of affected sites.
HN has lots of these apologies that seem insincere: Here's what happened, here's how we fixed it, here's what we will do to prevent this happening again. It's a PR exercise, not something that necessarily improves my confidence.
Any complex technology at AWS scale is going to have outages and mysterious problems and glitches, all the time. That's inherent to both technology and human organizations, and the HN crowd should understand that better than lay people. These threads mainly serve as launch points for endless armchair diagnostics and proposed solutions from people who have no skin in the game.
Complete speculation; but I wonder if how hard Amazon works their employees is the root cause of this. Layer on COVID as an additional stressor, give it a few years, and this could be the result. Curious to hear from those at Amazon: what's it been like during COVID?
They're losing engineers left and right, especially in H2 2021. It's probably all related. Their recruiters' emails are increasingly desperate. If I was considering a job at Amazon now, I would be very interested in a good answer for how they plan to address their attrition problem (and paying new hires well isn't a viable answer alone).
All they need to do to hire me is lay off the draconian IP assignment clauses or give me an explicit exception for side projects. That's what I tell the desperate recruiters, and then they go away until the next one comes along.
There was an article on The Register[0] about something especially nutty Amazon Games Services required of employees, then later backed off on. (Cheers to them for caving to pressure.) Apart from that I've got no good references. I believe AWS typically put the "usual" overbearing clause in their contracts, where any IP you create before or during your employment becomes theirs. Sadly, so do many other employers. My previous employer did, and it was a sticking point for me and one reason for leaving.
They have been getting desperate, much more so than my quarterly Facebook/Google email. I've also had multiple Amazon recruiters reaching out, whereas I usually have a single Facebook/Google one. But apparently they aren't desperate enough to allow remote work after Covid, so I keep reminding them I'm not interested in non-remote work.
Can you share names? Asking for a friend ... Although if they are small companies, I would be wary of anecdata... The narrative sounds good but small companies are more likely than large companies to exhibit extreme behavior in any metric, and it's likely that there's some confirmation bias in reporting what works / doesn't work
That's true, and unless you trust someone on the inside you can't rely on a company reporting accurately about themselves… plus the tide at a small company can turn in an instant, just look at Basecamp.
If we're going into pure speculation mode, we might also entertain the idea that the outages are staged: the temporary bad press and the SLA credits paid out are more than offset by nudging existing customers to scale out to multiple AWS regions to ride out these infrequent single-region downtime events.
For example, if you were hosting only in US West but now duplicate your infrastructure in the EU region to reduce downtime, then I'm pretty sure that if you were paying $6,000 a month before, you would now be paying roughly $12,000 a month (+/- any regional cost differences).
Realistically I don't believe this is the case but it's not impossible.
The rule of thumb with efficiency is that it makes things worse when something breaks. Applied to organizations: if you have an incredibly efficient org, with a higher plates-to-plate-spinner ratio than ever, then when a catastrophe hits, the org may not have enough people to get the plates spinning again as fast as a less efficient, larger crew with more reserve would.
AWS was always stressful. Great people for the most part. Culture had strong pros and strong cons. Same for treatment of employees; mixed bag.
During COVID everyone who didn’t need to be in the office worked from home. That was sometimes more stressful as you had lots more time in video meetings and no hallway interactions to make quick decisions on easy stuff.
For the problem reports today I am in wait and see mode. It’s not clear to me yet that AWS is the problem; could be a six-pack-and-a-backhoe kind of problem with a network provider.
Attrition has been super high during Covid and they raised pay a lot for new offers to try to attract talent. (Source: I got an offer for S3 in mid-Nov; this is what the hiring manager told me.)
…or there are broader Internet issues that are causing problems for people. Do we actually know AWS is having issues, or are people just speculating?
Not seeing any issues here, but am seeing people reporting broader internet issues at the moment. Post title seems a bit quick on the trigger to point blame.
Why isn't there any investigative journalism looking into wtf is happening at AWS? What's the point of all these tech journalists writing all those puff pieces for access if they're not going to use it to pierce the NDA shield at times like this?
Amazon has been under the spotlight for labor practices, safety, wages. I’m sure the journalists can still order stuff from Amazon.
As a programmer I can probably guess how AWS experiences problems: programming and networking are hard problems, they get exponentially harder at scale, and there’s no known way to prevent every problem.
If AWS or some journalists gave us details we would just see 200 posts about how they could have done it better with Rust, nothing actually useful for anyone.
You can't figure these kinds of outages out from the outside, and Amazon aren't going to explain it to you beyond their official explanations. Plus - 'access' means doing what the subject tells you, certainly not doing anything investigative.
This is getting to be a massive joke. Any reputation AWS has for being robust has been wiped out over the past month or so. I feel bad for the various engineers at AWS and other companies who are having to work on their day off.
I've always heard about technical debt at AWS, with thousand line functions everyone is terrified to touch, demotivated employees doing whatever it takes to close tickets without dealing with the root causes, and so on.
I presume that this rash of downtime is just the rickety structure inevitably creaking and breaking. It's probably too late now to fix things — dealing with legacy code requires patience, discipline, and understanding which the management of Amazon doesn't have. They'll just yell at people louder, and hold people "accountable" by punishing anyone where anything breaks.
I worked there and can vouch for what you said. AWS is built like a house of cards. They just painted it orange and white to make it look cohesive. What's funny though is that the CSS isn't even sourced from one place. They tell front-end engineers to use a dropper tool and to just copy and paste style from other places. DRY is not in Amazon's vocabulary.
Gotta love their leadership principles:
> Frugality (we'll give you a shitty laptop, shitty chair, and a shitty desk.)
> Be right a lot (are you too stupid to forecast the future?!)
So this is the 4th time an uninteresting update like this has come to the front page... Why? What can be gathered from it? Get the AWS post-mortem on here; that would pique my interest. HN is not a status page. I don't instinctively check HN when my service is down, because why would I?
As a counterpoint, HN is one of the first places I check if a big service seems to be down, because it usually shows up here before their status pages.
I have been wondering if Charlie Bell leaving AWS has anything to do with these recent outages. His ops meetings were fucking brutal. I’m curious how they are since he left.
A lot of senior AWS employees have left. What the root causes of these outages are is a well-guarded secret. I work at Amazon and no one I know knows anything in detail about what went down and why.
Anyone deploying to production Christmas eve, or even the week of Christmas, should have to get direct approval from the CTO of the company to do so. Just a terrible, terrible idea.
According to ThousandEyes, it looks like it was AS16509 that was affected. The outage lasted for 28 min; however, it didn't seem to affect many applications for too long.
I think it's because there was a backup server that kicked in for AS16509; however, it also went down but only for 8 min.
According to ThousandEyes, the bigger outage though was TATA Communications (AS6453), which was down or at least having issues of some sort for 1 hr 11 min.
I don't know about "4th time this month alone" which seems a bit editorialised, but our company is off AWS for now, and most likely for good.
We're not in the cloud or tech business per se, and as such our customers are not really understanding of technical issues which unfortunately means they are blaming us, and our own reputation is on the line because of AWS' shortcomings.
We did consider Google but for now OVH is the only major provider which is both reliable and secure (w.r.t. court-and-gag letters from government and intelligence agencies) as far as we're concerned. We still use Hetzner and Scaleway for some older stuff and also because of Scaleway's low prices, but it's likely we're moving everything to OVH in the future.
Wonder how much of this is caused by engineers rushing to fix log4j issues at AWS so they can enjoy the holidays? Their stack is heavily Java-based, isn't it?
But you know it wouldn't if it had to rely on your home connection AND Amazon. It's like all those major websites that rely on 10 domains to serve one webpage...