It seems global to me. This is really strange compared to AWS. I don't remember ...

murat124 · on June 2, 2019

You obviously don't recall the early years of AWS. Half of internet would go down for hours.

djsumdog · on June 2, 2019

Back when S3 failures would take town Reddit, parts of Twitter .. Netflix survived because they had additional availability zones. I can remember the bigger names started moving more stuff to their own data centers.

AWS tries to lock people in to specific services now which makes it really difficult to migrate. It also takes a while before you get to the tipping point where hosting your own is more financially viable .. and then if you trying migrating, you're stuck using so many of their services you can't even do cost comparisons.

angstrom · on June 3, 2019

Netflix actually added the additional AZs because of a prior outage that did take them down.

"After a 2012 storm-related power outage at Amazon during which Netflix suffered through three hours of downtime, a Netflix engineer noted that the company had begun to work with Amazon to eliminate “single points of failure that cause region-wide outages.” They understood it was the company’s responsibility to ensure Netflix was available to entertain their customers no matter what. It would not suffice to blame their cloud provider when someone could not relax and watch a movie at the end of a long day."

https://www.networkworld.com/article/3178076/why-netflix-did...

aaronblohowiak · on June 3, 2019

We went multi-region as a result of the 2012 inc. source: I now manage the team responsible for performing regional evacuations (shifting traffic and scaling the savior regions).

mkl · on June 3, 2019

That sounds fascinating! How often does your team have to leap into action?

aaronblohowiak · on June 3, 2019

We don’t usually discuss the frequency of unplanned failovers, but I will tell you that we do a planned failover at least every two weeks. The team also uses traffic shaping to perform whole system load tests with production traffic, which happens quarterly.

justinator · on June 3, 2019

Do you do any chaos testing? Seems like it would slot right in, there.

Zobat · on June 3, 2019

I'd say yes. I heard about this tool just a week ago at a developer conference.

https://github.com/Netflix/chaosmonkey

a_t48 · on June 3, 2019

Netflix was a pioneer of chaos testing, right? https://en.m.wikipedia.org/wiki/Chaos_engineering

aaronblohowiak · on June 3, 2019

https://www.oreilly.com/library/view/chaos-engineering/97814... ;)

arainwater · on June 3, 2019

they have invented the term, so probably yes :)

azimuth11 · on June 3, 2019

I think some Google engineers published a free Meap book on service relatability and uptime guarantees. Seemingly counterintuitive, scheduling downtime, without other teams’ prior knowledge, encourages teams to handle outages properly and reduce single points of failure, among other things.

fnord123 · on June 3, 2019

Service Reliability Engineering is on OReilly press. It's a good book. Up there with ZeroMQ and Data Intensive Applications as maybe the best three books from OReilly in the past ten years.

fnord123 · on June 3, 2019

Derp, Site Reliability Engineering.

https://landing.google.com/sre/books/

sulam · on June 2, 2019

I think you’re misremembering about Twitter, which still doesn’t use AWS except for data analytics and cold storage last I heard (2 months ago).

ceejayoz · on June 3, 2019

Avatars were hosted on S3 for a long time, IIRC.

StreamBright · on June 2, 2019

I am not sure if a single S3 outage pushed any big names into their own "datacenter". S3 has still the world record of reliability that you cannot challenge with your inhouse solutions. You can prove it otherwise. I would love to hear a solution that has the same durability, avabiality and scalability as S3.

For the downvoters, please just link here the proof if you disagree.

Here are the S3 numbers: https://aws.amazon.com/s3/sla/

snicker7 · on June 2, 2019

It's not so much AWS vs. in-house. But AWS (or GCP/DO/etc.) vs. multi/hybrid solutions. The latter of which would presumably have lower downtime.

didibus · on June 3, 2019

I don't see why multi/hybrid would have lower downtime. All cloud providers as far as I know, though I know mostly of AWS, already have their services in multiple data-centers and their endpoints in multiple regions. So if you make yourself use more then one of their AZs and Region, you would be just as multi as with your own data center.

zambal · on June 3, 2019

Using a single cloud provider with a multiple region setup won't protect you from some issues in their networking infrastructure, as the subject of this thread supposedly shows.

Although I guess depending on how your own infrastructure is setup, even a multi cloud provider setup won't save you from a network outage like the current Google cloud one.

didibus · on June 4, 2019

Hum, I'm not an expert on Google cloud, but for AWS, regions are completely independent and run their own networking infrastructure. So if you really wanted to tolerate a region infrastructure failure, you could design your app to fail over to another region. There shouldn't be any single point of failure between the regions, at least as far as I know.

solidasparagus · on June 2, 2019

Why would you think that self-managed has lower downtime than AWS using multiple datacenters/regions?

KirinDave · on June 3, 2019

Actually, I imagine that if you could go multi-regional then your self-managed solution may be directly competitive in terms of uptime. The idea that in-house can't be multi-regional is a bit old fashioned in 2019.

StreamBright · on June 3, 2019

For several reasons, most notably: staff, build quality, standards, knowledge of building extremely reliable datacenters. Most of the people who are the most knowledgeable about datacenters also happen to be working for cloud vendors. On the top of that: software. Writing reliable software at scale is a challenge.

ummonk · on June 3, 2019

Multi/hybrid means you use both self managed and AWS datacenters.

dgoldstein0 · on June 3, 2019

Cannot challenge with your own inhouse solutions, you say?

Challenge Accepted... and defeated: https://blogs.dropbox.com/tech/2016/03/magic-pocket-infrastr...

but to be fair, storage is core to Dropbox's business... this is not true for most companies.

disclaimer: I work for Dropbox, though not on Magic Pocket.

qes · on June 3, 2019

> For the downvoters, please just link here the proof if you disagree.

> Here are the S3 numbers: https://aws.amazon.com/s3/sla/

99.9%

https://azure.microsoft.com/en-au/support/legal/sla/storage/...

99.99%

ti_ranger · on June 4, 2019

>> Here are the S3 numbers: https://aws.amazon.com/s3/sla/

> 99.9%

(single-region)

There doesn't seem to be an SLA on S3-cross-region-replication configurations, but I am not aware of a multi-region S3 (read) outage, ever.

> https://azure.microsoft.com/en-au/support/legal/sla/storage/....

> 99.99%

99.99% is for "Read Access-Geo Redundant Storage (RA-GRS)"

Their equivalent SLA is the same (99.9% for "Locally Redundant Storage (LRS), Zone Redundant Storage (ZRS), and Geo Redundant Storage (GRS) Accounts.").

StreamBright · on June 3, 2019

Azure is a cloud solution. The thread is about how a random datacenter with a random solution is better than S3.

ozymandias12 · on June 3, 2019

Wow, he’s comparing the storages SLA of the two biggest cloud services in the world. Pedantic behavior should hurt.

fusl · on June 2, 2019

> For the downvoters, please just link here the proof if you disagree.

https://wasabi.com/

snazz · on June 2, 2019

How can they possibly guarantee eleven nines? Considering I’ve never heard of this company and they offer such crazy-sounding improvements over the big three, it feels like there should be a catch.

ignoramous · on June 2, 2019

11 9s isn't uncommon. AWS S3 does 11 9s (upto 16 9s with cross region replication?) for data durability, too. AFAIK, AWS published papers about their use of formal methods to ascertain bugs from other parts of the system didn't creep in to affect durability/availability guarantees: https://blog.acolyer.org/2014/11/24/use-of-formal-methods-at...

This is a pretty neat and concise read on ObjectStorage in-use at BigTech, in case you're interested: https://maisonbisson.com/post/object-storage-prior-art-and-l...

nullwasamistake · on June 3, 2019

You have to be kidding me. 14 9's is already microseconds a year. Surely below anybody's error bar for whether a service is down or not.

16 9's and aws should easily last as long as the great pyramids without a second worth of outage.

What a joke

agwa · on June 3, 2019

The 16 9's are for durability, not availability. AWS is not saying S3 will never go down; they're saying it will rarely lose your data.

nullwasamistake · on June 3, 2019

This number is still total bullshit. They could lose a few kb and be above that for centuries

deanCommie · on June 3, 2019

It's not about losing a few kb here and there.

It's about losing entire data centers to massive natural disasters once in a century.

jefftk · on June 3, 2019

None of the big cloud providers have unrecoverably lost hosted data yet, despite sorting vast volumes, so this doesn't seem BS to me.

mentat · on June 3, 2019

AWS lost data in Australia a few years ago due to a power outage I believe.

anbop · on June 3, 2019

on EBS, not on S3. EBS has much lower durability guarantees

nullwasamistake · on June 3, 2019

Not losing any data yet doesn't give justification for such absurd numbers

joshuamorton · on June 3, 2019

Those numbers probably aren't as absurd as you think. 16 9s is, I think 10 bytes lost per exabyte-year of data storage.

There's perhaps the additional asterisk of "and we haven't suffered a catastrophic event that entirely puts us out of business". (Which is maybe only terrorist attacks). Because then you're talking about losing data only when cosmic-ray bitflips happen simultaneously in data centers on different continents, which I'd expect doesn't happen too often.

joshuamorton · on June 3, 2019

This is for data loss. 11 9s is like 1 byte lost per terabyte-year or something, which isn't an unreasonable number.

StreamBright · on June 3, 2019

This is why I linked the SLA page which you obviously have not read. There are different numbers for durability and availability.

johnmaguire · on June 2, 2019

For data durability? I believe some AWS offerings also have an SLA of eleven 9's of data durability.

sascha_sl · on June 3, 2019

11 9s of durability, barely two 9s of availability

I'm sure that's okay if you do bulk processing / time-independent analysis, but don't host production assets on wasabi.

StreamBright · on June 3, 2019

I was asking numbers of reliability, durability and availability for a service like S3. What does wasabi has to do with that?

ljm · on June 2, 2019

Always in Virginia, because US-east has always been cheaper.

mooreds · on June 2, 2019

I know a consultant who calls that region us-tirefire-1.

autotune · on June 2, 2019

I and some previous coworkers call it the YOLO region.

thruhiker · on June 2, 2019

The only regions that are more expensive than us-east-1 in the States are GovCloud and us-west-1 (Bay Area). Both us-west-2 (Oregon) and us-east-2 (Ohio) are priced the same as us-east-1.

angstrom · on June 3, 2019

I would probably go with US-EAST-2 just because it's isolated from anything except perhaps a freak Tornado and better situated on the eastern US. Latency to/from there should be near optimal for most eastern US/Canada population.

thruhiker · on June 3, 2019

One caveat with us-east-2 is that it appears to get new features after us-east-1 and us-west-2. You can view the service support by region here: https://aws.amazon.com/about-aws/global-infrastructure/regio....

angstrom · on June 6, 2019

Fair point. It depends on what the project is.

Scoundreller · on June 2, 2019

And for those of us in GST/HST/VAT land, hosting in USA saves us some tax expenditures.

AnssiH · on June 2, 2019

How?

At least in EU services bought from overseas are subject to reverse charge, i.e. self-assessment of VAT (Article 196 of https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:02... ).

Though note that if you are an EU AWS customer, you are not buying from outside EU, you are buying from Amazon's EU branches regardless of AWS region. If Amazon has a local branch in your country, they charge you VAT as any local company does. Otherwise you buy from an Amazon branch in another EU country, and you again need to self-assess VAT (reverse charge) per Article 196.

Scoundreller · on June 2, 2019

My experience is with Canadian HST.

Since AWS built a DC in Canada, I’m paying HST on my Route53 expenses, but not on my S3 charges in non-Canadian DCs.

I’m not an HST registrant (small supplier, or if you’re just using services personally), so there’s nothing to self-assess.

Even if self-assessment was required, you get some deferral on paying (unless you have to remit at time of invoice?).

AnssiH · on June 3, 2019

Makes sense.

I believe it works differently in EU (i.e. US DCs taxed) as per Article 44 the place of supply of services is the customer's country if the customer has no establishment in the supplier's country.

paranoidrobot · on June 3, 2019

AWS is registered for Australian GST - they therefore charge GST on all(ish) services[0].

IBM/Softlayer, Rackspace, Google Cloud, Microsoft and I imagine everyone else large enough to count also does, too.

For Australian businesses, at least, being charged GST isn't a problem - they can claim it as an input and get a tax credit[1].

[0] https://aws.amazon.com/tax-help/australia/

[1] https://www.ato.gov.au/Business/GST/Claiming-GST-credits/

BillinghamJ · on June 2, 2019

You know, normally you still have to pay that tax - just through a reverse charge process

Scoundreller · on June 2, 2019

Not the case in Canada if you’re not an HST registrant (non-business or a small enough business where you’re exempt).

Even if you did have to self-assess, better to pay later than right away.

threeseed · on June 2, 2019

Mostly because those sites were never architected to work across multiple availability zones.

rincebrain · on June 2, 2019

Years ago, when I was playing with AWS in a course on building cloud-hosted services, it was well-known that all the AWS management was hosted out of a single zone, and there were several days we had to cancel class because us-east-1 had an outage, so while technically all our VMs hosted out of other AZs were extant, all our attempts to manage our VMs via the web UI or API were timing or erroring out.

I understand this is long-since resolved (I haven't tried building a service on Amazon in a couple years, so this isn't personal experience), but centralized failure modes in decentralized systems can persist longer than you might expect.

(Work for Google, not on Cloud or anything related to this outage that I'm aware of, I have no knowledge other than reading the linked outage page.)

jonhohle · on June 2, 2019

> it was well-known that all the AWS management was hosted out of a single zone, and there were several days we had to cancel class because us-east-1 had an outage

Maybe you mean region, because there is no way that AWS tools were ever hosted out of a single zone (of which there are 4 in us-east-1). In fact, as of a few years ago, the web interface wasn’t even a single tool, so it’s unlikely that there was a global outage for all the tools.

And if this was later than 2012, even more unlikely, since Amazon retail was running on EC2 among other services at that point. Any outage would be for a few hours, at most.

molesy · on June 3, 2019

Quoting https://docs.aws.amazon.com/general/latest/gr/rande.html

"Some services, such as IAM, do not support Regions; therefore, their endpoints do not include a Region."

There was a partial outage maybe a month and a half ago where our typical AWS Console links didn't work but another region did. My understanding is that if that outage were in us-east-1 then making changes to IAM roles wouldn't have worked.

scarface74 · on June 3, 2019

The original poster said that none of AWS services are in a single AZ, the quote you referenced says that IAMs do not support regions.

Your quote cd mean two things.

- that IAM services are hosted in one region (not one AZ)

And/Or

- that IAM is for the entire account not per region like other services (which is true)

chucky_z · on June 3, 2019

Just this year an issue in us-east-1 caused the console to fail pretty globally.

rincebrain · on June 3, 2019

Quite possibly, it has been a number of years at this point, and I didn't dig out the conversations about it for primary sourcing.

boulos · on June 2, 2019

Where are you based? If you’re in the US (or route through the US) and trying to reach our APIs (like storage.googleapis.com), you’ll be having a hard time. Perhaps even if the service you’re trying to reach is say a VM in Mumbai.

digaozao · on June 2, 2019

I am in Brazil, with servers in southamerica. Right now it seems back to normal.

nodesocket · on June 2, 2019

I have an instance in us-west-1 (Oregon) which is up, but an instance in us-west-2 (Los Angeles) which is down. Not sure if that means Oregon is unaffected though.

not_kurt_godel · on June 2, 2019

us-west-1 is Northern California (Bay area). us-west-2 is Oregon (Boardman).

nodesocket · on June 2, 2019

Incorrect. GCE us-west1 is the Dalles, Oregon and us-west2 is Los Angeles.

not_kurt_godel · on June 3, 2019

What I said is correct for AWS. In retrospect I guess the context was a bit ambiguous.

(I will note that I was technically more right in the most obnoxiously pedantic sense since the hyphenation style you used is unique to AWS - `us-west-1` is AWS-style while `us-west1` is GCE-style :P)

vast · on June 2, 2019

EUW doesn't seem to be affected.

EugeneOZ · on June 2, 2019

My instance in Belgium works fine