
In light of how Slack and other companies haven't been able to get a decent level of uptime, I have to say, the company known for making huge web applications that don't go down in shame every couple of months is probably Google. I can't remember the last time Gmail was down. It just works! If Google is down, your internet is probably down.

Their expertise and discipline in distributed applications is unrivaled. I'm guessing it's because they have datacenters everywhere with huge fat pipes in between, and their SREs are probably top notch and don't take shortcuts.

Google gets a whole bunch of things wrong at times, but some things, I gotta say, they've nailed.



Google is expert at designing services so that you won't notice when there is downtime.

Take Google Search for example. When there is downtime, results might be slightly less accurate, or the parcel tracking box might not appear, or the page won't show the "last visited" time beside search results.

The SREs are running around fixing whatever subsystem is down or broken, but you, the user, probably don't notice.
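
A rough sketch of that style of graceful degradation (names and timeouts made up, purely illustrative): every optional widget is best-effort with a small time budget, and the page simply drops whatever misses it.

    import concurrent.futures

    # Hypothetical stand-ins for optional subsystems; either could be down.
    def fetch_parcel_tracking(query):
        return {"carrier": "UPS", "status": "in transit"}

    def fetch_last_visited(query):
        return {"example.com": "2 days ago"}

    OPTIONAL_WIDGETS = {
        "parcel_tracking": fetch_parcel_tracking,
        "last_visited": fetch_last_visited,
    }

    def render_results(query, core_results, budget_s=0.15):
        page = {"results": core_results}
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = {name: pool.submit(fn, query) for name, fn in OPTIONAL_WIDGETS.items()}
            for name, fut in futures.items():
                try:
                    # each optional widget gets a small time budget
                    page[name] = fut.result(timeout=budget_s)
                except Exception:
                    # subsystem down or slow: silently drop the widget,
                    # the core results still render
                    pass
        return page

    print(render_results("my package", ["result 1", "result 2"]))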


The reality is, this is how you design highly available systems, and it's also imo one of the reasons microservices have gained so much popularity.

Driving features with microservices makes it easier to isolate their failures and just fall back to not having that feature. The trade-off is that monoliths are generally easier to work with when the product and team are small, and failure scenarios in distributed systems are often much more complex.

An analogy to your Google failure examples for Slack might be something like the "somebody is typing" feature failing for some reason. In an SOA you would expect it to just stop working without breaking anything else, but one could easily imagine a monolith where it causes a cascading failure and takes the whole app down. Most services have countless dependencies like this.
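
A toy sketch of that isolation (all names hypothetical, not how Slack actually implements it): wrap the typing-indicator call in a circuit breaker so its failure degrades to "nobody is typing" instead of cascading.

    import time

    # Toy circuit breaker around a hypothetical "somebody is typing" service.
    # After a few consecutive failures we stop calling it for a while and just
    # report nobody typing, so a broken indicator can't drag the rest down.
    class TypingIndicator:
        def __init__(self, fetch, max_failures=3, cooldown_s=30):
            self.fetch = fetch              # callable that talks to the typing service
            self.max_failures = max_failures
            self.cooldown_s = cooldown_s
            self.failures = 0
            self.open_until = 0.0

        def who_is_typing(self, channel_id):
            if time.monotonic() < self.open_until:
                return []                   # breaker open: degrade to "nobody typing"
            try:
                result = self.fetch(channel_id)
                self.failures = 0
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.open_until = time.monotonic() + self.cooldown_s
                return []                   # fall back instead of propagating the error

    def flaky_fetch(channel_id):
        raise ConnectionError("typing service unreachable")

    indicator = TypingIndicator(flaky_fetch)
    for _ in range(5):
        print(indicator.who_is_typing("C123"))   # always [], never an exception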


While their mail service does have a remarkable track record for uptime, that same record is not shared by many of their other services.

I admin a number of GSuite accounts, and we experience fairly frequent (~monthly) periods of strange behavior with Hangouts/Meet and Google Drive.

Fortunately Google is very good about providing updates via email to administrators as they're working through an issue.


Funny you should mention Google, as something is down over there right now. Lots of reports of Chromecasts being dead right now; I'm assuming something at Google is down which is causing this.


Oh, interesting. Thanks for pointing this out. I was having Chromecast trouble this morning and didn't even think to check if it was a widespread issue.


Gmail is one use case, and it was one of the original services from Google, so it has one of the longest "bake" times with regard to knowing how to keep it online.

Every service/team has to go through a period of growing pains as they learn, improve, and fix the code to be more stable. You can't just take the learnings from one service and apply them to another; it has to be architected and written into the code, and most teams start each new project/service with fresh code.


Facebook springs to mind as well.


Facebook quite often breaks their stuff and/or goes down; however, their outages usually last just a few minutes.


They recently pushed out an iOS update for Messenger that crashed to springboard any time you tried to resume it from the background. It took a couple of hours to get a new build up, plus however long for affected users to all install the new version.

I'd love to hear the story of how that made it through testing.


What does "crashed to springboard" mean?


Sorry, I should have just said "home screen" for clarity, but SpringBoard is the iOS application that manages the home screen. It's akin to Finder.

A fresh launch of Messenger worked until you switched out and put it in the background. When you tried to resume it (either from the home icon or the task switcher) it would immediately die, and it could be launched fresh on the second try.

Basically every time you wanted to use it you either had to kill it in the app switcher and then launch it, or launch it twice.

https://www.theverge.com/2018/6/15/17468136/facebook-messeng...

My favorite part is that since Facebook doesn't do useful release notes (best guess because they're testing different features on different users and changes never actually land for everyone in a specific version), all the App Store said for the busted version was "We update the app regularly to make it better for you!" Oooops.

Though that's an interesting thought, I wonder if a feature had rolled out to a subset of users and it was crashing because it tried to pull some piece of account info that doesn't exist on accounts without it? Testing still should have caught that, but if the test accounts were all testing the new feature I could see it sneaking through. From my end it looked like a 100% reproducible crash on resume, which is pretty sad to release.


SpringBoard is essentially the Finder application on the iPhone, so "crashed to springboard" means crashed to the home screen, basically.


Facebook breaks features very often. Sometimes things go missing and come back a week later. Dropbox does this a lot too.


It's the same for all sites beyond a certain size. It's never fully up. It's very rarely fully down. It's gradually degraded in ways that you hopefully don't see, but sometimes do. Or maybe you don't see it, but others do. etc etc etc. Availability isn't boolean once you have users.
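
To put made-up numbers on it (purely illustrative): once you track success rates per feature, "is the site up?" stops having a yes/no answer.

    # Illustrative numbers only: availability per feature, not a single boolean.
    request_counts = {"search": 1_000_000, "typing_indicator": 200_000, "file_upload": 50_000}
    success_counts = {"search": 999_900,   "typing_indicator": 140_000, "file_upload": 49_995}

    for feature, total in request_counts.items():
        print(f"{feature}: {success_counts[feature] / total:.2%} of requests succeeded")

    overall = sum(success_counts.values()) / sum(request_counts.values())
    print(f"overall: {overall:.2%}")   # looks healthy even with one feature mostly broken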


And it makes headlines when they are down even partially. Same with iCloud (although their track record isn't the greatest).


And those SREs?

They use IRC.


From the Google SRE book:

> Google has found IRC to be a huge boon in incident response. IRC is very reliable and can be used as a log of communications about this event, and such a record is invaluable in keeping detailed state changes in mind. We’ve also written bots that log incident-related traffic (which is helpful for postmortem analysis), and other bots that log events such as alerts to the channel. IRC is also a convenient medium over which geographically distributed teams can coordinate.

https://landing.google.com/sre/book/chapters/managing-incide...
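
The logging-bot part is not much code, by the way. A bare-bones sketch of the idea (server, channel, and nick are placeholders; no reconnect, TLS, or error handling):

    import socket
    import time

    # Minimal incident-logging IRC bot in the spirit of the SRE book quote:
    # join a channel and append every message to a file for the postmortem.
    SERVER, PORT = "irc.example.com", 6667
    NICK, CHANNEL = "incident-logbot", "#incident-1234"

    def run():
        sock = socket.create_connection((SERVER, PORT))
        sock.sendall(f"NICK {NICK}\r\nUSER {NICK} 0 * :{NICK}\r\n".encode())
        sock.sendall(f"JOIN {CHANNEL}\r\n".encode())
        buf = b""
        with open("incident.log", "a") as log:
            while True:
                buf += sock.recv(4096)
                while b"\r\n" in buf:
                    line, buf = buf.split(b"\r\n", 1)
                    text = line.decode(errors="replace")
                    if text.startswith("PING"):
                        # keep the connection alive
                        sock.sendall(text.replace("PING", "PONG", 1).encode() + b"\r\n")
                    elif "PRIVMSG" in text:
                        # timestamp every channel message
                        log.write(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {text}\n")
                        log.flush()

    if __name__ == "__main__":
        run()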


How many SREs does Google have on said IRC system?

How many SREs are at Slack, working on keeping their systems up?

Finally, how many SREs could your company dedicate to keeping an internal IRC server up, and supporting it as an internal product?

I can throw ircd on a server, no problem, but there's a little bit more to six nines of uptime than `apt-get install`. The decision whether to use IRC or not should keep in mind Google's resources (in number of people, number of data centers, and amount of money to throw at redundant hardware) to make sure it never goes down, especially when the data center is on fire around you.
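
For scale, some rough arithmetic on what "N nines" actually allows per year:

    # Allowed downtime per year for a given number of nines.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    for nines in range(3, 7):
        availability = 1 - 10 ** -nines
        downtime = SECONDS_PER_YEAR * (1 - availability)
        print(f"{nines} nines ({availability:.6f}): {downtime:.1f} s/year of downtime")
    # six nines comes out to roughly 31.5 seconds of downtime per year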


They use IRC, and they have a previously-communicated contact plan with redundant contact methods for when IRC is unavailable.



