
In light of how Slack and other companies haven't been able to get a decent level of uptime, I have to say, the company known for making huge web applications that don't go down in shame every couple of months is probably Google. I can't remember the last time Gmail was down. It just works! If Google is down, your internet is probably down.

Their expertise and discipline in distributed applications is unrivaled. I'm guessing it's because they have datacenters everywhere with huge fat pipes in between, and their SREs are probably top notch and don't take shortcuts.

Google gets a whole bunch of things wrong at times, but some things, I gotta say, they've nailed.



Google is expert at designing services so that you won't notice when there is downtime.

Take Google Search for example. When there is downtime, results might be slightly less accurate, or the parcel tracking box might not appear, or the page won't show the "last visited" time beside search results.

The SREs are running around fixing whatever subsystem is down or broken, but you, the user, probably don't notice.
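
A rough sketch of that style of graceful degradation (names and timeouts made up, purely illustrative): every optional widget is best-effort with a small time budget, and the page simply drops whatever misses it.

    import concurrent.futures

    # Hypothetical stand-ins for optional subsystems; either could be down.
    def fetch_parcel_tracking(query):
        return {"carrier": "UPS", "status": "in transit"}

    def fetch_last_visited(query):
        return {"example.com": "2 days ago"}

    OPTIONAL_WIDGETS = {
        "parcel_tracking": fetch_parcel_tracking,
        "last_visited": fetch_last_visited,
    }

    def render_results(query, core_results, budget_s=0.15):
        page = {"results": core_results}
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = {name: pool.submit(fn, query) for name, fn in OPTIONAL_WIDGETS.items()}
            for name, fut in futures.items():
                try:
                    # each optional widget gets a small time budget
                    page[name] = fut.result(timeout=budget_s)
                except Exception:
                    # subsystem down or slow: silently drop the widget,
                    # the core results still render
                    pass
        return page

    print(render_results("my package", ["result 1", "result 2"]))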


The reality is, this is how you design highly available systems, and it's also imo one of the reasons microservices have gained so much popularity.

Driving features with microservices makes it easier to isolate their failures and just fall back to not having that feature. The trade-off is that monoliths are generally easier to work with when the product and team are small, and failure scenarios in distributed systems are often much more complex.

An analogy to your Google failure examples for Slack might be something like the "somebody is typing" feature failing for some reason. In an SOA you would expect it to just stop working without breaking anything else, but one could easily imagine a monolith where it causes a cascading failure and takes the whole app down. Most services have countless dependencies like this.
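
A toy sketch of that isolation (all names hypothetical, not how Slack actually implements it): wrap the typing-indicator call in a circuit breaker so its failure degrades to "nobody is typing" instead of cascading.

    import time

    # Toy circuit breaker around a hypothetical "somebody is typing" service.
    # After a few consecutive failures we stop calling it for a while and just
    # report nobody typing, so a broken indicator can't drag the rest down.
    class TypingIndicator:
        def __init__(self, fetch, max_failures=3, cooldown_s=30):
            self.fetch = fetch              # callable that talks to the typing service
            self.max_failures = max_failures
            self.cooldown_s = cooldown_s
            self.failures = 0
            self.open_until = 0.0

        def who_is_typing(self, channel_id):
            if time.monotonic() < self.open_until:
                return []                   # breaker open: degrade to "nobody typing"
            try:
                result = self.fetch(channel_id)
                self.failures = 0
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.open_until = time.monotonic() + self.cooldown_s
                return []                   # fall back instead of propagating the error

    def flaky_fetch(channel_id):
        raise ConnectionError("typing service unreachable")

    indicator = TypingIndicator(flaky_fetch)
    for _ in range(5):
        print(indicator.who_is_typing("C123"))   # always [], never an exception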


While their mail service does have a remarkable track record for uptime, that same record is not shared by many of their other services.

I admin a number of GSuite accounts, and we experience fairly frequent (~monthly) periods of strange behavior with Hangouts/Meet and Google Drive.

Fortunately Google is very good about providing updates via email to administrators as they're working through an issue.


Funny you should mention Google, as something is down over there right now. Lots of reports of Chromecasts being dead right now; I'm assuming something at Google is down which is causing this.


Oh, interesting. Thanks for pointing this out. I was having Chromecast trouble this morning and didn't even think to check if it was a widespread issue.


Gmail is one use case, and it was one of the original services from Google, so it has one of the longest "bake" times with regard to knowing how to keep it online.

Every service/team has to go through a period of growing pains as they learn, improve, and fix the code to be more stable. You can't just take the learnings from one service and apply them to another; it has to be architected and written into the code, and most teams start each new project/service with fresh code.


Facebook springs to mind as well.


Facebook quite often breaks their stuff and/or goes down; however, their outages usually last just a few minutes.


They recently pushed out an iOS update for Messenger that crashed to springboard any time you tried to resume it from the background. It took a couple of hours to get a new build up, plus however long for affected users to all install the new version.

I'd love to hear the story of how that made it through testing.


What does "crashed to springboard" mean?


Sorry, I should have just said "home screen" for clarity, but SpringBoard is the iOS application that manages the home screen. It's akin to Finder.

A fresh launch of Messenger worked until you switched out and put it in the background. When you tried to resume it (either from the home icon or the task switcher) it would immediately die, and it could be launched fresh on the second try.

Basically every time you wanted to use it you either had to kill it in the app switcher and then launch it, or launch it twice.

https://www.theverge.com/2018/6/15/17468136/facebook-messeng...

My favorite part is that since Facebook doesn't do useful release notes (best guess because they're testing different features on different users and changes never actually land for everyone in a specific version), all the App Store said for the busted version was "We update the app regularly to make it better for you!" Oooops.

Though that's an interesting thought, I wonder if a feature had rolled out to a subset of users and it was crashing because it tried to pull some piece of account info that doesn't exist on accounts without it? Testing still should have caught that, but if the test accounts were all testing the new feature I could see it sneaking through. From my end it looked like a 100% reproducible crash on resume, which is pretty sad to release.


SpringBoard is essentially the Finder application on the iPhone, so "crashed to springboard" means crashed to the home screen, basically.


Facebook breaks features very often. Sometimes things go missing and come back a week later. Dropbox does this a lot too.


It's the same for all sites beyond a certain size. It's never fully up. It's very rarely fully down. It's gradually degraded in ways that you hopefully don't see, but sometimes do. Or maybe you don't see it, but others do. etc etc etc. Availability isn't boolean once you have users.
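
To put made-up numbers on it (purely illustrative): once you track success rates per feature, "is the site up?" stops having a yes/no answer.

    # Illustrative numbers only: availability per feature, not a single boolean.
    request_counts = {"search": 1_000_000, "typing_indicator": 200_000, "file_upload": 50_000}
    success_counts = {"search": 999_900,   "typing_indicator": 140_000, "file_upload": 49_995}

    for feature, total in request_counts.items():
        print(f"{feature}: {success_counts[feature] / total:.2%} of requests succeeded")

    overall = sum(success_counts.values()) / sum(request_counts.values())
    print(f"overall: {overall:.2%}")   # looks healthy even with one feature mostly broken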


And it makes headlines when they are down even partially. Same with iCloud (although their track record isn't the greatest).


And those SREs?

They use IRC.


From the Google SRE book:

> Google has found IRC to be a huge boon in incident response. IRC is very reliable and can be used as a log of communications about this event, and such a record is invaluable in keeping detailed state changes in mind. We’ve also written bots that log incident-related traffic (which is helpful for postmortem analysis), and other bots that log events such as alerts to the channel. IRC is also a convenient medium over which geographically distributed teams can coordinate.

https://landing.google.com/sre/book/chapters/managing-incide...
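
The logging-bot part is not much code, by the way. A bare-bones sketch of the idea (server, channel, and nick are placeholders; no reconnect, TLS, or error handling):

    import socket
    import time

    # Minimal incident-logging IRC bot in the spirit of the SRE book quote:
    # join a channel and append every message to a file for the postmortem.
    SERVER, PORT = "irc.example.com", 6667
    NICK, CHANNEL = "incident-logbot", "#incident-1234"

    def run():
        sock = socket.create_connection((SERVER, PORT))
        sock.sendall(f"NICK {NICK}\r\nUSER {NICK} 0 * :{NICK}\r\n".encode())
        sock.sendall(f"JOIN {CHANNEL}\r\n".encode())
        buf = b""
        with open("incident.log", "a") as log:
            while True:
                buf += sock.recv(4096)
                while b"\r\n" in buf:
                    line, buf = buf.split(b"\r\n", 1)
                    text = line.decode(errors="replace")
                    if text.startswith("PING"):
                        # keep the connection alive
                        sock.sendall(text.replace("PING", "PONG", 1).encode() + b"\r\n")
                    elif "PRIVMSG" in text:
                        # timestamp every channel message
                        log.write(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {text}\n")
                        log.flush()

    if __name__ == "__main__":
        run()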


How many SREs does Google have on said IRC system?

How many SREs are at Slack, working on keeping their systems up?

Finally, how many SREs could your company dedicate to keeping an internal IRC server up, and supporting it as an internal product?

I can throw ircd on a server, no problem, but there's a little bit more to six nines of uptime than `apt-get install`. The decision whether to use IRC or not should keep in mind Google's resources (in number of people, number of data centers, and amount of money to throw at redundant hardware) to make sure it never goes down, especially when the data center is on fire around you.
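
For scale, some rough arithmetic on what "N nines" actually allows per year:

    # Allowed downtime per year for a given number of nines.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    for nines in range(3, 7):
        availability = 1 - 10 ** -nines
        downtime = SECONDS_PER_YEAR * (1 - availability)
        print(f"{nines} nines ({availability:.6f}): {downtime:.1f} s/year of downtime")
    # six nines comes out to roughly 31.5 seconds of downtime per year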


They use IRC, and they have a previously-communicated contact plan with redundant contact methods for when IRC is unavailable.



