In light of how Slack and other companies haven't been able to get a decent level of uptime, I have to say, the company known for making huge web applications that don't embarrassingly go down every couple of months is probably Google. I can't remember the last time Gmail was down. It just works! If Google is down, it's probably your internet that's down.
Their expertise and discipline in distributed applications is unrivaled. I'm guessing it's because they have datacenters everywhere with huge fat pipes in between, and their SREs are probably top notch and don't take shortcuts.
Google gets a whole bunch of things wrong at times, but some things, I gotta say, they've nailed.
Google is expert at designing services so that you won't notice when there is downtime.
Take Google Search, for example. When there is downtime, results might be slightly less accurate, the parcel tracking box might not appear, or the page won't show the "last visited" time beside search results.
The SREs are running around fixing whatever subsystem is down or broken, but you, the user, probably don't notice.
The reality is this is how you design highly available systems, and it is also imo one of the reasons microservices have gained so much popularity.
Driving features with microservices makes it easier to isolate their failure and just fall back to not having that feature. The trade-off is that monoliths are generally easier to work with when the product and team are small, and failure scenarios with distributed systems are often much more complex.
An analogy to your Google failure examples for Slack might be something like the "somebody is typing" feature failing for some reason. In an SoA you would expect it to just stop working without breaking anything else, but one could easily imagine a monolith where it causes a cascading failure and takes the whole app down. Most services have countless dependencies like this.
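To make that concrete, here's a minimal sketch (Python, with a made-up typing-indicator endpoint) of the kind of graceful degradation being described: the caller times out quickly and falls back to "nobody is typing" instead of letting the failure cascade.

```python
import requests

# Hypothetical internal endpoint -- not a real service.
TYPING_SERVICE_URL = "https://typing-indicator.internal/api/typing"

def who_is_typing(channel_id: str) -> list:
    """Return the users currently typing in a channel.

    If the typing-indicator service is slow or down, degrade gracefully:
    the rest of the app keeps working and the feature simply disappears.
    """
    try:
        resp = requests.get(
            TYPING_SERVICE_URL,
            params={"channel": channel_id},
            timeout=0.2,  # tight timeout: this feature isn't worth waiting for
        )
        resp.raise_for_status()
        return resp.json().get("users", [])
    except requests.RequestException:
        # Service unavailable -> show "nobody is typing" rather than letting
        # the failure propagate and take the whole page down with it.
        return []
```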
Funny you should mention Google, as something is down over there right now. Lots of reports of Chromecasts being dead right now; I'm assuming something at Google is down which is causing this.
Oh, interesting. Thanks for pointing this out. I was having Chromecast trouble this morning and didn't even think to check if it was a widespread issue.
Gmail is one use case, and it was one of the original services from Google, so it has one of the longest "bake" times with regards to knowing how to keep it online.
Every service/team has to go through a period of growing pains as they learn, improve, and fix the code to be more stable. You can't just take the learnings from one service and apply them to another; it has to be architected and written into the code, and most teams start each new project/service with fresh code.
They recently pushed out an iOS update for Messenger that crashed to springboard any time you tried to resume it from background. It took a couple of hours to get a new build up, plus however long for affected users to all install the new version.
I'd love to hear the story of how that made it through testing.
Sorry, should have just said "home screen" for clarity, but SpringBoard is the iOS application that makes the home screen. It's akin to Finder.
A fresh launch of Messenger worked until you switched out and put it in the background. When you tried to resume it (either from home icon or task switcher) it would immediately die and could be launched fresh on the second try.
Basically every time you wanted to use it you either had to kill it in the app switcher and then launch it, or launch it twice.
My favorite part is that since Facebook doesn't do useful release notes (best guess because they're testing different features on different users and changes never actually land for everyone in a specific version), all the App Store said for the busted version was "We update the app regularly to make it better for you!" Oooops.
Though that's an interesting thought, I wonder if a feature had rolled out to a subset of users and it was crashing because it tried to pull some piece of account info that doesn't exist on accounts without it? Testing still should have caught that, but if the test accounts were all testing the new feature I could see it sneaking through. From my end it looked like a 100% reproducible crash on resume which is pretty sad to release.
It's the same for all sites beyond a certain size. It's never fully up. It's very rarely fully down. It's gradually degraded in ways that you hopefully don't see, but sometimes do. Or maybe you don't see it, but others do. etc etc etc. Availability isn't boolean once you have users.
> Google has found IRC to be a huge boon in incident response. IRC is very reliable and can be used as a log of communications about this event, and such a record is invaluable in keeping detailed state changes in mind. We’ve also written bots that log incident-related traffic (which is helpful for postmortem analysis), and other bots that log events such as alerts to the channel. IRC is also a convenient medium over which geographically distributed teams can coordinate.
How many SREs does Google have on said IRC system?
How many SREs are at Slack, working on keeping their systems up?
Finally, how many SREs could your company dedicate to keeping an internal IRC server up, and supporting it as an internal product?
I can throw ircd on a server, no problem, but there's a little bit more to six nines of uptime than `apt-get install`. The decision whether or not to use IRC should keep in mind Google's resources (in number of people, number of data centers, and amount of money to throw at redundant hardware) for making sure it never goes down, especially when the data center is on fire around you.
Yeah, Stripe does the same thing with their status page. I get alerts that they have an outage at least once a week, and more often than not it never shows up as anything in their history. Honestly, this is my only significant beef with the service, and I've been using it for years now with multiple integrations.
You know how much of the community uses one messaging system when, 15 minutes after it goes down, it has over 40 points on the front page!
This says a lot about how it's a single point of failure in modern company comms.
It's even worrying to think about how some users probably have production-dependent (dare I postulate it) workflows in Slack that get crippled by its outage...
ITT: Chat about decentralisation that will ultimately lead to no action.*
*Because we've had this discussion so many times before...
Yes it's a single point of failure, but so what? I don't particularly care whether other organizations fail at the same time as I do, I just care whether I fail. Hosting my own chat system does not solve that problem. In fact, it may make it worse because then I have to worry about system administration, and Slack probably has more expertise on that. It's likely that they can fix this problem for all customers faster than I can fix my problem for myself. And it's not like I'm crippled when Slack is down. If it's urgent I can use the phone, and my todo list is stored outside Slack.
> It's likely that they can fix this problem for all customers faster than I can fix my problem for myself. And it's not like I'm crippled when Slack is down.
Well, you can probably infer the former from the dependency on the latter. You use these tools because they can reduce the scrambling when shit does hit the fan, not because they are necessary.
In a way there's a second single point of failure though, right? So many people use Slack to integrate all kinds of things, and rely on their interaction with those platforms through Slack, that if Slack goes down then productivity halts and it's totally out of your hands while Slack themselves try to resolve the issue.
- You don't get GitHub notifications on pull requests and comments, so things don't get reviewed and merged if developers aren't in the habit of checking the PR tab on GitHub itself.
- You don't get CI notifications so you won't know how your latest test run or deploy is going without going straight into the CI service itself. Even worse when there's a failure and you're too used to having Slack warn you about that.
- Your team might depend on Slack so much that they don't know how else to efficiently communicate, and the most efficient channel to communicate a fallback is not available or rarely checked (e.g. email, face to face). So you get a lot of chaos as people come up with dozens of alternatives.
This is just poor discipline more than anything, putting too many structural eggs into one basket, but it doesn't change the fact that Slack has created that dependency.
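One way to soften that dependency is to give the notification path itself a fallback. A rough sketch (the Slack incoming-webhook URL, SMTP host, and addresses are all placeholders) of a CI notifier that drops back to email when Slack is unreachable:

```python
import smtplib
from email.message import EmailMessage

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder
FALLBACK_SMTP_HOST = "smtp.example.internal"                           # hypothetical
FALLBACK_TO_ADDR = "dev-alerts@example.com"                            # hypothetical

def notify(text: str) -> None:
    """Post a CI/deploy notification to Slack, falling back to email if Slack is down."""
    try:
        resp = requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=5)
        resp.raise_for_status()
        return
    except requests.RequestException:
        pass  # Slack unreachable or erroring -- use the backup channel

    msg = EmailMessage()
    msg["Subject"] = "[CI] " + text[:80]
    msg["From"] = "ci-bot@example.com"
    msg["To"] = FALLBACK_TO_ADDR
    msg.set_content(text)
    with smtplib.SMTP(FALLBACK_SMTP_HOST) as smtp:
        smtp.send_message(msg)
```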
I think it's inexcusable for a chat program to go down in 2018.
* your HDD failed? Use RAID
* your power went out? Use a UPS
* your DNS went down? Use a fallback (slack2)
* your whole datacenter flooded? Good thing you have multiple replicated cloud instances that seamlessly take over
See, these are the issues that "the cloud" was supposed to solve. Not give us the same problems as before, just with a recurring bill for "chat as a service".
And inb4 "chill Mike, it's just a chat server, not life support firmware": yeah, but Slack is the most trivial software you can think of: send text from one computer to another. I see no reason this service can't be nearly as reliable as life support firmware in 2018. We've had over 30 years to get this right. Raise the fricking bar.
>Slack is the most trivial software you can think of
This is like saying that food service at 30k feet in a passenger airline is trivial because all the server has to do is walk up and down a narrow aisle handing out food from a cart.
Since "you see no reason this service can't be nearly as reliable as life support firmware", one of two things must be true:
1) You know something nobody else knows. In which case great, you've stumbled on a huge opportunity to go put your knowledge to work and get stupendously rich by outcompeting this "trivial" software company. Get to it, genius!
or
2) The reason you "see no reason..." is that you're unaware of one or more relevant facts.
3) Slack will get their "chat as a service" monthly fee whether the service actually works or not, so why commit to higher levels of service? We can get our users acclimated to outages and then sell them "Slack Premium, for Serious Business", charge an even higher fee, and get stupendously rich all over again. This is the "growth" that investors demand, no?
The dark truth is I suspect we're moving in the opposite direction. Abstraction layers designed with that "chill, it's just a %s app" mindset are making their way into safety critical applications.
Eventually somebody is going to die because their pacemaker decided to throw cycles at mining monero.
Slack is text, channels, images, video, sound, search, audio calls, video calls, screen share (and interface share), bots, myriad integrations, and more. Calling it just "send text from one computer to another" is wrong.
If trying to provide all the other things besides text causes the system to be unstable, then maybe those things shouldn't have been added. We need text. We just want the other things.
Let me add more reasons:
1) Human error in the software, where some error/exception snowballs into much larger issues that require a manual restore with service downtime.
2) Geo-distributed datacenters are a VERY expensive thing, so they're not implemented fully.
3) Bad system design, full of single points of failure.
That's bordering on (if not crossing into) ad-hominem.
There was no accusation of "so easy", only of "not so expensive" and supposedly (and previously, demonstrably) solved in the last 30 years.
They may well be "hard" or even "expensive" for some definition of those two words, but if they weren't, it would defeat much of the (stated/advertised) purpose of outsourcing/cloud.
You propose just buying servers in two locations to keep Slack's services up? That doesn't work when you need to store gigabytes daily and keep tens of thousands of reqs/sec synchronized.
A geo-distributed datacenter requires multiple direct low-latency multi-gigabit/sec links, special software to manage, test, and check it, and skilled devops.
Although I agree with your premise, I think the delivery takes away from your point a bit.
Specifically, you risk people piling on that rsync isn't good enough in the modern world and referencing the comment criticizing Dropbox as being little more than an rsync replacement [1].
Of course, the specific tool one uses is irrelevant. The data synchronization problem may not be well solved, but it has been very well studied, with a remarkable number of good-enough options.
So, no, there isn't just one "sync" button, as the parent comment snarkily suggested, but there may be two, one where you might lose the last N seconds of chat (perhaps temporarily) and another where you lose the ability to chat entirely for those N seconds.
[1] Although it had other criticisms, such as monetization, which are, naturally, ignored.
They very likely have all of these protections in place, and more. Large-scale outages of mature systems are almost always a cascade of small human errors that, each on their own, would have caused negligible damage. It's only when they happen to align with each other that a large disaster is realized.
I worked at an open source company where they hosted their own IRC server. There are OSS alternatives to Slack and I wonder if that company has tried to adopt any of them.
This all goes back to one basic fact: The Cloud is Someone Else's Computer(tm).
If your self-hosted Confluence or Jira is down, you can go walk over to your IT team and they'll be like, "Yeah, we know. We broke something. We're working on it." If you're using a hosted (a.k.a. "Cloud") solution, you're just kinda fucked. You can't even extract your data and try to run it locally while it's down (if that's even an option).
That's uptime-as-anecdote. Yes, you can throw your entire IT department at your outage instead of waiting on the vendor to fix it. How many of us work somewhere where the entire IT team is as large as the team that works on Slack's uptime?
Let's say the self hosted chat app does go down. Now someone has to fix it. Someone who probably has something better to do. In a cloud hosted solution, the person in charge of fixing your computer doesn't work for you.
My experience with self hosted solutions is that they go down way more often and take longer to fix than cloud solutions.
I'm not sure about production dependent, but I'd love to see how many other companies have longer/worse outages thanks to this. There are definitely a lot of people counting on Slack as a sole channel to push low-level error notifications, and I doubt most of them have an easy fallback option.
Reading all this thread made me realize that at my company (~50 people) we have a couple of Slack bots that control a number of things, deploys being one of them. Shrug.
To me it raises a concern: chatops and Slack integrations are /very/ common; it's a form of vendor lock-in on their side, and it makes absolute sense.
However, if you become dependent on chat-ops to do your job (say, fallbacks for common things have eroded due to lack of use), then suddenly your company is crippled. And why? For a chat service? The value add from Slack is grotesquely small in isolation.
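One cheap hedge is to keep the deploy logic in a plain script that the chat bot merely wraps, so chat-ops stays a convenience rather than the only path. A minimal sketch, with entirely hypothetical names and a placeholder deploy command:

```python
import argparse
import subprocess

def deploy(service: str, version: str) -> None:
    # The actual deploy step -- here just a placeholder shell script.
    subprocess.run(["./scripts/deploy.sh", service, version], check=True)

def handle_slack_command(text: str) -> str:
    """Called by the chat-ops bot, e.g. for a `/deploy api v1.2.3` command."""
    service, version = text.split()
    deploy(service, version)
    return f"Deployed {service} {version}"

if __name__ == "__main__":
    # Fallback path that does not depend on Slack at all:
    #   python deploy.py api v1.2.3
    parser = argparse.ArgumentParser(description="Deploy without chat-ops")
    parser.add_argument("service")
    parser.add_argument("version")
    args = parser.parse_args()
    deploy(args.service, args.version)
```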
I say this as someone who almost always prefers the dark theme wherever it is available: I wonder how much this desire for dark interfaces comes from almost every app interface having bright colors on white.
Somewhere along the shift to flat design, grays and non-bright colors have been ignored in the visual design of applications.
In civil engineering circles, it's known that a room which is too bright will cause eye strain and fatigue. There is an optimal level of light for the eyes to be most effective. But the computer makers and UI designers don't take this into account. Dark themes transmit less light to the eyes, causing less fatigue over time.
The trick is this: if you look into a bright light you can't see the rest of the room anymore. It's a kind of forced feeding. Not that the designers of our world are guilty of some sinister CONSPIRACY. They simply see awful-white as the only choice based on tests, or they are just imitating what they know.
Sure, it's worrying but worth it for me personally. I might go to jail due to this (seriously) but at least people won't die. For me that's the threshold.
Yes, that way we can beat them up for years to come based on whatever mistake they made. It would be even better if they told us which employee made the mistake so we can incessantly mock that employee openly and publicly every time Slack is ever mentioned on HN. When GitHub was purchased by Microsoft, Gitlab came up quite a bit and we got to rehash that whole database outage over again many times over those few days. It was sad.
If it were my company, I would say as little as humanly possible.
It's not about assigning blame, it's about sharing lessons learned with the broader community and being transparent and honest with paying customers about issues that may have significant impact on downstream productivity.
It’s not about assigning blame for the company writing the post-mortem. But it’s definitely about assigning blame for most people reading the post-mortem. Very few people read post-mortems for the sake of learning how to be better at release engineering and ops.
If I pay for your service, and you are transparent about mistakes and flaws, I will be more forgiving about mistakes and flaws in the future, and appreciate the work you do to fix them.
If I pay for your service, and the only communication is, "We know there is a problem, and we'll let you know when it's fixed", I may assume you are not equipped to thoroughly explain the problem, and therefore not well equipped to solve it.
The blame is already assigned. The users already know there is a problem. A post-mortem likely has a positive effect for the readers attitude toward the handling of the issue.
It’s more the people who don’t pay for the service, but might, that are quickest to see post-mortems in a negative light. The only reason they have for reading them is looking for justifications for culling the product/service from the list of contenders for when they ever have to evaluate solutions in that category.
In other words: post-mortems are good PR, but incredibly bad advertising.
And a world-wide outage followed by "we fixed it and trust us it won't happen again" is going to filter any service off of my list more so than "we had a single point of failure running in our CTO's basement and his cleaning lady pulled the plug. Trust us it won't happen again."
I entirely understand what you are saying, believe me I do. But that is not the way some communities take it. We still see messages like "You could move to Gitlab but... you know they dropped their production database a couple of years back? Use them at your own risk!"
We learned a lot from the Gitlab outage. It was a simple mistake and not one they will have again, yet people still beat them up for it. I'm not sure the value is there for the company to be super open about their outages and issues.
Perhaps - but would you even remember it, without the juicy details of what happened? I probably would forget if some service had a few hours downtime a year or two ago, if I didn't know any details to make it stand out from other outages.
Wouldn't they have gotten beaten up over the outage even more had they not offered an explanation?
In my experience, customers are often seeking an explanation/post-mortem because their customers are seeking an explanation. If an upstream service goes down for an extended period of time and all you can do is go back to your customers and say, "Your system was down because our provider's system went down for 4 hours. But they won't tell us why.", that's not going to go over well.
Gitlab's response to the database mistake was a large contributing factor in my decision to move all of my repositories onto their service.
Anecdotal, sure, but people like me exist. I don't know if we're in the majority. You'd have to measure somehow and do a cost-benefit analysis I guess.
As usual people are taking a comment and twisting it any old way they'd like. Which is fine, that's why we have these communications. To start off, no I am not in aviation. I have run quite a few companies and development departments.
I am not suggesting Slack or anyone else should not communicate at all when they have an outage. A public postmortem, which many people are asking for, is one method. Is it the most effective method? I doubt it. Many people are suggesting that as paying customers they would like to know what happened. Does a public postmortem tell the paying customer what happened in an effective way? Maybe, but maybe not.
When I am running a company I care very much what my paying customers think and are feeling about my service. I will communicate issues directly to them. Do I need to explain to the rest of the world in some great technical detail what happened during an incident? Absolutely not. Do I need to have the first post in Google about my company be an outage postmortem? Of course not. I need my PAYING customers to be pleased with the service I offer and to understand how I will mitigate the damage I have done to them. To me, that's a basic principle of business. I don't have to explain to everyone. I owe everything to my paying customers. Gitlab did a postmortem almost immediately after a major outage and some people tried to slaughter them with the information they shared. It was sad and unfortunate. Their openness was met with some horrible results from the community.
Also, I use Slack. My company uses it for everything, including ChatOps for my production environment deployment. We have a hundred or so active users. The outage this morning harmed us. But you know what? I don't pay for Slack. I owe a lot to Slack, but they don't owe me anything. I can't blame them for my problems this morning. They are a free service to me. I appreciate that their absolutely free service serves my company so well almost all of the time.
Excellent! If you somehow read my entire message and got out of it that Slack shouldn't give you detail about the outage this morning, then I somehow did not convey how important it is to explain issues and resolutions to paying customers. I hope you get a full breakdown and understand exactly how they will keep you from having this sort of outage again. If they don't, then it becomes a value issue to decide whether you should move to another system.
My point is only that it does not have to be a large public explanation. You, or the decision maker at your company, who pays a substantial sum of money to Slack for their service, should get an explanation until you are satisfied.
Maybe unrelated, but my AWS-hosted websockets-using app had an outage starting at the same time. Also a third-party API provider we use for handling inbound phone calls. So this smells like a wider outage than just Slack.
When I was in Moscow a few weeks back, Slack wouldn't work. Exact same behaviour: it loaded up the GUI, loaded up previous conversations, but then wouldn't work past there.
Russia blocks a lot of AWS IPs; when I did a full VPN out to a server in Germany, Slack came good.
That's interesting. More speculation: they haven't given any detail in 2 hours, perhaps if it's an upstream/3rd-party problem, they haven't been given any info.
I know it's not exactly scientific, but the front page of https://downdetector.com shows a number of services that have problem spikes starting anywhere from 3am US/Eastern to 9am US/Eastern and continuing through now (11:24 US/Eastern): Google Home, Fortnite, Exede, Level 3, New York Times, AWS. Maybe totally unrelated to each other, who knows.
I'm wondering the same thing. I chose this morning to soft-launch my side-project/startup and sent out the sign-up link to my e-mail list. Of course, it's AWS Cognito-based, was working yesterday, but failed for the new users. Great timing! Phone support said they are looking into some outages (even though the status page is all green).
Maybe I'm reading too much into it, but "We've received word that all workspaces are having troubles connecting to Slack." makes it sound like their internal monitoring didn't catch whatever is causing this. I was personally experiencing issues for about 20-30 minutes before the status update was posted.
Pretty much every time there's a Slack outage it takes them a solid 20 minutes to update their status page. Several times I've emailed them 10 minutes into an outage (following "nobody at the office can reach Slack, but their status page says smooth sailing, we should do more diagnostics in case it's office internet or something..."), then gotten a response 10 minutes later to the tune of "we're aware, we just updated our status page, go look at that". I think they consider updating their status page a PR problem, so they avoid it if the issue can be fixed in under X minutes.
Which also makes their uptime totals completely bogus.
It's interesting to me that the update messages are posted every 30 minutes from the first notification until resolution. Judging by this and every other outage, I assume this is automatic, probably implemented to appease the people frustrated by the outage.
We use Slack for everyday company wide communication/ announcements and Riot for encrypted secure communications (you can host Riot yourself): https://about.riot.im/
It's not about the protocols, it's about having a client with a user experience that is acceptable to an entire company rather than just a team of engineers. Which decentralized protocol has such a client? (Speaking as someone who got burned trying to advocate for IRC at a company that eventually and inevitably switched to Slack.)
The multitude of clients is one of the problems! How do you find them? Which one do you use? What features matter? Nobody knows! They just want a product with chat rooms and don't understand why it seems so hard to do seemingly simple stuff like create an account or search for that link that someone posted a month ago.
Technical people who haven't used IRC can barely figure out IRC their first time using it. Trying to sell IRC to a company would be hilarious. Bob in Accounting getting on IRC and feeling comfortable with its UX?
A hypothesis I like is that when it's an application you use to communicate with other people, people are a lot less tolerant of a confusing UX.
The reason is that when you sit there clicking through a bunch of menus to find something in QuickBooks (or a typical atrocious enterprise app), nobody sees you; and if you screw something up there, you spend some more time fixing it and nobody sees the screwup. Frustrating maybe, as you waste time, but almost everyone has some frustrating wastes of time at work.
If you're on IRC and people are talking at you and you sit there fumbling to figure out how to respond, it's like you're in a conversation and tongue-tied and everyone's looking at you. And if you screw something up, like send a message to the wrong channel... now you've done it in front of all your coworkers, in real time. Humans hate looking stupid in front of the group.
And if you screw something up on IRC in front of your coworkers, and you're someone with even a little anxiety about not being tech-savvy... that's going to flare right up.
Also, because now you're embarrassed, you're going to want something to blame. So you blame the tool.
Yes. Also, QuickBooks is accounting, which is supposed to be hard while "chatting" with people is supposed to be easy.
QuickBooks doesn't have to suffer in comparison to better UX performing similar tasks in people's personal lives while IRC can be compared (unfavorably) to texting apps, Facebook Messenger, Twitter, AIM once upon a time, etc.
mIRC offers a fairly good UX compared to all those.
If you're setting it up in a corporate environment, just change the ini files so it autoconnects to your server. It'll pop up a list of channels they can join. The server can SAJoin them to particular channels on connection too. The UI is very clean and lightweight: a channel scrolls messages and they appear, there's an input bar at the bottom, and there's a list of users on the side. It's written in MFC and Win32 APIs, so it's blazingly fast compared to most applications, and you can find a version that will run on every computer made in the past 25 years.
The United States military used mIRC extensively for battlefield coordination. I think it's up to the task of handling Bob from accounting.
An image search for mIRC shows that it is ugly as shit. It has a sidebar to list channels but the current channel window is still an undifferentiated mess of handles, commands, and actual conversation. Stored communication is mainly a server-side problem but I don't know if mIRC has an interface to show DMs you missed while offline or to indicate which part of a channel's conversation happened since you last looked.
Even if mIRC would suffice for Windows, you've not handled Macs, phones, etc. Who gives a shit if it runs on a 25 year old computer?
The US military has produced some specific examples of good design but isn't known for highly valuing usability, let alone whether someone would enjoy using a tool. IRC is very functional and mIRC appears to add a little polish beyond a pure command-line interface, but those are bare minimums and not good enough.
> An image search for mIRC shows that it is ugly as shit.
ok
> It has a sidebar to list channels but the current channel window is still an undifferentiated mess of handles, commands, and actual conversation.
No.
Each channel and private message gets its own MDI window you are free to minimize, maximize, or lay out however you want.
Notifications are turned on by default, but they can be disabled. You'll get a tray notification if mIRC is minimized, and the title bar of the window will flash. Notifications happen when your nick is mentioned.
There's a horizontal line that goes across the dialog window that indicates the location of the conversation the last time it was focused.
>Even if mIRC would suffice for Windows, you've not handled Macs, phones, etc. Who gives a shit if it runs on a 25 year old computer?
Other clients work on other platforms. mIRC is just what I brought up since it's a desktop Windows client and that's the most common case for an office environment.
> The US military has produced some specific examples of good design but isn't known for highly valuing usability, let alone whether someone would enjoy using a tool. IRC is very functional and mIRC appears to add a little polish beyond a pure command-line interface, but those are bare minimums and not good enough.
It's a simple, light-weight way for people to send short text messages in near real time with tens of thousands of people. I think that's good enough, and it works at a scale that far surpasses the SaaS chat options.
I don't think it's a problem if something needs to be initially deployed and configured by an IT department (or otherwise tech savvy individual or group), as long as its onboarding and primary usage flows are straightforward. An arbitrary non-tech-savvy but internet-familiar employee needs to be able to create an account, browse and join rooms, and search through history without any hand-holding. Slack and its direct competitors pass this test. IRC doesn't. Does Spark?
It's certainly the closest of all XMPP clients I've used, since it has a very friendly interface. Their related Openfire XMPP server is also targeted at internal deployments and is very easy to configure with a web UI.
Their site (https://matrix.org) reeks of hype-oriented engineering. From the most cursory overview of their home page, their decentralization looks a lot like IRC peering.
If you are interested, we are building a communication platform for communities fully based on XMPP: https://movim.eu/ :) It can easily be deployed on a web server.
Despite having a vote increment velocity far exceeding other items, a publish time of only 25 minutes ago, and more points, this item just dropped from #5 to #7 on the front page.
How’s that work exactly?
Edit: It's now dropped to #14 even with the comment count also rapidly increasing.
> I really don't understand these types of questions. The possible answers range from "because the ranking works that way" to "someone with privileges wanted it that way". On either end of the spectrum, the real question remains: so what? What difference does it make why a particular post is in a particular position? If the title seems interesting, you click on it. If not, you move on.
> I don't mean to question you in particular. It just seems like such a trivial concern to me that I truly can't understand why someone might possibly care.
Eh, IRC networks split and individual servers went down all the time. But yes, there rarely was a complete EFNet outage even if sometimes there were 2 versions of the same channel going at once.
That being said, although I like some of Slack's fancy features, I do wish a distributed alternative could catch on.
Native emoji support, aesthetically pleasing front-ends, and clear product direction are some of the main positives I see, even if the combination of PHP on the backend and Electron on the frontend isn't the most sophisticated technical stack in history.
I prefer decentralized and open things, but a cohesive vision can sometimes provide a better user experience across a more restricted set of functionality than an army of hackers, each solving their own problems.
Have you tried IRCCloud? Their web based front-end is as nice as Slack's but it still works with decentralized IRC servers. They also manage the client's state (unread messages) better than regular IRC bouncers.
Emoji seem to work just fine on IRC nowadays, what do you mean by "native" support? The shortcodes? The fact that there's official clients you can entirely rely on supporting it?
Native emoji support, pretty front-ends, and clear product direction are possibilities on top of IRC (or XMPP), since their absence isn't a core part of IRC (or XMPP) -- it's just not a good way to make a profit if you don't lock down the network and act as the gatekeeper of the interface. Slack's API is fairly open though, and it's not a huge hurdle to interact with it. I built an IRC<->Slack gateway that bridges the differences fairly well ( https://slack.tcl-lang.org/ , you know, if Slack were working).
Small ircds that you would run for a single team don't split, because there's only a single server.
Large networks can have the servers go up and down, and it's still not a big deal because of redundancy. DNS round-robin entries mean you don't even have to know the other servers on the network.
In 2018 netsplits caused by down links are fairly rare. If you wait six months you might see one.
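For what it's worth, the failover the parent describes is simple enough to sketch: a client resolves the round-robin name and just tries each address until one answers. Roughly (hypothetical hostname, plain sockets only):

```python
import socket

IRC_HOST = "irc.example.net"  # hypothetical round-robin name with several A records
IRC_PORT = 6667

def connect_to_irc() -> socket.socket:
    """Try each address behind the round-robin name until one answers.

    This is roughly what IRC clients do: if one server in the pool is down,
    you just land on the next one instead of seeing an outage.
    """
    last_error = None
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
        IRC_HOST, IRC_PORT, type=socket.SOCK_STREAM
    ):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(5)
            sock.connect(sockaddr)
            return sock
        except OSError as exc:
            last_error = exc
            continue
    raise ConnectionError(f"no server behind {IRC_HOST} answered") from last_error
```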
And if you run a small single-point ircd, at some point the server or its internet connection will fail, and you're in the same position as when Slack fails.
There’s nothing that gets around technical failure. Either you have a single server that’s going to die at some point due to sheer entropy, or you have a somewhat complex distributed system with the tradeoffs you desire that might fail anyway.
The downtime would be for a network connection failure and not because your 'fearless' NoSQL container didn't work as expected. If a transient networking problem like this is a big deal for you, you can easily add either more nodes or move the node to a place with more reliable networking.
Or because the IRCd written in 90s-style C++ by some people who honestly don’t know what they’re doing segfaulted, or because you accidentally K-lined 0.0.0.0/32, or because you accidentally filled up the disk with logs because the server’s maintainer was fired and nobody remembers how the system works, or the latest system update borked something, or a failure to update the system allowed someone to attack your network, or the really hacky mechanism you use to enable auth against Active Directory broke or allowed a disabled user to log in, or...
There’s a lot more that can go wrong than that a database falls over. In my experience, IRC servers fall over all the time - it’s just that nobody really cares because their clients just connect to the next server in the list and people resume their conversations a minute later after figuring out what messages actually reached their destination.
Paying IRCCloud to manage an IRC server for you is a reasonable option, but I wouldn’t do it because I think it’s going to be more available, but because I like IRC and believe it provides the functionality I need.
> Or because the IRCd written in 90s-style C++ by some people who honestly don’t know what they’re doing segfaulted,
Don't use a 20 year old ircd then. Use something like ratbox or InspIRCd.
> or because you accidentally K-lined 0.0.0.0/32 or because you accidentally filled up the disk with logs because the server’s maintainer was fired and nobody remembers how the system works, or the latest system update borked something
I believe it's an effort to show a netsplit in the traditional form (server1, server2) without placing blame on a particular server.
Back when I IRCd regularly (and perhaps this is still the case today), certain servers would get a reputation for splitting more than others, and I think this network (and/or its ircd) decided to mask it without breaking the general format.
The power of centralization! If I can't have it, you can't either!
I wouldn't say it's fragile, though. Just like normal IT work, people only pay attention when it isn't working.
Given how much more robust Slack is than IRC as far as features go, it's probably not fragile. The closer a piece of software is to the network layer, the more stable it tends to be, just due to the internet's robustness.
- nope but maybe at some point a yep
- still nope
- nope
- nope
... I know many companies don't like to give details in the heat of the moment (and the engineers that understand are likely working on it), so I really do hope they give us a good retro after it's all over.
Downtimes happen, I get it, but this one has lasted 3+ hours already. I can't even remember the previous time such a large service was down for so long.
Seems to be fixed... with zero info on their status page about what went wrong or otherwise.
>We're happy to report that workspaces should be able to connect again, as we've isolated the problem. Some folks may need to refresh (Ctrl + R or Cmd + R). If you're still experiencing issues, please drop us a line
Hilariously, their "uptime in the last 30 days" still shows 100%.
While I appreciate the timely status updates, it almost seems like Slack has built a random status update bot to post updates that don't say anything exactly every 30 minutes.
Its pitch is a little different: it gives you 'Workspaces' which are somewhat connected, and tools to manage big deployments.
IBM, Oracle and many large companies use it because 100,000+ participants in one workspace is quite unmanageable.
Think channel namespacing whilst unifying user provisioning and enabling DM and MPDM across the entire company. Users can have access to one or many namespaces; they sign in once and it populates all enabled workspaces into that user's client.
You can share channels between workspaces within Enterprise Grid fairly trivially (although this now works between Slack tenancies owned by different companies too!)
Still runs on the same infrastructure in AWS as other Slack customers though.
From a policy perspective you can push down settings to all Workspaces in your SEG, and define whether you “centrally control” or “delegate to Workspace owners” on a setting by setting basis.
They have too many 'decentralized', i.e. blockchain, things on their landing page for my liking. However, since blockchain 'technologies' are the wonder kool-aid for everything, and given that messaging 'apps' are trivial compared to rocket surgery, how come there isn't a messaging app that is decentralised with these wonder technologies, where it only costs you a few cryptokitties to get your messages and where you earn a few dogecoin to forward on other people's messages?
If your org happens to be part of the Microsoft Office 365 ecosystem, there's Microsoft Teams. All of the products support bots, GIFs, and emojis. I personally think Google Chat and Stride are much faster than Slack, too. I haven't tried Microsoft Teams yet.
Matrix (https://matrix.org) is a good alternative to both Slack and Discord. The most complete client implementation is Riot (https://riot.im/).
The protocol itself is federated, so you can communicate with other Matrix users from your self-hosted instance. There are also bridges to IRC, XMPP, even Slack..
And HipChat is horrible. It doesn't even sync across multiple devices, and their mobile app doesn't support the iPhone X screen size, which is a trivial update to make considering HipChat is used by some pretty massive customers. Code highlighting is still (after years) pretty bad.
This is why I like working on iOS apps: if all hell breaks loose I can't do anything. When something like Slack goes belly up, imagine those poor folks having to respond.
There are so many open source solutions, Mattermost, Rocketchat, etc, why are companies willing to pay $100k/yr for Slack? What was the defining feature that others didn't have? Even Discord feels like it has far more features than Slack.
Name recognition, employee familiarity (I've used Slack at every job I've worked since early 2015, I pretty much know what to expect from it always), and punting maintenance costs (this is probably the biggest factor).
I love IRC and XMPP. I'd love to run one of those, or some new service (Matrix?), at work. However, my time is arguably better spent doing anything _other_ than maintaining such services, and the same goes for most engineers at most companies, sadly.
Side factor: the mobile clients for IRC and XMPP almost universally suck, at least on Android. I imagine if those problems had been solved in a reliable way, more companies may consider them (assuming the allocation of engineering resources problem isn't a problem).
There may not be one defining feature. For some it may be look and feel/attention to detail. It could be the number of well supported Slack integrations, or various enterprise features wrt message retention and deletion, SSO and auditing.
That's a lot of eggs in their basket. A major downside of the current way a lot of these companies work is that there's a huge incentive for them to never allow customers to self-host their product.
Reasons my self-hosted servers have gone down in the past year:
- Scheduled electrical maintenance that facilities manager failed to disclose (even though they knew about it for weeks).
- Emergency power-down because two of the four air conditioners failed at the same time.
- Someone accidentally powered off the VM.
I'd much rather have an hour long outage here and there than incur the cost of defending against these circumstances (and still have it go down for some new unforeseen reason).
How is that self-hosting when you don't control the hypervisor in this case?
It usually implies that you at least have some sort of control: either having a real server somewhere (with a UPS and stuff) or at home, where you know when the power is out.
While what you are doing is technically self-hosting, I would have changed the VM provider after the first incident like the ones you described.
You just gave a good argument for why s/he should use Slack. Yeah, s/he might be doing it wrong, but so what? One should focus on core business, not system administration.
Yup, every time one of the popular centralized XaaS platforms go down, there's always the snarky "Heh, well my stuff is self-hosted...", and they are always the types that have no idea how to value their time.
I don't see it as too much of a problem so long as you're not one of those teams that orchestrates their deployments using a Slack bot. Just make sure you have an agreed backup mechanism, e.g. Google Hangouts, Zoom, etc.
If only there were a venerable, decentralized instant messaging system we could use, perhaps some kind of internet relay chat system...
</sarcasm>
You reap what you sow. Depending on Slack for your communications is a bad idea. I can't even remember the last time any of the IRC networks I frequent had a total outage.
IRC left out federation when the spec was crafted in 1984. It was deemed too bandwidth heavy.
Jabber/XMPP was a good step in the right direction. Too bad it overused plugins, bad extensions, and XML abuse. It would have been loads better had they thought far enough in advance.
It's not federated, but it is distributed and fault tolerant. The protocol is open and widely implemented and the implementations are mature and stable.
I mean, Slack isn't federated either. I don't know of any chat platforms that are federated except Jabber. Edit: Gchat and AIM federated in 2011, but AIM is dead now, so...
In practice, all the attempts I've seen at getting a significant number of non-techies in a company to use IRC have failed. At one company, we almost got everyone on Jabber, but it was never used much outside the tech circles. Slack? Everyone is using it and most seem happy about it, and it does a lot more than IRC.
https://news.ycombinator.com/item?id=16108912 - 5 months ago (longer discussion)
https://news.ycombinator.com/item?id=15597461 - 7 months ago
https://news.ycombinator.com/item?id=15597431 - 8 months ago
https://news.ycombinator.com/item?id=13811815 - 1 year ago
https://news.ycombinator.com/item?id=10616743 - 3 years ago