In light of how Slack and other companies haven't been able to get a decent level of uptime, I have to say, the company known for making huge web applications that don't embarrassingly go down every couple of months is probably Google. I can't remember the last time Gmail was down. It just works! If Google is down, it's probably your internet that's down.
Their expertise and discipline in distributed applications is unrivaled. I'm guessing it's because they have datacenters everywhere with huge fat pipes in between, and their SREs are probably top notch and don't take shortcuts.
Google gets a whole bunch of things wrong at times, but some things, I gotta say, they've nailed.
Google is expert at designing services so that you won't notice when there is downtime.
Take Google Search, for example. When there is downtime, results might be slightly less accurate, the parcel tracking box might not appear, or the page won't show the "last visited" time beside search results.
The SREs are running around fixing whatever subsystem is down or broken, but you, the user, probably don't notice.
The reality is this is how you design highly available systems, and it is also imo one of the reasons microservices have gained so much popularity.
Driving features with microservices makes it easier to isolate their failure and just fall back to not having that feature. The trade-off is that monoliths are generally easier to work with when the product and team are small, and failure scenarios with distributed systems are often much more complex.
An analogy to your Google failure examples for Slack might be something like the "somebody is typing" feature failing for some reason. In an SoA you would expect it to just stop working without breaking anything else, but one could easily imagine a monolith where it causes a cascading failure and takes the whole app down. Most services have countless dependencies like this.
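To make that concrete, here's a minimal sketch (Python, with a made-up typing-indicator endpoint) of the kind of graceful degradation being described: the caller times out quickly and falls back to "nobody is typing" instead of letting the failure cascade.

```python
import requests

# Hypothetical internal endpoint -- not a real service.
TYPING_SERVICE_URL = "https://typing-indicator.internal/api/typing"

def who_is_typing(channel_id: str) -> list:
    """Return the users currently typing in a channel.

    If the typing-indicator service is slow or down, degrade gracefully:
    the rest of the app keeps working and the feature simply disappears.
    """
    try:
        resp = requests.get(
            TYPING_SERVICE_URL,
            params={"channel": channel_id},
            timeout=0.2,  # tight timeout: this feature isn't worth waiting for
        )
        resp.raise_for_status()
        return resp.json().get("users", [])
    except requests.RequestException:
        # Service unavailable -> show "nobody is typing" rather than letting
        # the failure propagate and take the whole page down with it.
        return []
```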
Funny you should mention Google, as something is down over there right now. Lots of reports of Chromecasts being dead right now; I'm assuming something at Google is down which is causing this.
Oh, interesting. Thanks for pointing this out. I was having Chromecast trouble this morning and didn't even think to check if it was a widespread issue.
Gmail is one use case, and it was one of the original services from Google, so it has one of the longest "bake" times with regards to knowing how to keep it online.
Every service/team has to go through a period of growing pains as they learn, improve, and fix the code to be more stable. You can't just take the learnings from one service and apply them to another; it has to be architected and written into the code, and most teams start each new project/service with fresh code.
They recently pushed out an iOS update for Messenger that crashed to springboard any time you tried to resume it from background. It took a couple of hours to get a new build up, plus however long for affected users to all install the new version.
I'd love to hear the story of how that made it through testing.
Sorry, should have just said "home screen" for clarity, but SpringBoard is the iOS application that makes the home screen. It's akin to Finder.
A fresh launch of Messenger worked until you switched out and put it in the background. When you tried to resume it (either from home icon or task switcher) it would immediately die and could be launched fresh on the second try.
Basically every time you wanted to use it you either had to kill it in the app switcher and then launch it, or launch it twice.
My favorite part is that since Facebook doesn't do useful release notes (best guess because they're testing different features on different users and changes never actually land for everyone in a specific version), all the App Store said for the busted version was "We update the app regularly to make it better for you!" Oooops.
Though that's an interesting thought, I wonder if a feature had rolled out to a subset of users and it was crashing because it tried to pull some piece of account info that doesn't exist on accounts without it? Testing still should have caught that, but if the test accounts were all testing the new feature I could see it sneaking through. From my end it looked like a 100% reproducible crash on resume which is pretty sad to release.
It's the same for all sites beyond a certain size. It's never fully up. It's very rarely fully down. It's gradually degraded in ways that you hopefully don't see, but sometimes do. Or maybe you don't see it, but others do. etc etc etc. Availability isn't boolean once you have users.
> Google has found IRC to be a huge boon in incident response. IRC is very reliable and can be used as a log of communications about this event, and such a record is invaluable in keeping detailed state changes in mind. We’ve also written bots that log incident-related traffic (which is helpful for postmortem analysis), and other bots that log events such as alerts to the channel. IRC is also a convenient medium over which geographically distributed teams can coordinate.
How many SREs does Google have on said IRC system?
How many SREs are at Slack, working on keeping their systems up?
Finally, how many SREs could your company dedicate to keeping an internal IRC server up, and supporting it as an internal product?
I can throw ircd on a server, no problem, but there's a little bit more to six nines of uptime than `apt-get install`. The decision whether or not to use IRC should keep in mind Google's resources (in number of people, number of data centers, and amount of money to throw at redundant hardware) for making sure it never goes down, especially when the data center is on fire around you.
Yeah, Stripe does the same thing with their status page. I get alerts that they have an outage at least once a week, and more often than not it never shows up as anything in their history. Honestly, this is my only significant beef with the service, and I've been using it for years now with multiple integrations.
You know how much of the community uses one messaging system when, 15 minutes after it goes down, it has over 40 points on the front page!
This says a lot about how it's a single point of failure in modern company comms.
It's even worrying to think about how some users probably have production-dependent (dare I postulate it) workflows in Slack that get crippled by its outage...
ITT: Chat about decentralisation that will ultimately lead to no action.*
*Because we've had this discussion so many times before...
Yes it's a single point of failure, but so what? I don't particularly care whether other organizations fail at the same time as I do, I just care whether I fail. Hosting my own chat system does not solve that problem. In fact, it may make it worse because then I have to worry about system administration, and Slack probably has more expertise on that. It's likely that they can fix this problem for all customers faster than I can fix my problem for myself. And it's not like I'm crippled when Slack is down. If it's urgent I can use the phone, and my todo list is stored outside Slack.
> It's likely that they can fix this problem for all customers faster than I can fix my problem for myself. And it's not like I'm crippled when Slack is down.
Well, you can probably infer the former from the dependency on the latter. You use these tools because they can reduce the scrambling when shit does hit the fan, not because they are necessary.
In a way there's a second single point of failure though, right? So many people use Slack to integrate all kinds of things, and rely on their interaction with those platforms through Slack, that if Slack goes down then productivity halts and it's totally out of your hands while Slack themselves try to resolve the issue.
- You don't get GitHub notifications on pull requests and comments, so things don't get reviewed and merged if developers aren't in the habit of checking the PR tab on GitHub itself.
- You don't get CI notifications so you won't know how your latest test run or deploy is going without going straight into the CI service itself. Even worse when there's a failure and you're too used to having Slack warn you about that.
- Your team might depend on Slack so much that they don't know how else to efficiently communicate, and the most efficient channel to communicate a fallback is not available or rarely checked (e.g. email, face to face). So you get a lot of chaos as people come up with dozens of alternatives.
This is just poor discipline more than anything, putting too many structural eggs into one basket, but it doesn't change the fact that Slack has created that dependency.
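One way to soften that dependency is to give the notification path itself a fallback. A rough sketch (the Slack incoming-webhook URL, SMTP host, and addresses are all placeholders) of a CI notifier that drops back to email when Slack is unreachable:

```python
import smtplib
from email.message import EmailMessage

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder
FALLBACK_SMTP_HOST = "smtp.example.internal"                           # hypothetical
FALLBACK_TO_ADDR = "dev-alerts@example.com"                            # hypothetical

def notify(text: str) -> None:
    """Post a CI/deploy notification to Slack, falling back to email if Slack is down."""
    try:
        resp = requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=5)
        resp.raise_for_status()
        return
    except requests.RequestException:
        pass  # Slack unreachable or erroring -- use the backup channel

    msg = EmailMessage()
    msg["Subject"] = "[CI] " + text[:80]
    msg["From"] = "ci-bot@example.com"
    msg["To"] = FALLBACK_TO_ADDR
    msg.set_content(text)
    with smtplib.SMTP(FALLBACK_SMTP_HOST) as smtp:
        smtp.send_message(msg)
```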
I think it's inexcusable for a chat program to go down in 2018.
* your HDD failed? Use RAID
* your power went out? Use a UPS
* your DNS went down? Use a fallback (slack2)
* your whole datacenter flooded? Good thing you have multiple replicated cloud instances that seamlessly take over
See, these are the issues that "the cloud" was supposed to solve. Not give us the same problems as before, just with a recurring bill for "chat as a service".
And inb4 "chill Mike, it's just a chat server, not life support firmware": yeah, but Slack is the most trivial software you can think of: send text from one computer to another. I see no reason this service can't be nearly as reliable as life support firmware in 2018. We've had over 30 years to get this right. Raise the fricking bar.
>Slack is the most trivial software you can think of
This is like saying that food service at 30k feet in a passenger airline is trivial because all the server has to do is walk up and down a narrow aisle handing out food from a cart.
Since "you see no reason this service can't be nearly as reliable as life support firmware", one of two things must be true:
1) You know something nobody else knows. In which case great, you've stumbled on a huge opportunity to go put your knowledge to work and get stupendously rich by outcompeting this "trivial" software company. Get to it, genius!
or
2) The reason you "see no reason..." is that you're unaware of one or more relevant facts.
3) Slack will get their "chat as a service" monthly fee whether the service actually works or not, so why commit to higher levels of service? We can get our users acclimated to outages and then sell them "Slack Premium, for Serious Business", charge an even higher fee, and get stupendously rich all over again. This is the "growth" that investors demand, no?
The dark truth is I suspect we're moving in the opposite direction. Abstraction layers designed with that "chill, it's just a %s app" mindset are making their way into safety critical applications.
Eventually somebody is going to die because their pacemaker decided to throw cycles at mining monero.
Slack is text, channels, images, video, sound, search, audio calls, video calls, screen share (and interface share), bots, myriad integrations, and more. Calling it just "send text from one computer to another" is wrong.
If trying to provide all the other things besides text causes the system to be unstable, then maybe those things shouldn't have been added. We need text. We just want the other things.
Let me add more reasons:
1) Human error in the software, where some error/exception snowballs into much larger issues that require a manual restore with service downtime.
2) Geo-distributed datacenters are a VERY expensive thing, so they're not implemented fully.
3) Bad system design, full of single points of failure.
That's bordering on (if not crossing into) ad-hominem.
There was no accusation of "so easy", only of "not so expensive" and supposedly (and previously, demonstrably) solved in the last 30 years.
They may well be "hard" or even "expensive" for some definition of those two words, but if they weren't, it would defeat much of the (stated/advertised) purpose of outsourcing/cloud.
You propose just buying servers in two locations to keep Slack's services up? That doesn't work when you need to store gigabytes daily and keep tens of thousands of reqs/sec synchronized.
A geo-distributed datacenter requires multiple direct low-latency multi-gigabit/sec links, special software to manage, test, and check it, and skilled devops.
Although I agree with your premise, I think the delivery takes away from your point a bit.
Specifically, you risk people piling on that rsync isn't good enough in the modern world and referencing the comment criticizing Dropbox as being little more than an rsync replacement [1].
Of course, the specific tool one uses is irrelevant. The data synchronization problem may not be well solved, but it has been very well studied, with a remarkable number of good-enough options.
So, no, there isn't just one "sync" button, as the parent comment snarkily suggested, but there may be two, one where you might lose the last N seconds of chat (perhaps temporarily) and another where you lose the ability to chat entirely for those N seconds.
[1] Although it had other criticisms, such as monetization, which are, naturally, ignored.
They very likely have all of these protections in place, and more. Large-scale outages of mature systems are almost always a cascade of small human errors that, each on their own, would have caused negligible damage. It's only when they happen to align with each other that a large disaster is realized.
I worked at an open source company where they hosted their own IRC server. There are OSS alternatives to Slack and I wonder if that company has tried to adopt any of them.
This all goes back to one basic fact: The Cloud is Someone Else's Computer(tm).
If your self-hosted Confluence or Jira is down, you can go walk over to your IT team and they'll be like, "Yeah, we know. We broke something. We're working on it." If you're using a hosted (a.k.a. "Cloud") solution, you're just kinda fucked. You can't even extract your data and try to run it locally while it's down (if that's even an option).
That's uptime-as-anecdote. Yes, you can throw your entire IT department at your outage instead of waiting on the vendor to fix it. How many of us work somewhere where the entire IT team is as large as the team that works on Slack's uptime?
Let's say the self hosted chat app does go down. Now someone has to fix it. Someone who probably has something better to do. In a cloud hosted solution, the person in charge of fixing your computer doesn't work for you.
My experience with self hosted solutions is that they go down way more often and take longer to fix than cloud solutions.
I'm not sure about production dependent, but I'd love to see how many other companies have longer/worse outages thanks to this. There are definitely a lot of people counting on Slack as a sole channel to push low-level error notifications, and I doubt most of them have an easy fallback option.
Reading all this thread made me realize that at my company (~50 people) we have a couple of Slack bots that control a number of things, deploys being one of them. Shrug.
To me it raises a concern: chatops and Slack integrations are /very/ common; it's a form of vendor lock-in on their side, and it makes absolute sense.
However, if you become dependent on chat-ops to do your job (say, fallbacks for common things have eroded due to lack of use), then suddenly your company is crippled. And why? For a chat service? The value add from Slack is grotesquely small in isolation.
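One cheap hedge is to keep the deploy logic in a plain script that the chat bot merely wraps, so chat-ops stays a convenience rather than the only path. A minimal sketch, with entirely hypothetical names and a placeholder deploy command:

```python
import argparse
import subprocess

def deploy(service: str, version: str) -> None:
    # The actual deploy step -- here just a placeholder shell script.
    subprocess.run(["./scripts/deploy.sh", service, version], check=True)

def handle_slack_command(text: str) -> str:
    """Called by the chat-ops bot, e.g. for a `/deploy api v1.2.3` command."""
    service, version = text.split()
    deploy(service, version)
    return f"Deployed {service} {version}"

if __name__ == "__main__":
    # Fallback path that does not depend on Slack at all:
    #   python deploy.py api v1.2.3
    parser = argparse.ArgumentParser(description="Deploy without chat-ops")
    parser.add_argument("service")
    parser.add_argument("version")
    args = parser.parse_args()
    deploy(args.service, args.version)
```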
I say this as someone who almost always prefers the dark theme wherever it is available: I wonder how much this desire for dark interfaces comes from almost every app interface having bright colors on white.
Somewhere along the shift to flat design, grays and non-bright colors have been ignored in the visual design of applications.
In civil engineering circles, it's known that a room which is too bright will cause eye strain and fatigue. There is an optimal level of light for the eyes to be most effective. But the computer makers and UI designers don't take this into account. Dark themes transmit less light to the eyes, causing less fatigue over time.
The trick is this: if you look into a bright light you can't see the rest of the room anymore. It's a kind of forced feeding. Not that the designers of our world are guilty of some sinister CONSPIRACY. They simply see awful-white as the only choice based on tests, or they are just imitating what they know.
Sure, it's worrying but worth it for me personally. I might go to jail due to this (seriously) but at least people won't die. For me that's the threshold.
Yes, that way we can beat them up for years to come based on whatever mistake they made. It would be even better if they told us which employee made the mistake so we can incessantly mock that employee openly and publicly every time Slack is ever mentioned on HN. When GitHub was purchased by Microsoft, Gitlab came up quite a bit and we got to rehash that whole database outage over again many times over those few days. It was sad.
If it were my company, I would say as little as humanly possible.
It's not about assigning blame, it's about sharing lessons learned with the broader community and being transparent and honest with paying customers about issues that may have significant impact on downstream productivity.
It’s not about assigning blame for the company writing the post-mortem. But it’s definitely about assigning blame for most people reading the post-mortem. Very few people read post-mortems for the sake of learning how to be better at release engineering and ops.
If I pay for your service, and you are transparent about mistakes and flaws, I will be more forgiving about mistakes and flaws in the future, and appreciate the work you do to fix them.
If I pay for your service, and the only communication is, "We know there is a problem, and we'll let you know when it's fixed", I may assume you are not equipped to thoroughly explain the problem, and therefore not well equipped to solve it.
The blame is already assigned. The users already know there is a problem. A post-mortem likely has a positive effect for the readers attitude toward the handling of the issue.
It’s more the people who don’t pay for the service, but might, that are quickest to see post-mortems in a negative light. The only reason they have for reading them is looking for justifications for culling the product/service from the list of contenders for when they ever have to evaluate solutions in that category.
In other words: post-mortems are good PR, but incredibly bad advertising.
And a world-wide outage followed by "we fixed it and trust us it won't happen again" is going to filter any service off of my list more so than "we had a single point of failure running in our CTO's basement and his cleaning lady pulled the plug. Trust us it won't happen again."
I entirely understand what you are saying, believe me I do. But that is not the way some communities take it. We still see messages like "You could move to Gitlab but... you know they dropped their production database a couple of years back? Use them at your own risk!"
We learned a lot from the Gitlab outage. It was a simple mistake and not one they will have again, yet people still beat them up for it. I'm not sure the value is there for the company to be super open about their outages and issues.
Perhaps - but would you even remember it, without the juicy details of what happened? I probably would forget if some service had a few hours downtime a year or two ago, if I didn't know any details to make it stand out from other outages.
Wouldn't they have gotten beaten up over the outage even more had they not offered an explanation?
In my experience, customers are often seeking an explanation/post-mortem because their customers are seeking an explanation. If an upstream service goes down for an extended period of time and all you can do is go back to your customers and say, "Your system was down because our provider's system went down for 4 hours. But they won't tell us why.", that's not going to go over well.
Gitlab's response to the database mistake was a large contributing factor in my decision to move all of my repositories onto their service.
Anecdotal, sure, but people like me exist. I don't know if we're in the majority. You'd have to measure somehow and do a cost-benefit analysis I guess.
As usual people are taking a comment and twisting it any old way they'd like. Which is fine, that's why we have these communications. To start off, no I am not in aviation. I have run quite a few companies and development departments.
I am not suggesting Slack or anyone else should not communicate at all when they have an outage. A public postmortem, which many people are asking for, is one method. Is it the most effective method? I doubt it. Many people are suggesting that as paying customers they would like to know what happened. Does a public postmortem tell the paying customer what happened in an effective way? Maybe, but maybe not.
When I am running a company I care very much what my paying customers think and are feeling about my service. I will communicate issues directly to them. Do I need to explain to the rest of the world in some great technical detail what happened during an incident? Absolutely not. Do I need to have the first post in Google about my company be an outage postmortem? Of course not. I need my PAYING customers to be pleased with the service I offer and to understand how I will mitigate the damage I have done to them. To me, that's a basic principle of business. I don't have to explain to everyone. I owe everything to my paying customers. Gitlab did a postmortem almost immediately after a major outage and some people tried to slaughter them with the information they shared. It was sad and unfortunate. Their openness was met with some horrible results from the community.
Also, I use Slack. My company uses it for everything, including ChatOps for my production environment deployment. We have a hundred or so active users. The outage this morning harmed us. But you know what? I don't pay for Slack. I owe a lot to Slack, but they don't owe me anything. I can't blame them for my problems this morning. They are a free service to me. I appreciate that their absolutely free service serves my company so well almost all of the time.
Excellent! If you somehow read my entire message and got out of it that Slack shouldn't give you detail about the outage this morning, then I somehow did not convey how important it is to explain issues and resolutions to paying customers. I hope you get a full breakdown and understand exactly how they will keep you from having this sort of outage again. If they don't, then it becomes a value issue to decide whether you should move to another system.
My point is only that it does not have to be a large public explanation. You, or the decision maker at your company, who pays a substantial sum of money to Slack for their service, should get an explanation until you are satisfied.
Maybe unrelated, but my AWS-hosted websockets-using app had an outage starting at the same time. Also a third-party API provider we use for handling inbound phone calls. So this smells like a wider outage than just Slack.
When I was in Moscow a few weeks back, Slack wouldn't work. Exact same behaviour: it loaded up the GUI, loaded up previous conversations, but then wouldn't work past there.
Russia blocks a lot of AWS IPs; when I did a full VPN out to a server in Germany, Slack came good.
That's interesting. More speculation: they haven't given any detail in 2 hours, perhaps if it's an upstream/3rd-party problem, they haven't been given any info.
I know it's not exactly scientific, but the front page of https://downdetector.com shows a number of services that have problem spikes starting anywhere from 3am US/Eastern to 9am US/Eastern and continuing through now (11:24 US/Eastern): Google Home, Fortnite, Exede, Level 3, New York Times, AWS. Maybe totally unrelated to each other, who knows.
I'm wondering the same thing. I chose this morning to soft-launch my side-project/startup and sent out the sign-up link to my e-mail list. Of course, it's AWS Cognito-based, was working yesterday, but failed for the new users. Great timing! Phone support said they are looking into some outages (even though the status page is all green).
Maybe I'm reading too much into it, but "We've received word that all workspaces are having troubles connecting to Slack." makes it sound like their internal monitoring didn't catch whatever is causing this. I was personally experiencing issues for about 20-30 minutes before the status update was posted.
Pretty much every time there's a Slack outage it takes them a solid 20 minutes to update their status page. Several times I've emailed them 10 minutes into an outage (following "nobody at the office can reach Slack, but their status page says smooth sailing, we should do more diagnostics in case it's office internet or something..."), then gotten a response 10 minutes later to the tune of "we're aware, we just updated our status page, go look at that". I think they consider updating their status page a PR problem, so they avoid it if the issue can be fixed in under X minutes.
Which also makes their uptime totals completely bogus.
It's interesting to me that the update messages are posted every 30 minutes from the first notification until resolution. Judging by this and every other outage, I assume this is automatic, probably implemented to appease the people frustrated by the outage.
We use Slack for everyday company wide communication/ announcements and Riot for encrypted secure communications (you can host Riot yourself): https://about.riot.im/
It's not about the protocols, it's about having a client with a user experience that is acceptable to an entire company rather than just a team of engineers. Which decentralized protocol has such a client? (Speaking as someone who got burned trying to advocate for IRC at a company that eventually and inevitably switched to Slack.)
The multitude of clients is one of the problems! How do you find them? Which one do you use? What features matter? Nobody knows! They just want a product with chat rooms and don't understand why it seems so hard to do seemingly simple stuff like create an account or search for that link that someone posted a month ago.
Technical people who haven't used IRC can barely figure out IRC their first time using it. Trying to sell IRC to a company would be hilarious. Bob in Accounting getting on IRC and feeling comfortable with its UX?
A hypothesis I like is that when it's an application you use to communicate with other people, people are a lot less tolerant of a confusing UX.
The reason is that when you sit there clicking through a bunch of menus to find something in QuickBooks (or a typical atrocious enterprise app), nobody sees you; and if you screw something up there, you spend some more time fixing it and nobody sees the screwup. Frustrating maybe, as you waste time, but almost everyone has some frustrating wastes of time at work.
If you're on IRC and people are talking at you and you sit there fumbling to figure out how to respond, it's like you're in a conversation and tongue-tied and everyone's looking at you. And if you screw something up, like send a message to the wrong channel... now you've done it in front of all your coworkers, in real time. Humans hate looking stupid in front of the group.
And if you screw something up on IRC in front of your coworkers, and you're someone with even a little anxiety about not being tech-savvy... that's going to flare right up.
Also, because now you're embarrassed, you're going to want something to blame. So you blame the tool.
Yes. Also, QuickBooks is accounting, which is supposed to be hard while "chatting" with people is supposed to be easy.
QuickBooks doesn't have to suffer in comparison to better UX performing similar tasks in people's personal lives while IRC can be compared (unfavorably) to texting apps, Facebook Messenger, Twitter, AIM once upon a time, etc.
mIRC offers a fairly good UX compared to all those.
If you're setting it up in a corporate environment, just change the ini files so it autoconnects to your server. It'll pop up a list of channels they can join. The server can SAJoin them to particular channels on connection too. The UI is very clean and lightweight: a channel scrolls messages and they appear, there's an input bar at the bottom, and there's a list of users on the side. It's written in MFC and Win32 APIs, so it's blazingly fast compared to most applications, and you can find a version that will run on every computer made in the past 25 years.
The United States military used mIRC extensively for battlefield coordination. I think it's up to the task of handling Bob from accounting.
An image search for mIRC shows that it is ugly as shit. It has a sidebar to list channels but the current channel window is still an undifferentiated mess of handles, commands, and actual conversation. Stored communication is mainly a server-side problem but I don't know if mIRC has an interface to show DMs you missed while offline or to indicate which part of a channel's conversation happened since you last looked.
Even if mIRC would suffice for Windows, you've not handled Macs, phones, etc. Who gives a shit if it runs on a 25 year old computer?
The US military has produced some specific examples of good design but isn't known for highly valuing usability, let alone whether someone would enjoy using a tool. IRC is very functional and mIRC appears to add a little polish beyond a pure command-line interface, but those are bare minimums and not good enough.
> An image search for mIRC shows that it is ugly as shit.
ok
> It has a sidebar to list channels but the current channel window is still an undifferentiated mess of handles, commands, and actual conversation.
No.
Each channel and private message gets its own MDI window you are free to minimize, maximize, or lay out however you want.
Notifications are turned on by default, but they can be disabled. You'll get a tray notification if mIRC is minimized, and the title bar of the window will flash. Notifications happen when your nick is mentioned.
There's a horizontal line that goes across the dialog window that indicates the location of the conversation the last time it was focused.
>Even if mIRC would suffice for Windows, you've not handled Macs, phones, etc. Who gives a shit if it runs on a 25 year old computer?
Other clients work on other platforms. mIRC is just what I brought up since it's a desktop Windows client and that's the most common case for an office environment.
> The US military has produced some specific examples of good design but isn't known for highly valuing usability, let alone whether someone would enjoy using a tool. IRC is very functional and mIRC appears to add a little polish beyond a pure command-line interface, but those are bare minimums and not good enough.
It's a simple, light-weight way for people to send short text messages in near real time with tens of thousands of people. I think that's good enough, and it works at a scale that far surpasses the SaaS chat options.
I don't think it's a problem if something needs to be initially deployed and configured by an IT department (or otherwise tech savvy individual or group), as long as its onboarding and primary usage flows are straightforward. An arbitrary non-tech-savvy but internet-familiar employee needs to be able to create an account, browse and join rooms, and search through history without any hand-holding. Slack and its direct competitors pass this test. IRC doesn't. Does Spark?
It's certainly the closest of all XMPP clients I've used, since it has a very friendly interface. Their related Openfire XMPP server is also targeted at internal deployments and is very easy to configure with a web UI.
Their site (https://matrix.org) reeks of hype-oriented engineering. From the most cursory overview of their home page, their decentralization looks a lot like IRC peering.
If you are interested, we are building a communication platform for communities fully based on XMPP: https://movim.eu/ :) It can easily be deployed on a web server.
Despite having a vote increment velocity far exceeding other items, a publish time of only 25 minutes ago, and more points, this item just dropped from #5 to #7 on the front page.
How’s that work exactly?
Edit: It's now dropped to #14 even with the comment count also rapidly increasing.
> I really don't understand these types of questions. The possible answers range from "because the ranking works that way" to "someone with privileges wanted it that way". On either end of the spectrum, the real question remains: so what? What difference does it make why a particular post is in a particular position? If the title seems interesting, you click on it. If not, you move on.
> I don't mean to question you in particular. It just seems like such a trivial concern to me that I truly can't understand why someone might possibly care.
Eh, IRC networks split and individual servers went down all the time. But yes, there rarely was a complete EFNet outage even if sometimes there were 2 versions of the same channel going at once.
That being said, although I like some of Slack's fancy features, I do wish a distributed alternative could catch on.
Native emoji support, aesthetically pleasing front-ends, and clear product direction are some of the main positives I see, even if the combination of PHP on the backend and Electron on the frontend isn't the most sophisticated technical stack in history.
I prefer decentralized and open things, but a cohesive vision can sometimes provide a better user experience across a more restricted set of functionality than an army of hackers, each solving their own problems.
Have you tried IRCCloud? Their web based front-end is as nice as Slack's but it still works with decentralized IRC servers. They also manage the client's state (unread messages) better than regular IRC bouncers.
Emoji seem to work just fine on IRC nowadays, what do you mean by "native" support? The shortcodes? The fact that there's official clients you can entirely rely on supporting it?
Native emoji support, pretty front-ends, and clear product direction are possibilities on top of IRC (or XMPP), since their absence isn't a core part of IRC (or XMPP) -- it's just not a good way to make a profit if you don't lock down the network and act as the gatekeeper of the interface. Slack's API is fairly open though, and it's not a huge hurdle to interact with it. I built an IRC<->Slack gateway that bridges the differences fairly well ( https://slack.tcl-lang.org/ , you know, if Slack were working).
Small ircds that you would run for a single team don't split, because there's only a single server.
Large networks can have the servers go up and down, and it's still not a big deal because of redundancy. DNS round-robin entries mean you don't even have to know the other servers on the network.
In 2018 netsplits caused by down links are fairly rare. If you wait six months you might see one.
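For what it's worth, the failover the parent describes is simple enough to sketch: a client resolves the round-robin name and just tries each address until one answers. Roughly (hypothetical hostname, plain sockets only):

```python
import socket

IRC_HOST = "irc.example.net"  # hypothetical round-robin name with several A records
IRC_PORT = 6667

def connect_to_irc() -> socket.socket:
    """Try each address behind the round-robin name until one answers.

    This is roughly what IRC clients do: if one server in the pool is down,
    you just land on the next one instead of seeing an outage.
    """
    last_error = None
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
        IRC_HOST, IRC_PORT, type=socket.SOCK_STREAM
    ):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(5)
            sock.connect(sockaddr)
            return sock
        except OSError as exc:
            last_error = exc
            continue
    raise ConnectionError(f"no server behind {IRC_HOST} answered") from last_error
```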
And if you run a small single-point ircd, at some point the server or its internet connection will fail, and you're in the same position as when Slack fails.
There’s nothing that gets around technical failure. Either you have a single server that’s going to die at some point due to sheer entropy, or you have a somewhat complex distributed system with the tradeoffs you desire that might fail anyway.
The downtime would be for a network connection failure and not because your 'fearless' NoSQL container didn't work as expected. If a transient networking problem like this is a big deal for you, you can easily add either more nodes or move the node to a place with more reliable networking.
Or because the IRCd written in 90s-style C++ by some people who honestly don’t know what they’re doing segfaulted, or because you accidentally K-lined 0.0.0.0/32, or because you accidentally filled up the disk with logs because the server’s maintainer was fired and nobody remembers how the system works, or the latest system update borked something, or a failure to update the system allowed someone to attack your network, or the really hacky mechanism you use to enable auth against Active Directory broke or allowed a disabled user to log in, or...
There’s a lot more that can go wrong than that a database falls over. In my experience, IRC servers fall over all the time - it’s just that nobody really cares because their clients just connect to the next server in the list and people resume their conversations a minute later after figuring out what messages actually reached their destination.
Paying IRCCloud to manage an IRC server for you is a reasonable option, but I wouldn’t do it because I think it’s going to be more available, but because I like IRC and believe it provides the functionality I need.
> Or because the IRCd written in 90s-style C++ by some people who honestly don’t know what they’re doing segfaulted,
Don't use a 20 year old ircd then. Use something like ratbox or InspIRCd.
> or because you accidentally K-lined 0.0.0.0/32 or because you accidentally filled up the disk with logs because the server’s maintainer was fired and nobody remembers how the system works, or the latest system update borked something
I believe it's an effort to show a netsplit in the traditional form (server1, server2) without placing blame on a particular server.
Back when I IRCd regularly (and perhaps this is still the case today), certain servers would get a reputation for splitting more than others, and I think this network (and/or its ircd) decided to mask it without breaking the general format.
The power of centralization! If I can't have it, you can't either!
I wouldn't say it's fragile, though. Just like normal IT work, people only pay attention when it isn't working.
Given how much more robust Slack is than IRC as far as features go, it's probably not fragile. The closer a piece of software is to the network layer, the more stable it tends to be, just due to the internet's robustness.
- nope but maybe at some point a yep
- still nope
- nope
- nope
... I know many companies don't like to give details in the heat of the moment (and the engineers that understand are likely working on it), so I really do hope they give us a good retro after it's all over.
Downtimes happen, I get it, but this one has lasted 3+ hours already. I can't even remember the previous time such a large service was down for so long.
Seems to be fixed... with zero info on their status page about what went wrong or otherwise.
>We're happy to report that workspaces should be able to connect again, as we've isolated the problem. Some folks may need to refresh (Ctrl + R or Cmd + R). If you're still experiencing issues, please drop us a line
Hilariously, their "uptime in the last 30 days" still shows 100%.
While I appreciate the timely status updates, it almost seems like Slack has built a random status update bot to post updates that don't say anything exactly every 30 minutes.
Its pitch is a little different: it gives you 'Workspaces' which are somewhat connected, and tools to manage big deployments.
IBM, Oracle and many large companies use it because 100,000+ participants in one workspace is quite unmanageable.
Think channel namespacing whilst unifying user provisioning and enabling DM and MPDM across the entire company. Users can have access to one or many namespaces; they sign in once and it populates all enabled workspaces into that user's client.
You can share channels between workspaces within Enterprise Grid fairly trivially (although this now works between Slack tenancies owned by different companies too!)
Still runs on the same infrastructure in AWS as other Slack customers though.
From a policy perspective you can push down settings to all Workspaces in your SEG, and define whether you “centrally control” or “delegate to Workspace owners” on a setting by setting basis.
They have too many 'decentralized', i.e. blockchain, things on their landing page for my liking. However, since blockchain 'technologies' are the wonder kool-aid for everything, and given that messaging 'apps' are trivial compared to rocket surgery, how come there isn't a messaging app that is decentralised with these wonder technologies, where it only costs you a few cryptokitties to get your messages and where you earn a few dogecoin to forward on other people's messages?
If your org happens to be part of the Microsoft Office 365 ecosystem, there's Microsoft Teams. All of the products support bots, GIFs, and emojis. I personally think Google Chat and Stride are much faster than Slack, too. I haven't tried Microsoft Teams yet.
Matrix (https://matrix.org) is a good alternative to both Slack and Discord. The most complete client implementation is Riot (https://riot.im/).
The protocol itself is federated, so you can communicate with other Matrix users from your self-hosted instance. There are also bridges to IRC, XMPP, even Slack..
And HipChat is horrible. It doesn't even sync across multiple devices, and their mobile app doesn't support the iPhone X screen size, which is a trivial update to make considering HipChat is used by some pretty massive customers. Code highlighting is still (after years) pretty bad.
This is why I like working on iOS apps: if all hell breaks loose I can't do anything. When something like Slack goes belly up, imagine those poor folks having to respond.
There are so many open source solutions, Mattermost, Rocketchat, etc, why are companies willing to pay $100k/yr for Slack? What was the defining feature that others didn't have? Even Discord feels like it has far more features than Slack.
Name recognition, employee familiarity (I've used Slack at every job I've worked since early 2015, I pretty much know what to expect from it always), and punting maintenance costs (this is probably the biggest factor).
I love IRC and XMPP. I'd love to run one of those, or some new service (Matrix?), at work. However, my time is arguably better spent doing anything _other_ than maintaining such services, and the same goes for most engineers at most companies, sadly.
Side factor: the mobile clients for IRC and XMPP almost universally suck, at least on Android. I imagine if those problems had been solved in a reliable way, more companies may consider them (assuming the allocation of engineering resources problem isn't a problem).
There may not be one defining feature. For some it may be look and feel/attention to detail. It could be the number of well supported Slack integrations, or various enterprise features wrt message retention and deletion, SSO and auditing.
That's a lot of eggs in their basket. A major downside of the current way a lot of these companies work is that there's a huge incentive for them to never allow customers to self-host their product.
Reasons my self-hosted servers have gone down in the past year:
- Scheduled electrical maintenance that facilities manager failed to disclose (even though they knew about it for weeks).
- Emergency power-down because two of the four air conditioners failed at the same time.
- Someone accidentally powered off the VM.
I'd much rather have an hour long outage here and there than incur the cost of defending against these circumstances (and still have it go down for some new unforeseen reason).
How is that self-hosting when you don't control the hypervisor in this case?
It usually implies that you at least have some sort of control: either having a real server somewhere (with a UPS and stuff) or at home, where you know when the power is out.
While what you are doing is technically self-hosting, I would have changed the VM provider after the first incident like the ones you described.
You just gave a good argument for why s/he should use Slack. Yeah, s/he might be doing it wrong, but so what? One should focus on core business, not system administration.
Yup, every time one of the popular centralized XaaS platforms go down, there's always the snarky "Heh, well my stuff is self-hosted...", and they are always the types that have no idea how to value their time.
I don't see it as too much of a problem so long as you're not one of those teams that orchestrates their deployments using a Slack bot. Just make sure you have an agreed backup mechanism, e.g. Google Hangouts, Zoom, etc.
If only there were a venerable, decentralized instant messaging system we could use, perhaps some kind of internet relay chat system...
</sarcasm>
You reap what you sow. Depending on Slack for your communications is a bad idea. I can't even remember the last time any of the IRC networks I frequent had a total outage.
IRC left out federation when the spec was crafted in 1984. It was deemed too bandwidth heavy.
Jabber/XMPP was a good step in the right direction. Too bad it overused plugins, bad extensions, and XML abuse. It would have been loads better had they thought far enough in advance.
It's not federated, but it is distributed and fault tolerant. The protocol is open and widely implemented and the implementations are mature and stable.
I mean, Slack isn't federated either. I don't know of any chat platforms that are federated except Jabber. Edit: Gchat and AIM federated in 2011, but AIM is dead now, so...
In practice, all the attempts I've seen at getting a significant number of non-techies in a company to use IRC have failed. At one company, we almost got everyone on Jabber, but it was never used much outside the tech circles. Slack? Everyone is using it and most seem happy about it, and it does a lot more than IRC.
https://news.ycombinator.com/item?id=16108912 - 5 months ago (longer discussion)
https://news.ycombinator.com/item?id=15597461 - 7 months ago
https://news.ycombinator.com/item?id=15597431 - 8 months ago
https://news.ycombinator.com/item?id=13811815 - 1 year ago
https://news.ycombinator.com/item?id=10616743 - 3 years ago