I appreciate the honest feedback. We could have done better communicating about the problem. We've been marking single host failures in the dashboard for affected users and using our status page to reflect things like platform and regional issues, but there's clearly a spot in the middle where the status we're communicating and actual user experience don't line up.
We've been adding a ton more hardware lately to stay ahead of capacity issues and as you would expect this means the volume of hardware-shaped failures has increased even though the overall failure probability has decreased. There's more we can do to help users avoid these issues, there's more we can do to speed up recovery, and there's more we can do to let you know when you're impacted.
All this feedback matters. We hear it even when we drop the ball communicating.
What hardware are you buying? Across tens of thousands of physical nodes in my environment, only a few would have "fatal" enough problems that required manual intervention per year.
Yes we had hundreds of drives die a year, and some ECC RAM would exceed error thresholds, but downtime on any given node was rare (aside from patching, and for that we'd just live migrate KVM instances around as needed).
Not that nothing will fail - but some manufacturers have just really good fault management, monitoring, alerting, etc.
And even the simplest shit like SNMP with a few custom MIBs from the vendor helps (there are some vendors that do it better than others). Facilities and vendors that lend a good hand with remote hands are also nice, should your remote management infrastructure fail. But out-of-band, full-featured management cards with all the trimmings work so well. Some do good Redfish BMC/JSON/API stuff too, on top of the usual SNMP and other nice built-in easy buttons.
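For a rough sense of what that Redfish side looks like in practice, here's a minimal sketch using the requests package against the standard DMTF service root at /redfish/v1/ (the BMC address and credentials below are placeholders, not any particular vendor's defaults):

    import requests

    # Hypothetical BMC address and credentials -- substitute your own.
    BMC = "https://10.0.0.50"
    AUTH = ("admin", "password")

    # The Systems collection is a standard Redfish path.
    # verify=False only because many BMCs ship with self-signed certs.
    systems = requests.get(f"{BMC}/redfish/v1/Systems", auth=AUTH, verify=False).json()

    for member in systems.get("Members", []):
        system = requests.get(f"{BMC}{member['@odata.id']}", auth=AUTH, verify=False).json()
        # Power state and health roll-up live in the standard ComputerSystem schema.
        print(system.get("Id"), system.get("PowerState"), system.get("Status", {}).get("Health"))

Point the same few lines at a fleet of BMCs and you have a poor man's out-of-band health poller.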
And today's tooling with bare metal and KVM makes working around faults quite seamless. There are even good NVMe RAID options if you absolutely must have your local box with mirrored data protection, and 10/40/100Gbps cards with a good libvirt setup will migrate large VMs in mere minutes, resuming on the remote end with barely a 1ms blip.
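As an illustration of the kind of libvirt-driven live migration described above, here's a minimal sketch with the libvirt-python bindings (the host URIs and guest name are made up; it assumes shared storage or an otherwise migratable guest):

    import libvirt

    # Hypothetical source/destination hosts and guest name.
    src = libvirt.open("qemu+ssh://host-a/system")
    dst = libvirt.open("qemu+ssh://host-b/system")

    dom = src.lookupByName("big-vm")

    # Live migration: the guest keeps running while memory is copied over,
    # then resumes on the destination after a brief pause.
    dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)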
"it depends". Dell is fairly good overall, on-site techs are outsourced subcontractors a lot so that can be a mixed bag, pushy sales.
Supermicro is good on a budget, not quite mature full fault management or complete SNMP or redfish, they can EOL a new line of gear suddenly.
Have not - looks nice though. Around here, you'll mostly only encounter Dell/Supermicro/HP/Lenovo. I actually find Dell to have achieved the lowest "friction" for deployments.
You can get device manifests before the gear even ships, including MAC addresses, serials, out of band NIC MAC, etc. We pre-stage our configurations based on this, have everything ready to go (rack location/RU, switch ports, PDUs, DHCP/DNS).
We literally just plug it all up and power on, and our tools take care of the rest without any intervention. Just verify the serial number of the server and stick it in the right rack unit, done.
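To make that concrete, here's a minimal sketch of the sort of pre-staging glue involved; the manifest columns, file names, and naming scheme are hypothetical, not anyone's actual tooling:

    import csv

    # Hypothetical vendor manifest columns: serial, rack, ru, oob_mac, oob_ip, host_ip.
    with open("manifest.csv") as fh, \
         open("dhcpd-hosts.conf", "w") as dhcp, \
         open("hosts.zone", "w") as zone:
        for row in csv.DictReader(fh):
            hostname = f"node-{row['rack']}-{row['ru']}"
            # DHCP reservation for the out-of-band management controller.
            dhcp.write(
                f"host {hostname}-oob {{ hardware ethernet {row['oob_mac']}; "
                f"fixed-address {row['oob_ip']}; }}\n"
            )
            # Forward DNS record for the host itself.
            zone.write(f"{hostname}  IN  A  {row['host_ip']}\n")

Once the manifest is loaded, racking becomes a verify-serial-and-plug-in exercise, exactly as described above.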
I'm a GitHub employee and want to let everyone know we're aware of the problems this incident is causing and are actively working to mitigate the impact.
"A global event is affecting an upstream DNS provider. GitHub services may be intermittently available at this time." is the content from our latest status update on Twitter (https://twitter.com/githubstatus/status/789452827269664769). Reposted here since some people are having problems resolving Twitter domains as well.
I'm curious why you don't host your status page on a different domain/provider? When checking this AM why GitHub was down, I also couldn't reach the status page.
The only way that I could check to see if Github knew they were having problems was by searching Google for "github status", and then seeing from the embedded Twitter section in the results page that there was a tweet about having problems. Twitter also being down for me didn't help the situation either.
The attack is on the DNS servers, which take names like www.github.com and resolve them to IP addresses (e.g. 192.30.253.112 for me). Their status page is status.github.com - it is on the same domain name (github.com) as the rest of the site. Normally this isn't a problem because an availability issue is usually something going on with a server, not DNS.
In this case, the servers at Dyn that know how to turn both www.github.com and status.github.com into an IP address were under attack and couldn't respond to queries. The only way to mitigate this would be to have a completely different domain (e.g. githubstatus.com) and host its DNS with a different company (i.e. not Dyn).
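You can see the shared dependency yourself with the dnspython package (a quick sketch; the answers will obviously vary by resolver and over time):

    import dns.resolver  # pip install dnspython

    # Both hostnames live under the github.com zone, so they depend on the same
    # authoritative nameservers -- the ones that were hosted at Dyn.
    ns = [rr.target.to_text() for rr in dns.resolver.resolve("github.com", "NS")]
    print("github.com is served by:", ns)

    for name in ("www.github.com", "status.github.com"):
        addrs = [rr.to_text() for rr in dns.resolver.resolve(name, "A")]
        print(name, "->", addrs)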
Right, this was my point. Hosting "status.domain.com" doesn't help much when it's "domain.com" that's having the problem. I think today's event will make a lot of companies consider this a bit more.
Anyway, for them to take the github.com nameservers out of the mix they would need a completely separate domain name; would you know to look there?
You can delegate subdomains to other providers, but the NS records for the domain itself are still served by the nameservers listed at the registrar. So you'd already need multiple DNS providers... and then you wouldn't have been down. Just sayin'. I'm not sure anyone rated a DNS provider of this stature getting hit this hard, or this completely, as a high enough risk to go to the trouble.
It's easy enough to look at a system and point out all the things you depend on as being a risk. The harder part is deciding which risks are high enough priority to address instead of all the other work to be done.
Lots of companies use Twitter for that sort of real-time status reporting, whose own up/down status one would think is sufficiently uncorrelated... unfortunately the internet is complicated.
If you're attempting to understand the behavior of individual users of HN as a collective, I can assure you that your initial principles are hampering you greatly.
(I'm not Github, but I work for a Dyn customer) Using multiple DNS providers has technical and organizational issues.
From a technical perspective, if you're doing fancy DNS things like geo targeting, round robin through more A records than you'll return to a query, or health checks to fail IPs out of your rotations, using multiple providers means they're likely to be out of sync, especially if the providers' capabilities don't match. That may not be terrible, because some resolvers are going to cache DNS answers for way longer than the TTL and you have to deal with that anyway. You'll also have to think about what to do when an update applied successfully to one provider but the second provider failed to apply it.
From an organizational perspective, most enterprise DNS costs a bunch of money, with volume discounts, so paying for two services, each at half the volume, is going to be significantly more expensive than just one. And you have to deal with two enterprise sales teams bugging you to try their other products, asking for testimonials, etc, bleh.
Also, the enterprise DNS I shopped with all claimed they ran multiple distinct clusters, so they should be covered for software risks that come from shipping the same broken software to all servers and having them all fall over at the same time.
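If you do run two providers, the consistency check itself isn't the hard part; it's everything around it. A rough sketch with dnspython, where the nameserver IPs and the record name are placeholders:

    import dns.resolver  # pip install dnspython

    # Hypothetical authoritative servers at two different DNS providers.
    PROVIDER_A = "198.51.100.1"
    PROVIDER_B = "203.0.113.1"

    def answers(nameserver, name, rdtype="A"):
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [nameserver]
        return sorted(rr.to_text() for rr in r.resolve(name, rdtype))

    a = answers(PROVIDER_A, "www.example.com")
    b = answers(PROVIDER_B, "www.example.com")
    if a != b:
        print("providers have drifted:", a, "vs", b)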
More accurately, they don't support the common standard mechanisms for transferring zone data between primary and secondary name servers (like NOTIFY, AXFR, etc.).
There is nothing stopping you from having Route53 and $others as NS records for your domains. You just have to make sure they stay consistent. Apparently, from the linked discussion, there are people offering scripts and services to do just that.
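For providers that do support the standard mechanisms, pulling a full copy of a zone to replay at a secondary looks roughly like this with dnspython (the primary's IP and the zone name are placeholders, and the primary has to allow AXFR from your address):

    import dns.query
    import dns.zone  # pip install dnspython

    # Hypothetical primary nameserver and zone; AXFR must be permitted for this client.
    zone = dns.zone.from_xfr(dns.query.xfr("198.51.100.1", "example.com"))
    for name, node in zone.nodes.items():
        for rdataset in node.rdatasets:
            print(name, rdataset)

Providers that don't support AXFR have to be kept in sync through their APIs instead, which is where those scripts and services come in.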
If this is consistently a problem, why doesn't GitHub have fallback TLDs that use different DNS providers? Or even just code the site to work with static IPs. I tried the GitHub IP and it didn't load, but that could be due to an unrelated issue.
> If this is consistently a problem why doesn't Github have fallback TLDs
I don't believe this has been consistently a problem in the past. But after today big services probably will have fallback TLDs.
Another status update from GitHub: "We have migrated to an unaffected DNS provider. Some users may experience problems with cached results as the change propagates."
We're maintaining yellow status for the foreseeable future while the changes to our NS records propagate. If you have the ability to flush caches for your resolver, this may help restore access.
Twitter's working fine for me. This attack is affecting different people differently; as a DDoS attacking a distributed system (DNS) with a lot of redundancy, it's possible for some people to be affected badly while others aren't affected at all.
I briefly lost access to GitHub, but Twitter has been working fine every time I've checked. Posting status messages in multiple venues helps to ensure that even if one channel is down, people might be able to get status from another channel.
Anycast usually implies traffic will be directed to the nearest node advertising that prefix. The GLB directors leverage ECMP, which provides the ability to balance flows across many available paths.
Anycast and ECMP work together in the context of load balancing. ECMP without Anycasted destination IPs would be pointless for horizontally scaling your LB tier.
What Anycast means is just that multiple hosts share the same IP address - as opposed to unicast. When all the nodes sharing the same IP are on the same subnet, "nearest" is kind of irrelevant. So the implication is different.
Sure. Feel free to call it anycast then. I usually hear anycast routing used in the context of achieving failover or routing flows to the closest server/POP, but there is probably a more formal definition in an RFC that I'll be pointed to shortly. =)
We are using BGP to advertise prefixes for GLB inside the data center to route flows to the directors. In our case all of the nodes are not on the same subnet (or at least not guaranteed to be) which is one of the reasons why we chose to avoid solutions requiring multicast. I expect Joe and Theo will get into more details about that in a future post though.
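As a toy illustration of the ECMP half of this (not GLB's actual implementation; real routers do the hashing in hardware): each flow's 5-tuple is hashed, and the result selects one of the equal-cost next hops, so packets belonging to the same TCP connection keep landing on the same director.

    import hashlib

    # Hypothetical director IPs advertised as equal-cost next hops for the same prefix.
    DIRECTORS = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]

    def pick_director(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
        # Hash the flow 5-tuple; same flow -> same index -> same director.
        key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}".encode()
        digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        return DIRECTORS[digest % len(DIRECTORS)]

    print(pick_director("203.0.113.7", 51512, "192.30.253.112", 443))

The naive modulo here is exactly what reshuffles flows when the set of next hops changes, which is the problem the consistent/rendezvous hashing discussed below is meant to soften.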
This is really cool work. In a previous lifetime I worked with a team that implemented an ECMP hashing scheme using a set of IPs kept alive by VRRP, so I have a bit of familiarity with the space and a few questions.
The article says the L4 layer uses ECMP with consistent/rendezvous hashing. Is this vendor-implemented, or implemented by you using OpenFlow or something similar? How does graceful removal at the director layer work? I know you would have to start directing incoming SYNs to another group, but how do you differentiate non-SYN packets that started on the draining group vs. ones that started on the new group?
If you are using L4 fields in the hash, how do you handle ICMP? This approach could break PMTU discovery, because an ICMP "fragmentation needed" packet sent in response to a message sent to one of your DSR boxes might hash to a different box, unless considerations have been made.
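For reference, the rendezvous (highest-random-weight) hashing the article mentions works roughly like this; a generic sketch, not GLB's code:

    import hashlib

    def rendezvous_pick(flow_key, backends):
        # Score every backend for this flow and pick the highest. Removing a
        # backend only remaps the flows that scored highest on it; everyone
        # else keeps their mapping, which is what makes draining graceful.
        def score(backend):
            h = hashlib.sha256(f"{flow_key}|{backend}".encode()).digest()
            return int.from_bytes(h[:8], "big")
        return max(backends, key=score)

    backends = ["proxy-1", "proxy-2", "proxy-3", "proxy-4"]
    flow = "203.0.113.7:51512->192.30.253.112:443"
    print(rendezvous_pick(flow, backends))
    # Dropping a backend only moves the flows that were pinned to it.
    print(rendezvous_pick(flow, [b for b in backends if b != "proxy-2"]))

It doesn't answer the draining question by itself (you still need some way to tell in-flight connections apart from new ones), but it's the piece that keeps unrelated flows stable when membership changes.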
A "whole" piece of software would require you to have made the same data center design decisions we've made at GitHub. While some of our choices are opinionated, I think you'll find the GLB architecture adheres to the unix philosophy of individual components each doing one thing well.
Either way, I hope the upcoming engineering-focused posts are interesting and informative! Developing GLB was a challenging engineering project and if open-sourcing it means other companies can benefit from our work and spend more time developing their products, then I'll consider that a success.
I think you're referring to the GitHub Engineering blog post [1] about our git storage tier. We [2] store your code on at least 3 servers, which is an improvement in many ways from our previous storage architecture. There are a lot of servers [3] powering things but not the millions it would require to give every customer three dedicated machines. Developing efficient solutions to problems is a requirement (and a fun challenge!) for anything at GitHub's scale.
GitHub's physical infrastructure team doesn't dictate what technologies our engineers can run on our hardware. We are interested in providing reliable server resources in an easily consumable manner. If someone wants to provision hardware to run Docker containers or similar, that's great!
We may eventually offer higher order infrastructure or platform services internally, but it's not our current focus.