
I appreciate the honest feedback. We could have done better communicating about the problem. We've been marking single host failures in the dashboard for affected users and using our status page to reflect things like platform and regional issues, but there's clearly a spot in the middle where the status we're communicating and actual user experience don't line up.

We've been adding a ton more hardware lately to stay ahead of capacity issues and as you would expect this means the volume of hardware-shaped failures has increased even though the overall failure probability has decreased. There's more we can do to help users avoid these issues, there's more we can do to speed up recovery, and there's more we can do to let you know when you're impacted.

All this feedback matters. We hear it even when we drop the ball communicating.


What hardware are you buying? Across tens of thousands of physical nodes in my environment, only a few per year would have problems "fatal" enough to require manual intervention. Yes, we had hundreds of drives die a year and some ECC RAM would exceed error thresholds, but downtime on any given node was rare (aside from patching, and even then we'd just live migrate KVM instances around as needed).


Maybe there needs to be a better "burn in" test setup for their new hardware, just to catch mistakes in the build prep and/or catch bad hardware?


Not that nothing will fail - but some manufacturers just have really good fault management, monitoring, alerting, etc. Even the simplest stuff like SNMP with a few custom MIBs from the vendor goes a long way (and some vendors do it better than others). Facilities and vendors that lend a good hand with remote hands are also nice, should your remote management infrastructure fail. Out-of-band, full-featured management cards with all the trimmings work so well, and some do good Redfish BMC/JSON/API stuff on top of the usual SNMP and other built-in easy buttons. Today's tooling around bare metal and KVM makes working around faults quite seamless. There are even good NVMe RAID options if you absolutely must have your local box with mirrored data protection, and 10/40/100Gbps cards with a good libvirt setup can migrate large VMs in mere minutes, resuming on the remote end with barely a 1ms blip.
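
To give a concrete flavor of that last bit, a libvirt live migration is typically a one-liner along these lines (the VM and host names here are made up):

    virsh migrate --live --persistent --verbose my-vm qemu+ssh://dest-host.example/system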


Good point. :)

I'm still wondering about their hardware acceptance/qualification though, prior to it being deployed. ;)


Yah, presumably they put stuff through its paces and give everything a good fit and finish before running workloads. But failures do happen either way.


Could you expand your answer to list vendors which you would recommend?


"it depends". Dell is fairly good overall, on-site techs are outsourced subcontractors a lot so that can be a mixed bag, pushy sales. Supermicro is good on a budget, not quite mature full fault management or complete SNMP or redfish, they can EOL a new line of gear suddenly.


Have you come across Fujitsu PRIMERGY servers before?

https://www.fujitsu.com/global/products/computing/servers/pr...

I used to use them a few years ago in a local data centre, and they were pretty good back then.

They don't seem to be widely known about though.


Have not - looks nice though. Around here you'll mostly only encounter Dell/Supermicro/HP/Lenovo. I actually find Dell to have achieved the lowest "friction" for deployments. You can get device manifests before the gear even ships, including MAC addresses, serials, out-of-band NIC MACs, etc. We pre-stage our configurations based on this and have everything ready to go (rack location/RU, switch ports, PDUs, DHCP/DNS). We literally just plug it all up and power on, and our tools take care of the rest without any intervention. Just verify the serial number of the server and stick it in the right rack unit, done.
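
As a rough illustration of the kind of pre-staging that makes possible (the column layout, file name, and address range are all assumptions, not what we actually use):

    # generate ISC dhcpd host entries for the out-of-band NICs from a vendor
    # manifest CSV with assumed columns: serial,nic_mac,oob_mac
    n=10
    while IFS=, read -r serial nic_mac oob_mac; do
        printf 'host %s-oob { hardware ethernet %s; fixed-address 10.10.0.%d; }\n' \
            "$serial" "$oob_mac" "$n"
        n=$((n + 1))
    done < manifest.csv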


> You can get device manifests before the gear even ships, including MAC addresses, serials, out of band NIC MAC, etc.

That does sound pretty useful.

So for yourselves, you rack them then run hardware qualification tests?


I'm a GitHub employee and want to let everyone know we're aware of the problems this incident is causing and are actively working to mitigate the impact.

"A global event is affecting an upstream DNS provider. GitHub services may be intermittently available at this time." is the content from our latest status update on Twitter (https://twitter.com/githubstatus/status/789452827269664769). Reposted here since some people are having problems resolving Twitter domains as well.

[x-posted on https://news.ycombinator.com/item?id=12759697 as well]


How did you post on Twitter, which is down too?


Twitter isn't down, just DNS resolution of twitter.com


Cached DNS response.


I'm a GitHub employee and want to let everyone know we're aware of the problems this incident is causing and are actively working to mitigate the impact.

"A global event is affecting an upstream DNS provider. GitHub services may be intermittently available at this time." is the content from our latest status update on Twitter (https://twitter.com/githubstatus/status/789452827269664769). Reposted here since some people are having problems resolving Twitter domains as well.


I'm curious why you don't host your status page on a different domain/provider? When checking this AM why GitHub was down, I also couldn't reach the status page.


+1

The only way that I could check to see if Github knew they were having problems was by searching Google for "github status", and then seeing from the embedded Twitter section in the results page that there was a tweet about having problems. Twitter also being down for me didn't help the situation either.


The attack is on the DNS servers, which take names like www.github.com and resolve them to IP addresses (e.g. 192.30.253.112 for me). Their status page is status.github.com - it is on the same domain name (github.com) as the rest of the site. Normally this isn't a problem, because unavailability is usually caused by something going on with a server, not DNS.

In this case, the servers that know how to turn both www.github.com and status.github.com into an IP address (the DNS servers at Dyn) were under attack and couldn't respond to queries. The only way to mitigate this would be to have a completely different domain (e.g. githubstatus.com) and host its DNS with a different company (i.e. not Dyn).
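
You can see the shared dependency with a couple of quick queries (output will vary):

    dig +short NS github.com        # the authoritative name servers for the zone
    dig +short www.github.com       # both of these resolve through those servers
    dig +short status.github.com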


Right, this was my point. Hosting "status.domain.com" doesn't help much when it's "domain.com" that's having the problem. I think today's event will make a lot of companies consider this a bit more.


Hiiiinnnnndsiighhhttttt!!!!! Yeaaaahhhhyeahh!

Anyway, for them to take the github.com nameservers out of the mix they would need a completely separate domain name; would you know to look there?

You can delegate subdomains to other providers, but the NS records for the delegation are still served from the servers listed at the registrar. So you'd already need multiple DNS providers... and then you wouldn't have been down. Just sayin'. I'm not sure anyone rated a DNS provider of this stature getting hit this hard or this completely as a high enough risk to go through the trouble.
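
A quick way to see where the dependency really sits, even for a delegated subdomain:

    # +trace walks resolution from the roots; the delegation for status.github.com
    # still has to come from the github.com name servers (the ones under attack)
    dig +trace status.github.com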

It's easy enough to look at a system and point out all the things you depend on as being a risk. The harder part is deciding which risks are high enough priority to address instead of all the other work to be done.


I mean, some organizations do take precautions against this point of failure and use a separate status domain. Most don't.

https://www.dynstatus.com/ (using Route 53, at least today)

https://www.cloudflarestatus.com/ (using Dyn, ironically)


If it helps any, this link seems to work for me to reach the GitHub status page (it requires an https certificate override, of course):

https://107.22.212.99/
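
If you'd rather not click through the certificate warning, curl can pin the name to that IP for you (assuming the address is still serving the status vhost):

    curl --resolve status.github.com:443:107.22.212.99 https://status.github.com/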


Lots of companies use Twitter for that sort of real-time status reporting, whose own up/down status one would think is sufficiently uncorrelated... unfortunately the internet is complicated.


+1 Logical question!


This is what you can do to restore your GitHub access:

    # find an IP for github.com you've connected to before over SSH
    grep github ~/.ssh/known_hosts
    # add a line like "192.30.253.112 github.com" using that IP
    sudo vim /etc/hosts
    # flush the local DNS cache (this is the macOS command; other platforms differ)
    sudo killall -HUP mDNSResponder
    # verify github.com now resolves to the pinned address
    ping github.com


I added

192.30.253.112 github.com

but https://assets-cdn.github.com is failing

EDIT: Use

    192.30.253.112     github.com
    151.101.24.133     assets-cdn.github.com

or try 8.8.8.8 DNS


Why am I being downvoted for providing useful information? I don't understand HN...


Probably because you say to edit /etc/hosts but not what the content should be.


Is it hard to guess? The output of grep isn't a hint?


…except they did though, at least if you've sshed into github at some point (which I think nearly everyone has).


If you're attempting to understand the behavior of individual users of HN as a collective, I can assure you that your initial principles are hampering you greatly.


Not sure if people aren't OK with the content but you've posted it twice, which is not really cool with most people or the guidelines.

Also probably the "hijacking top comment" part.

The other occurrence being here: https://news.ycombinator.com/item?id=12760156


May not be HN doing the downvotes my friend.


Seems like the right thing to do. However, the IP address itself won't respond either.


Just being curious, why don't you use different DNS servers?


(I'm not Github, but I work for a Dyn customer) Using multiple DNS providers has technical and organizational issues.

From a technical perspective, if you're doing fancy DNS things like geo-targeting, round-robining through more A records than you'll return to a query, or health checks to fail IPs out of your rotations, using multiple providers means they're likely to be out of sync, especially if the provider capabilities don't match. That may not be terrible, because some resolvers are going to cache DNS answers for way longer than the TTL and you have to deal with that anyway. You'll also have to think about what to do when an update applied successfully to one provider but the second provider failed to apply it.

From an organizational perspective, most enterprise DNS costs a bunch of money, with volume discounts, so paying for two services, each at half the volume, is going to be significantly more expensive than just one. And you have to deal with two enterprise sales teams bugging you to try their other products, asking for testimonials, etc, bleh.

Also, the enterprise DNS I shopped with all claimed they ran multiple distinct clusters, so they should be covered for software risks that come from shipping the same broken software to all servers and having them all fall over at the same time.


Most services, even if they aren't the size of Github, can't change their DNS provider on a dime.


It's not a question of switching; you can host your DNS records at multiple providers.


Yup, that's what I meant. They can use different DNS providers, e.g. Route53 AND Dyn.


Route53 doesn't allow using it as slave DNS. https://forums.aws.amazon.com/thread.jspa?threadID=56011


More accurately, they don't support the common standard mechanisms for transferring zone data between primary and secondary name servers (like NOTIFY, AXFR, etc).

There is nothing stopping you from having Route53 and $others as NS records for your domains. You just have to make sure they stay consistent. Apparently, from the linked discussion, there are people offering scripts and services to do just that.
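
A bare-bones version of that kind of script might look like this (the secondary provider's import command is an assumption; tools like cli53 exist for Route 53, but check the syntax):

    # pull the zone from a primary that permits AXFR, then push it to the
    # second provider via their CLI/API (provider names here are placeholders)
    dig AXFR example.com @ns1.primary-provider.example > example.com.zone
    # e.g. with a Route 53 import tool such as cli53 (hypothetical invocation):
    # cli53 import --file example.com.zone example.com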


Keeping serial numbers in sync can be basically impossible.


Serial numbers don't matter if you're not using NOTIFY/AXFR.


That's why you should have a different domain name:

githubstatus.com instead of status.github.com

You could even throw the domain on a free DNS service.


Maybe not, but you can store your records in a local place and push to both.

That's one of the reasons I set up a git -> Route53 service at https://dns-api.com/


If this is consistently a problem, why doesn't Github have fallback TLDs that use different DNS providers? Or even just code the site to work with static IPs? I tried the Github IP and it didn't load, but that could be due to an unrelated issue.


> If this is consistently a problem why doesn't Github have fallback TLDs

I don't believe this has been consistently a problem in the past. But after today, big services probably will have fallback TLDs.


Another status update from GitHub: "We have migrated to an unaffected DNS provider. Some users may experience problems with cached results as the change propagates."

We're maintaining yellow status for the foreseeable future while the changes to our NS records propagate. If you have the ability to flush caches for your resolver, this may help restore access.
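
For reference, the usual local cache flush commands look something like this (your OS and resolver setup may differ):

    sudo killall -HUP mDNSResponder        # macOS
    sudo systemd-resolve --flush-caches    # Linux with systemd-resolved
    ipconfig /flushdns                     # Windows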

Latest status message: https://twitter.com/githubstatus/status/789565863649304576


I love how the White House & GH posted a statement on Twitter... which we can't access since it's down.


Twitter's working fine for me. This attack is affecting different people differently; since it's a DDoS against a distributed system (DNS) with a lot of redundancy, it's possible for some people to be hit badly while others aren't affected at all.

I briefly lost access to GitHub, but Twitter has been working fine every time I've checked. Posting status messages in multiple venues helps to ensure that even if one channel is down, people might be able to get status from another channel.


I wish you guys used statuspage or at least allowed email updates for the status of GitHub services.


Anycast usually implies traffic will be directed to the nearest node advertising that prefix. The GLB directors leverage ECMP, which provides the ability to balance flows across many available paths.


Anycast and ECMP work together in the context of load balancing. ECMP without Anycasted destination IPs would be pointless for horizontally scaling your LB tier.

What Anycast means is just that multiple hosts share the same IP address - as opposed to unicast. When all the nodes sharing the same IP are on the same subnet, "nearest" is kind of irrelevant. So the implication is different.


Sure. Feel free to call it anycast then. I usually hear anycast routing used in the context of achieving failover or routing flows to the closest server/POP, but there is probably a more formal definition in an RFC that I'll be pointed to shortly. =)

We are using BGP to advertise prefixes for GLB inside the data center to route flows to the directors. In our case all of the nodes are not on the same subnet (or at least not guaranteed to be) which is one of the reasons why we chose to avoid solutions requiring multicast. I expect Joe and Theo will get into more details about that in a future post though.


Are you running Quagga or Bird on the director instances then? I'm looking forward to reading more about it.


We use Quagga.
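
For the curious, announcing a director's /32 from Quagga via vtysh looks roughly like this (the ASNs, addresses, and neighbor are placeholders, not our actual config):

    vtysh -c 'configure terminal' \
          -c 'router bgp 65001' \
          -c 'neighbor 10.0.0.1 remote-as 65000' \
          -c 'network 192.0.2.10/32'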


This is really cool work, I worked with a team that implemented an ECMP hashing scheme using a set of IPs kept alive by VRRP in a previous lifetime, so I have a bit of familiarity with the space and a few questions.

The article says the L4 layer uses ECMP with consistent/rendezvous hashing. Is this vendor-implemented or implemented by you using OpenFlow or something similar? How does graceful removal at the director layer work? I know you would have to start directing incoming SYNs to another group, but how do you differentiate non-SYN packets that started on the draining group vs. ones that started on the new group?

If you are using L4 fields in the hash, how do you handle ICMP? This approach could break PMTU discovery, because an ICMP fragmentation-needed packet, sent in response to a message sent to one of your DSR boxes, might hash to a different box unless considerations have been made.
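
(For anyone unfamiliar with rendezvous hashing, here's a rough shell illustration of the idea, with made-up director names and an arbitrary hash; it's not how GLB actually computes it:)

    # each director gets a score = hash(flow_id + director); the highest score wins,
    # so removing one director only remaps the flows that had picked that director
    flow="10.1.2.3:54321->203.0.113.10:443"
    for d in director-01 director-02 director-03; do
        printf '%s %s\n' "$(printf '%s%s' "$flow" "$d" | md5sum | cut -c1-8)" "$d"
    done | sort | tail -n1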


A "whole" piece of software would require you to have made the same data center design decisions we've made at GitHub. While some of our choices are opinionated, I think you'll find the GLB architecture adheres to the unix philosophy of individual components each doing one thing well.

Either way, I hope the upcoming engineering-focused posts are interesting and informative! Developing GLB was a challenging engineering project and if open-sourcing it means other companies can benefit from our work and spend more time developing their products, then I'll consider that a success.


I think you're referring to the GitHub Engineering blog post [1] about our git storage tier. We [2] store your code on at least 3 servers, which is an improvement in many ways from our previous storage architecture. There are a lot of servers [3] powering things but not the millions it would require to give every customer three dedicated machines. Developing efficient solutions to problems is a requirement (and a fun challenge!) for anything at GitHub's scale.

[1] http://githubengineering.com/introducing-dgit/

[2] I'm a GitHubber. https://github.com/jssjr

[3] https://twitter.com/GitHubEng/status/730429227896463360


It was another post about someone using Github to host "packages".


This is really great work. Do you have any plans to open source some (or all) of the code behind Silverton?


GitHub's physical infrastructure team doesn't dictate what technologies our engineers can run on our hardware. We are interested in providing reliable server resources in an easily consumable manner. If someone wants to provision hardware to run Docker containers or similar, that's great!

We may eventually offer higher order infrastructure or platform services internally, but it's not our current focus.



We use collectd extensively and it is wonderful software. Brubeck and collectd do very different jobs.


What does Brubeck do that the collectd statsd plugin can't?

https://collectd.org/wiki/index.php/Plugin:StatsD

