I appreciate the honest feedback. We could have done better communicating about the problem. We've been marking single host failures in the dashboard for affected users and using our status page to reflect things like platform and regional issues, but there's clearly a spot in the middle where the status we're communicating and actual user experience don't line up.
We've been adding a ton more hardware lately to stay ahead of capacity issues and as you would expect this means the volume of hardware-shaped failures has increased even though the overall failure probability has decreased. There's more we can do to help users avoid these issues, there's more we can do to speed up recovery, and there's more we can do to let you know when you're impacted.
All this feedback matters. We hear it even when we drop the ball communicating.
What hardware are you buying? Across tens of thousands of physical nodes in my environment, only a few would have "fatal" enough problems that required manual intervention per year.
Yes we had hundreds of drives die a year, and some ECC RAM would exceed error thresholds, but downtime on any given node was rare (aside from patching, and for that we'd just live migrate KVM instances around as needed).
Not that nothing will fail - but some manufacturers have just really good fault management, monitoring, alerting, etc.
And even the simplest shit like SNMP with a few custom MIBs from the vendor helps (there are some vendors that do it better than others). Facilities and vendors that lend a good hand with remote hands are also nice, should your remote management infrastructure fail. But out-of-band, full-featured management cards with all the trimmings work so well. Some do good Redfish BMC/JSON/API stuff too, on top of the usual SNMP and other nice built-in easy buttons.
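For a rough sense of what that Redfish side looks like in practice, here's a minimal sketch using the requests package against the standard DMTF service root at /redfish/v1/ (the BMC address and credentials below are placeholders, not any particular vendor's defaults):

    import requests

    # Hypothetical BMC address and credentials -- substitute your own.
    BMC = "https://10.0.0.50"
    AUTH = ("admin", "password")

    # The Systems collection is a standard Redfish path.
    # verify=False only because many BMCs ship with self-signed certs.
    systems = requests.get(f"{BMC}/redfish/v1/Systems", auth=AUTH, verify=False).json()

    for member in systems.get("Members", []):
        system = requests.get(f"{BMC}{member['@odata.id']}", auth=AUTH, verify=False).json()
        # Power state and health roll-up live in the standard ComputerSystem schema.
        print(system.get("Id"), system.get("PowerState"), system.get("Status", {}).get("Health"))

Point the same few lines at a fleet of BMCs and you have a poor man's out-of-band health poller.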
And today's tooling with bare metal and KVM makes working around faults quite seamless. There are even good NVMe RAID options if you absolutely must have your local box with mirrored data protection, and 10/40/100Gbps cards with a good libvirt setup will migrate large VMs in mere minutes, resuming on the remote end with barely a 1ms blip.
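As an illustration of the kind of libvirt-driven live migration described above, here's a minimal sketch with the libvirt-python bindings (the host URIs and guest name are made up; it assumes shared storage or an otherwise migratable guest):

    import libvirt

    # Hypothetical source/destination hosts and guest name.
    src = libvirt.open("qemu+ssh://host-a/system")
    dst = libvirt.open("qemu+ssh://host-b/system")

    dom = src.lookupByName("big-vm")

    # Live migration: the guest keeps running while memory is copied over,
    # then resumes on the destination after a brief pause.
    dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)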
"it depends". Dell is fairly good overall, on-site techs are outsourced subcontractors a lot so that can be a mixed bag, pushy sales.
Supermicro is good on a budget, not quite mature full fault management or complete SNMP or redfish, they can EOL a new line of gear suddenly.
Have not - looks nice though. Around here, you'll mostly only encounter Dell/Supermicro/HP/Lenovo. I actually find Dell to have achieved the lowest "friction" for deployments.
You can get device manifests before the gear even ships, including MAC addresses, serials, out of band NIC MAC, etc. We pre-stage our configurations based on this, have everything ready to go (rack location/RU, switch ports, PDUs, DHCP/DNS).
We literally just plug it all up and power on, and our tools take care of the rest without any intervention. Just verify the serial number of the server and stick it in the right rack unit, done.
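To make that concrete, here's a minimal sketch of the sort of pre-staging glue involved; the manifest columns, file names, and naming scheme are hypothetical, not anyone's actual tooling:

    import csv

    # Hypothetical vendor manifest columns: serial, rack, ru, oob_mac, oob_ip, host_ip.
    with open("manifest.csv") as fh, \
         open("dhcpd-hosts.conf", "w") as dhcp, \
         open("hosts.zone", "w") as zone:
        for row in csv.DictReader(fh):
            hostname = f"node-{row['rack']}-{row['ru']}"
            # DHCP reservation for the out-of-band management controller.
            dhcp.write(
                f"host {hostname}-oob {{ hardware ethernet {row['oob_mac']}; "
                f"fixed-address {row['oob_ip']}; }}\n"
            )
            # Forward DNS record for the host itself.
            zone.write(f"{hostname}  IN  A  {row['host_ip']}\n")

Once the manifest is loaded, racking becomes a verify-serial-and-plug-in exercise, exactly as described above.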
I'm a GitHub employee and want to let everyone know we're aware of the problems this incident is causing and are actively working to mitigate the impact.
"A global event is affecting an upstream DNS provider. GitHub services may be intermittently available at this time." is the content from our latest status update on Twitter (https://twitter.com/githubstatus/status/789452827269664769). Reposted here since some people are having problems resolving Twitter domains as well.
I'm curious why you don't host your status page on a different domain/provider? When checking this AM why GitHub was down, I also couldn't reach the status page.
The only way that I could check to see if Github knew they were having problems was by searching Google for "github status", and then seeing from the embedded Twitter section in the results page that there was a tweet about having problems. Twitter also being down for me didn't help the situation either.
The attack is on the DNS servers, which take names like www.github.com and resolve them to IP addresses (e.g. 192.30.253.112 for me). Their status page is status.github.com - it is on the same domain name (github.com) as the rest of the site. Normally this isn't a problem because an availability issue is usually something going on with a server, not DNS.
In this case, the servers at Dyn that know how to turn both www.github.com and status.github.com into an IP address were under attack and couldn't respond to queries. The only way to mitigate this would be to have a completely different domain (e.g. githubstatus.com) and host its DNS with a different company (i.e. not Dyn).
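You can see the shared dependency yourself with the dnspython package (a quick sketch; the answers will obviously vary by resolver and over time):

    import dns.resolver  # pip install dnspython

    # Both hostnames live under the github.com zone, so they depend on the same
    # authoritative nameservers -- the ones that were hosted at Dyn.
    ns = [rr.target.to_text() for rr in dns.resolver.resolve("github.com", "NS")]
    print("github.com is served by:", ns)

    for name in ("www.github.com", "status.github.com"):
        addrs = [rr.to_text() for rr in dns.resolver.resolve(name, "A")]
        print(name, "->", addrs)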
Right, this was my point. Hosting "status.domain.com" doesn't help much when it's "domain.com" that's having the problem. I think today's event will make a lot of companies consider this a bit more.
Anyway, for them to take the github.com nameservers out of the mix they would need a completely separate domain name; would you know to look there?
You can delegate subdomains to other providers, but the NS records for the domain itself are still served by the nameservers listed at the registrar. So you'd already need multiple DNS providers... and then you wouldn't have been down. Just sayin'. I'm not sure anyone rated a DNS provider of this stature getting hit this hard, or this completely, as a high enough risk to go to the trouble.
It's easy enough to look at a system and point out all the things you depend on as being a risk. The harder part is deciding which risks are high enough priority to address instead of all the other work to be done.
Lots of companies use Twitter for that sort of real-time status reporting, whose own up/down status one would think is sufficiently uncorrelated... unfortunately the internet is complicated.
If you're attempting to understand the behavior of individual users of HN as a collective, I can assure you that your initial principles are hampering you greatly.
(I'm not Github, but I work for a Dyn customer) Using multiple DNS providers has technical and organizational issues.
From a technical perspective, if you're doing fancy DNS things like geo targeting, round robin through more A records than you'll return to a query, or health checks to fail IPs out of your rotations, using multiple providers means they're likely to be out of sync, especially if the providers' capabilities don't match. That may not be terrible, because some resolvers are going to cache DNS answers for way longer than the TTL and you have to deal with that anyway. You'll also have to think about what to do when an update applied successfully to one provider but the second provider failed to apply it.
From an organizational perspective, most enterprise DNS costs a bunch of money, with volume discounts, so paying for two services, each at half the volume, is going to be significantly more expensive than just one. And you have to deal with two enterprise sales teams bugging you to try their other products, asking for testimonials, etc, bleh.
Also, the enterprise DNS I shopped with all claimed they ran multiple distinct clusters, so they should be covered for software risks that come from shipping the same broken software to all servers and having them all fall over at the same time.
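If you do run two providers, the consistency check itself isn't the hard part; it's everything around it. A rough sketch with dnspython, where the nameserver IPs and the record name are placeholders:

    import dns.resolver  # pip install dnspython

    # Hypothetical authoritative servers at two different DNS providers.
    PROVIDER_A = "198.51.100.1"
    PROVIDER_B = "203.0.113.1"

    def answers(nameserver, name, rdtype="A"):
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [nameserver]
        return sorted(rr.to_text() for rr in r.resolve(name, rdtype))

    a = answers(PROVIDER_A, "www.example.com")
    b = answers(PROVIDER_B, "www.example.com")
    if a != b:
        print("providers have drifted:", a, "vs", b)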
More accurately, they don't support the common standard mechanisms for transferring zone data between primary and secondary name servers (like NOTIFY, AXFR, etc.).
There is nothing stopping you from having Route53 and $others as NS records for your domains. You just have to make sure they stay consistent. Apparently, from the linked discussion, there are people offering scripts and services to do just that.
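For providers that do support the standard mechanisms, pulling a full copy of a zone to replay at a secondary looks roughly like this with dnspython (the primary's IP and the zone name are placeholders, and the primary has to allow AXFR from your address):

    import dns.query
    import dns.zone  # pip install dnspython

    # Hypothetical primary nameserver and zone; AXFR must be permitted for this client.
    zone = dns.zone.from_xfr(dns.query.xfr("198.51.100.1", "example.com"))
    for name, node in zone.nodes.items():
        for rdataset in node.rdatasets:
            print(name, rdataset)

Providers that don't support AXFR have to be kept in sync through their APIs instead, which is where those scripts and services come in.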
If this is consistently a problem, why doesn't GitHub have fallback TLDs that use different DNS providers? Or even just code the site to work with static IPs. I tried the GitHub IP and it didn't load, but that could be due to an unrelated issue.
> If this is consistently a problem why doesn't Github have fallback TLDs
I don't believe this has been consistently a problem in the past. But after today big services probably will have fallback TLDs.
Another status update from GitHub: "We have migrated to an unaffected DNS provider. Some users may experience problems with cached results as the change propagates."
We're maintaining yellow status for the foreseeable future while the changes to our NS records propagate. If you have the ability to flush caches for your resolver, this may help restore access.
Twitter's working fine for me. This attack is affecting different people differently; as a DDoS attacking a distributed system (DNS) with a lot of redundancy, it's possible for some people to be affected badly while others aren't affected at all.
I briefly lost access to GitHub, but Twitter has been working fine every time I've checked. Posting status messages in multiple venues helps to ensure that even if one channel is down, people might be able to get status from another channel.
Anycast usually implies traffic will be directed to the nearest node advertising that prefix. The GLB directors leverage ECMP, which provides the ability to balance flows across many available paths.
Anycast and ECMP work together in the context of load balancing. ECMP without Anycasted destination IPs would be pointless for horizontally scaling your LB tier.
What Anycast means is just that multiple hosts share the same IP address - as opposed to unicast. When all the nodes sharing the same IP are on the same subnet, "nearest" is kind of irrelevant. So the implication is different.
Sure. Feel free to call it anycast then. I usually hear anycast routing used in the context of achieving failover or routing flows to the closest server/POP, but there is probably a more formal definition in an RFC that I'll be pointed to shortly. =)
We are using BGP to advertise prefixes for GLB inside the data center to route flows to the directors. In our case all of the nodes are not on the same subnet (or at least not guaranteed to be) which is one of the reasons why we chose to avoid solutions requiring multicast. I expect Joe and Theo will get into more details about that in a future post though.
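As a toy illustration of the ECMP half of this (not GLB's actual implementation; real routers do the hashing in hardware): each flow's 5-tuple is hashed, and the result selects one of the equal-cost next hops, so packets belonging to the same TCP connection keep landing on the same director.

    import hashlib

    # Hypothetical director IPs advertised as equal-cost next hops for the same prefix.
    DIRECTORS = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]

    def pick_director(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
        # Hash the flow 5-tuple; same flow -> same index -> same director.
        key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}".encode()
        digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        return DIRECTORS[digest % len(DIRECTORS)]

    print(pick_director("203.0.113.7", 51512, "192.30.253.112", 443))

The naive modulo here is exactly what reshuffles flows when the set of next hops changes, which is the problem the consistent/rendezvous hashing discussed below is meant to soften.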
This is really cool work. In a previous lifetime I worked with a team that implemented an ECMP hashing scheme using a set of IPs kept alive by VRRP, so I have a bit of familiarity with the space and a few questions.
The article says the L4 layer uses ECMP with consistent/rendezvous hashing. Is this vendor-implemented, or implemented by you using OpenFlow or something similar? How does graceful removal at the director layer work? I know you would have to start directing incoming SYNs to another group, but how do you differentiate non-SYN packets that started on the draining group vs. ones that started on the new group?
If you are using L4 fields in the hash, how do you handle ICMP? This approach could break PMTU discovery, because an ICMP "fragmentation needed" packet sent in response to a message sent to one of your DSR boxes might hash to a different box, unless considerations have been made.
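For reference, the rendezvous (highest-random-weight) hashing the article mentions works roughly like this; a generic sketch, not GLB's code:

    import hashlib

    def rendezvous_pick(flow_key, backends):
        # Score every backend for this flow and pick the highest. Removing a
        # backend only remaps the flows that scored highest on it; everyone
        # else keeps their mapping, which is what makes draining graceful.
        def score(backend):
            h = hashlib.sha256(f"{flow_key}|{backend}".encode()).digest()
            return int.from_bytes(h[:8], "big")
        return max(backends, key=score)

    backends = ["proxy-1", "proxy-2", "proxy-3", "proxy-4"]
    flow = "203.0.113.7:51512->192.30.253.112:443"
    print(rendezvous_pick(flow, backends))
    # Dropping a backend only moves the flows that were pinned to it.
    print(rendezvous_pick(flow, [b for b in backends if b != "proxy-2"]))

It doesn't answer the draining question by itself (you still need some way to tell in-flight connections apart from new ones), but it's the piece that keeps unrelated flows stable when membership changes.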
A "whole" piece of software would require you to have made the same data center design decisions we've made at GitHub. While some of our choices are opinionated, I think you'll find the GLB architecture adheres to the unix philosophy of individual components each doing one thing well.
Either way, I hope the upcoming engineering-focused posts are interesting and informative! Developing GLB was a challenging engineering project and if open-sourcing it means other companies can benefit from our work and spend more time developing their products, then I'll consider that a success.
I think you're referring to the GitHub Engineering blog post [1] about our git storage tier. We [2] store your code on at least 3 servers, which is an improvement in many ways from our previous storage architecture. There are a lot of servers [3] powering things but not the millions it would require to give every customer three dedicated machines. Developing efficient solutions to problems is a requirement (and a fun challenge!) for anything at GitHub's scale.
GitHub's physical infrastructure team doesn't dictate what technologies our engineers can run on our hardware. We are interested in providing reliable server resources in an easily consumable manner. If someone wants to provision hardware to run Docker containers or similar, that's great!
We may eventually offer higher order infrastructure or platform services internally, but it's not our current focus.