Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I worked as an engineer for a very large non tech company (but used a lot of tech, both bought and in-house). We had 100s of teams supporting services, internal apps (web and mobile), external apps (web and mobile), and connections to vendors plus a huge infrastructure in the real world that interconnected to all of this. One time someone changed something in a single data center (I vaguely remember some kind of DNS or routing update) and every single system worldwide failed in a short time. Even after the issue was resolved, it took most of a day and hundreds of people to successfully restart everything, all while our actual business had to continue without pissing off all of our customers. The triage was brutal as to what mattered most.

You can't do this without a lot of people. Sure you could pare it down, maybe improve some architecture, but without a ton of people involved who understand the systems and how they connect, when things might go south they may never return.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: