
It really depends on the failure mode and the cost of failure. As others have mentioned, you can run into issues in external services that you have no control over, and the best you can do in that case is fail gracefully until you're able to deal with the issue. If failure is easy to detect and a restart fixes the problem, it can be quite straightforward to set up some monitoring scripts that take care of this for you. Even if the fix is more complicated than a restart, monitoring can at least notify you by email or SMS. Keeping your tech simple and/or having high test coverage or formal verification can reduce your error rate. Similarly, you can introduce fault tolerance into the system with something like Erlang's OTP or monitored containers in an orchestrator (K8s, Docker Swarm, some cloud solution). If failures are expensive, you might want to take on staff to deal with them; if the cost is low, you might just want to accept occasional downtime (though you'll want to think about how you report that to your users).
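
To make the "detect, restart, otherwise notify" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `watch`, `check`, `restart` and `notify` are hypothetical names, and in a real setup `check` might hit a health endpoint, `restart` might call your service manager, and `notify` might send the email or SMS.

```python
import time
from typing import Callable, Optional


def watch(check: Callable[[], bool],
          restart: Callable[[], None],
          notify: Callable[[str], None],
          max_restarts: int = 3,
          interval: float = 30.0,
          cycles: Optional[int] = None) -> int:
    """Poll a health check, restart on failure, and escalate to a human
    once `max_restarts` consecutive restarts have not helped.

    Returns the total number of restarts performed.
    """
    consecutive = 0  # restarts attempted since the last healthy check
    total = 0        # restarts attempted overall
    done = 0         # polling cycles completed so far
    while cycles is None or done < cycles:
        done += 1
        if check():
            consecutive = 0          # healthy again, reset the counter
        elif consecutive < max_restarts:
            consecutive += 1
            total += 1
            restart()                # the cheap fix: try a restart
        else:
            # Restarting is not fixing it, so tell someone and stop.
            notify(f"service still unhealthy after {max_restarts} restarts")
            break
        time.sleep(interval)
    return total
```

Passing the three actions in as callables keeps the loop itself trivial to test with fakes, and `cycles` exists only so the loop can be bounded in tests; left as `None` it runs until it escalates.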


+1 for Erlang. Learning how to write OTP apps in Erlang taught me so much about building reliable systems.



