Well, okay, so your process crashes, you restart it, it crashes a few more times, then you kill it. What's the advantage there? How does this increase availability, beyond killing it the first time it crashes?
It seems actively worse to allow users to retry requests that are doomed to failure than to put up a fail-whale or similar while the ops team is being paged.
Because most production bugs are infrequent (otherwise they would have been caught by testing). They have to be logged and fixed, but they must not be allowed to move the system into an inconsistent state. Restart first, fix later.
Are they? The bug discussed in this comment was extremely deterministic. There's a difference between "infrequent" in the sense that it happens rarely across lots of users and lots of requests, and "infrequent" in the sense that, for any one particular use, it only triggers sometimes.
Also, the bug discussed in this article wasn't causing crashes. What would you propose be crashed and restarted in this case?
"3.4 The Restart Frequency Limit Mechanism"