This post-mortem has me thinking about the best way to handle the situation where you can't SSH into your server. The OP decided to trigger a kernel panic/reboot on OOM errors (roughly the setup sketched after this list), but I have a couple of concerns about that approach:
* If memory serves, when your system runs out of memory, shouldn't the kernel kill the processes that are using the most memory? If so, the system should recover from the OOM condition and no restart should be needed.
* OOM errors aren't the only way to get a system into a state where you can't SSH in, so it would be great to have a more general solution.
* Even if you do restart, unless you had some kind of performance monitoring enabled, the system is no longer in the high-memory state, so it will take a bit of digging to determine the root cause. Since OOM kills do get logged to the kernel log (and usually on to syslog), I guess this isn't a big deal.
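For reference, here is roughly what the OP's panic-and-reboot setup looks like, plus how you'd dig the OOM evidence out of the logs afterwards. This is a sketch assuming a stock Linux kernel and a syslog daemon that persists kernel messages; the exact values and log paths are illustrative.

```sh
# Panic on OOM instead of invoking the OOM killer, and reboot
# automatically 10 seconds after any panic (presumably what the
# OP configured; the values here are illustrative).
cat >> /etc/sysctl.conf <<'EOF'
vm.panic_on_oom = 1
kernel.panic = 10
EOF
sysctl -p

# Hunt for OOM messages afterwards. dmesg only covers the current
# boot, so after a reboot check the persisted syslog instead:
dmesg | grep -i 'out of memory'
grep -i 'out of memory' /var/log/syslog   # /var/log/messages on some distros
```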
I suppose the best fail-safe solution is to ensure you always have one of the following:
* physical access to the system
* a way to access the console indirectly (something like VMware vSphere comes to mind)
* a service like Linode that lets you reboot the system remotely, which would have been useful in this scenario
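On that last point, most VPS providers expose the reboot through an API or CLI as well as the web panel. As a hedged example, assuming Linode's linode-cli is installed and configured with an API token (the exact subcommand is my assumption, not something from the post):

```sh
# Reboot a Linode instance from any machine you still have a shell on.
# 12345 is a placeholder instance ID.
linode-cli linodes reboot 12345
```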
* In linux-land, there's an OOM killer (http://linux-mm.org/OOM_Killer) that would have started taking processes out (you can also bias which processes it targets; see the sketch after this list). You have to exhaust swap for it to really kick in, and once you're hitting swap, the entire machine suddenly becomes hugely I/O bound; in shared or virtual hosting environments, that usually makes the machine totally unresponsive.
* I've never seen any sort of virtual hosting service without either a remote console or a remote reboot. Usually both.
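On the OOM-killer point: you can bias its victim selection so critical processes like sshd are killed last, or never, which helps you keep SSH access through a memory crunch. A minimal sketch, assuming a kernel recent enough to expose oom_score_adj (older kernels use /proc/<pid>/oom_adj with a different scale):

```sh
# Exempt sshd from the OOM killer entirely. -1000 is OOM_SCORE_ADJ_MIN,
# which disables OOM killing for the process; a milder value like -500
# just makes it an unlikely victim.
for pid in $(pidof sshd); do
    echo -1000 > /proc/$pid/oom_score_adj
done
```

Note this has to be reapplied whenever sshd restarts; under systemd you'd set it persistently with OOMScoreAdjust= in the unit file.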