Decently built servers don't fail that often. If you're going to run a couple thousand servers, you'll want them to be bare metal, because VM overhead is real.
If your system is designed properly, it's not unreasonable to have a node out of service for a couple of hours while it's repaired. Nodes that don't hold database state can often wait until business hours to be repaired, as long as you have enough nodes that a few can be down at once.
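To make that "enough nodes" check concrete, here's a rough sketch of the arithmetic. The `Pool` fields, the `can_defer_repair` name, and the rule that stateful nodes never wait are all my own assumptions for illustration, not anything from a real tool; your actual capacity numbers and policy will differ.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    total_nodes: int   # nodes provisioned in the pool
    nodes_needed: int  # nodes required to carry peak load (assumed known)
    nodes_down: int    # nodes already out of service


def can_defer_repair(pool: Pool, stateful: bool) -> bool:
    """Return True if one more failed node can wait until business hours.

    Assumption: nodes holding database state are always repaired promptly;
    stateless nodes can wait as long as the pool still covers peak load.
    """
    if stateful:
        return False
    healthy_after_failure = pool.total_nodes - pool.nodes_down - 1
    return healthy_after_failure >= pool.nodes_needed


if __name__ == "__main__":
    pool = Pool(total_nodes=20, nodes_needed=16, nodes_down=2)
    print(can_defer_repair(pool, stateful=False))
```

With 20 nodes, 16 needed at peak, and 2 already down, losing one more still leaves 17 healthy, so the repair can wait. Once 4 are already down, the next failure drops you below 16 and someone gets paged.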