In other words, the triple fault[1] is broken? That looks like a bad hardware bug, a really bad one. AFAIK on real hardware it does what it should, i.e. it causes the CPU to reset. The fact that OSes have relied on triple-faulting to force a reset[2] makes this all the more unusual. Then again, I suppose no one has really tried to run MS-DOS and related software in Xen...
> the vulnerability can be avoided altogether if the guest kernel is controlled by the host rather than guest administrator
That sort of defeats the point of using a VM, doesn't it?
It doesn't defeat the purpose. VMs are still useful for consolidating services that each need their own OS but wouldn't justify a dedicated box.
I believe the phrase "it is architecturally specified that these would be delivered sequentially" means that a double fault (#DF) doesn't always occur: whether it does depends on the classes of the two exceptions involved. This rule goes back to the 80386.
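A rough sketch of those rules, based on the classic Intel classification into benign, contributory and page-fault exceptions. This is a simplified subset (later manuals add vectors not listed here), and `CLASS_BY_VECTOR` and `on_nested_exception` are hypothetical names for illustration:

```python
from enum import Enum


class ExcClass(Enum):
    BENIGN = "benign"
    CONTRIBUTORY = "contributory"
    PAGE_FAULT = "page fault"
    DOUBLE_FAULT = "double fault"


# Vector -> class (simplified subset of the classic Intel table).
CLASS_BY_VECTOR = {
    0: ExcClass.CONTRIBUTORY,   # #DE divide error
    1: ExcClass.BENIGN,         # #DB debug
    2: ExcClass.BENIGN,         # NMI
    3: ExcClass.BENIGN,         # #BP breakpoint
    4: ExcClass.BENIGN,         # #OF overflow
    5: ExcClass.BENIGN,         # #BR BOUND range exceeded
    6: ExcClass.BENIGN,         # #UD invalid opcode
    7: ExcClass.BENIGN,         # #NM device not available
    8: ExcClass.DOUBLE_FAULT,   # #DF double fault
    10: ExcClass.CONTRIBUTORY,  # #TS invalid TSS
    11: ExcClass.CONTRIBUTORY,  # #NP segment not present
    12: ExcClass.CONTRIBUTORY,  # #SS stack fault
    13: ExcClass.CONTRIBUTORY,  # #GP general protection
    14: ExcClass.PAGE_FAULT,    # #PF page fault
    16: ExcClass.BENIGN,        # #MF x87 floating-point error
    17: ExcClass.BENIGN,        # #AC alignment check
}


def on_nested_exception(first: int, second: int) -> str:
    """What happens if `second` is raised while delivering `first`."""
    a, b = CLASS_BY_VECTOR[first], CLASS_BY_VECTOR[second]
    if a is ExcClass.DOUBLE_FAULT and b in (ExcClass.CONTRIBUTORY, ExcClass.PAGE_FAULT):
        return "shutdown"  # triple fault
    if a is ExcClass.CONTRIBUTORY and b is ExcClass.CONTRIBUTORY:
        return "#DF"
    if a is ExcClass.PAGE_FAULT and b in (ExcClass.CONTRIBUTORY, ExcClass.PAGE_FAULT):
        return "#DF"
    # Everything else, including benign-during-benign, is handled serially.
    # (#DF followed by a benign exception is treated as serial here too;
    # that combination is an assumption of this sketch.)
    return "serial"
```

Under these rules `on_nested_exception(17, 17)` returns `"serial"`: an #AC raised while delivering #AC is simply delivered again, which is exactly the loop discussed in this thread.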
That has always been there, but I guess the wording is a bit unclear, or the edge case where a "benign" exception occurs while handling another one was never really considered. If I had time, I'd try these scenarios on real hardware to see whether a double or triple fault happens, or whether the CPU really does get stuck in a loop.
The real problem may not be this edge case itself, since real hardware can also get stuck in an infinite loop (after all, any process running in a VM can easily execute one); it's that the host loses control of the virtualised CPU.
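To illustrate why losing control is the real issue, here is a toy model; it does not describe real VMX/SVM behaviour in detail. It assumes, as a simplification, that the host's timer interrupt can only force a VM exit at an instruction boundary, and that the hypervisor's only other way back in is to intercept the exception vector itself. `run_vcpu` and its parameters are hypothetical.

```python
def run_vcpu(vector: int, retriggers: bool, intercepted: frozenset[int],
             max_steps: int = 10_000) -> str:
    """Toy model of a vCPU that keeps taking exception `vector`.

    retriggers:  True if delivering `vector` raises `vector` again before the
                 handler's first instruction completes (benign during benign).
    intercepted: exception vectors the hypervisor has asked to trap on.
    Returns how (or whether) the host regains control within `max_steps`.
    """
    # Assumption: the host timer has already fired, but in this model it can
    # only be taken once the guest reaches an instruction boundary.
    at_boundary = False
    for _ in range(max_steps):
        if at_boundary:
            return "vm-exit: timer"
        if vector in intercepted:
            return "vm-exit: exception intercept"
        # Deliver `vector`. If delivery raises it again, the new exception is
        # delivered serially and no instruction boundary is ever reached.
        at_boundary = not retriggers
    return "stuck"
```

In this model an ordinary guest loop (`retriggers=False`) still hands control back on the next timer tick, while a self-retriggering #AC (vector 17) only ends if the hypervisor intercepts vector 17.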
> Will Intel or AMD be able to fix this in microcode (by making it do the right thing if an external interrupt or NMI arrives)?
What I want to know is: does this affect other hypervisors as well? If this is a CPU bug, why haven't we heard about it from KVM, VMware, etc.?
I can't believe Xen basically just said "run PVM or get pwned"
>I can't believe Xen basically just said "run PVM or get pwned"
They didn't say that. Scroll down to the "RESOLUTION" section: they include a patch that presumably solves the problem in HVM mode. They are just mentioning (as they should) that if you are running in PV mode, this particular problem doesn't apply.
Well, the worst case is a hardware lockup, so it's a denial-of-service (DoS) rather than a data-theft exploit. It's still a massive issue, but cloud companies like Amazon could ban customers who abuse it.
OTOH, if malware starts to deliberately trigger this, everyone loses.
#DB is usually triggered only when debugging, and #AC is even rarer. Most likely the microcode fix would raise a different exception instead, with a triple fault as the last resort.
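The escalation idea can be sketched as a modified delivery loop. This is purely hypothetical and not how any shipping microcode is known to behave; `deliver_with_escalation` and `raised_during` are made-up names. The assumed rule is that if delivering an exception raises a vector already seen in the current delivery chain, the CPU escalates to #DF instead of delivering serially forever, and a fault while delivering #DF ends in shutdown (a triple fault).

```python
from typing import Callable, Optional

DF = 8  # double-fault vector (#DF)


def deliver_with_escalation(vector: int,
                            raised_during: Callable[[int], Optional[int]]) -> str:
    """Hypothetical delivery loop with cycle detection.

    raised_during(v) returns the vector raised while delivering v, or None
    if delivery succeeds and the handler starts executing. Assumes it only
    returns vectors 0-255, so the cycle check guarantees termination.
    """
    chain = [vector]
    current = vector
    while True:
        nested = raised_during(current)
        if nested is None:
            return f"handler for vector {current} runs"
        if current == DF:
            # Simplified: real hardware shuts down on a contributory fault or
            # page fault during #DF delivery; here any fault does.
            return "shutdown (triple fault)"
        # Hypothetical rule: a vector repeating within the chain escalates to #DF.
        current = DF if nested in chain else nested
        chain.append(current)
```

With `raised_during = lambda v: 17 if v == 17 else None` (the #AC-during-#AC case), the loop escalates once and the #DF handler runs instead of the CPU spinning; if #DF delivery also faults, the result is the triple fault mentioned above.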
[1] https://en.wikipedia.org/wiki/Triple_fault
[2] http://www.rcollins.org/Productivity/TripleFault.html