XSA-156: x86: CPU lockup during exception delivery

userbinator · on Dec 7, 2015

In other words, the triple-fault[1] is broken? This looks like a bad hardware bug, a really bad one. AFAIK on real hardware it does what it should, i.e. causes the CPU to reset. The fact that OSs in the past have relied on triple-faulting to cause a reset[2] makes this all the more unusual. Then again, I suppose no one has really tried to run MS-DOS and related software in Xen...

> the vulnerability can be avoided altogether if the guest kernel is controlled by the host rather than guest administrator

That sort of defeats the point of using a VM, doesn't it?

[1] https://en.wikipedia.org/wiki/Triple_fault

[2] http://www.rcollins.org/Productivity/TripleFault.html

Sanddancer · on Dec 7, 2015

It doesn't defeat the purpose. VMs are still useful for aggregating services that each need their own OS but would be a waste of a dedicated box.

zvrba · on Dec 7, 2015

Triple fault? It seems that a double fault exception (abort) should be triggered first.

userbinator · on Dec 7, 2015

I believe the phrase "it is architecturally specified that these would be delivered sequentially" means that #DF doesn't always occur, depending on what the two exception types were; this goes back to the 80386:

http://intel80386.com/386htm/s09_08.htm

That has always been there, but I guess the wording is a bit unclear/the edge case where a "benign exception" occurs while handling another one was never really considered. If I had time I'd try these scenarios on real hardware to see if double or triple-fault happens, or if the CPU does get stuck in a loop.

The real problem might not be this edge-case itself, if real hardware can also get into an infinite loop (after all, some process running in a VM can easily execute one of those); it's the fact that the host loses control of the virtualised CPU.
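To make the "delivered sequentially" rule concrete, here is a minimal sketch of the exception-class lookup described in the manual linked above. The class assignments follow my reading of the Intel manuals and cover only a handful of vectors; the names `EXCEPTION_CLASS` and `second_exception_outcome` are made up for illustration and do not come from the advisory.

```python
# Sketch only: class assignments are an assumption based on the Intel
# manuals' double-fault rules, and only a few vectors are listed.
BENIGN = "benign"
CONTRIBUTORY = "contributory"
PAGE_FAULT = "page_fault"

EXCEPTION_CLASS = {
    0: CONTRIBUTORY,   # #DE divide error
    1: BENIGN,         # #DB debug
    10: CONTRIBUTORY,  # #TS invalid TSS
    11: CONTRIBUTORY,  # #NP segment not present
    12: CONTRIBUTORY,  # #SS stack fault
    13: CONTRIBUTORY,  # #GP general protection
    14: PAGE_FAULT,    # #PF page fault
    17: BENIGN,        # #AC alignment check
}


def second_exception_outcome(first, second):
    """Return what happens when `second` is raised while delivering `first`.

    "sequential" means the two exceptions are handled one after the other;
    "double_fault" means #DF is raised instead.
    """
    a = EXCEPTION_CLASS[first]
    b = EXCEPTION_CLASS[second]
    if a == CONTRIBUTORY and b == CONTRIBUTORY:
        return "double_fault"
    if a == PAGE_FAULT and b in (CONTRIBUTORY, PAGE_FAULT):
        return "double_fault"
    # Any pairing that involves a benign exception is delivered sequentially.
    return "sequential"
```

Under these assumptions, `second_exception_outcome(1, 1)` and `second_exception_outcome(17, 17)` both come out as "sequential", so a #DB or #AC raised while delivering another one never escalates to #DF, which is the case the advisory is about.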

yuhong · on Dec 7, 2015

Yep, I think they said the problem is the CPU hanging in an infinite loop in microcode with not even SMIs being delivered.

comex · on Dec 7, 2015

Two things I'm wondering:

- What kind of performance impact does the workaround have?

- Will Intel or AMD be able to fix this in microcode (by making it do the right thing if an external interrupt or NMI arrives)?

kogepathic · on Dec 7, 2015

> Will Intel or AMD be able to fix this in microcode (by making it do the right thing if an external interrupt or NMI arrives)?

What I want to know is: does this affect other hypervisors as well? If this is a bug related to the CPU, why haven't we heard from KVM, VMware, etc. about it?

I can't believe Xen basically just said "run PVM or get pwned"

yuhong · on Dec 7, 2015

If the hypervisor already intercepts #DB and #AC, they are not affected.

MS has also released a fix: https://technet.microsoft.com/en-us/library/security/3108638

KVM: https://lkml.org/lkml/2015/11/10/214
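For anyone wondering what "intercepts #DB and #AC" amounts to: on Intel VMX the hypervisor keeps a 32-bit exception bitmap, and a set bit n makes exception vector n cause a VM exit instead of being delivered inside the guest. Below is a rough sketch of that bit arithmetic only; the helper names are hypothetical, and this is not code from the Xen or KVM patches.

```python
# Sketch of a VMX-style exception bitmap: bit n set => vector n traps to
# the hypervisor. Helper names are hypothetical, for illustration only.
DB_VECTOR = 1    # #DB debug exception
AC_VECTOR = 17   # #AC alignment check


def intercept_mask(vectors):
    """Build a 32-bit bitmap with one bit set per intercepted vector."""
    mask = 0
    for vector in vectors:
        if not 0 <= vector < 32:
            raise ValueError(f"vector {vector} does not fit in a 32-bit bitmap")
        mask |= 1 << vector
    return mask


def is_intercepted(bitmap, vector):
    """True if the given exception vector would cause a VM exit."""
    return bool(bitmap & (1 << vector))


# A hypervisor that sets both bits gets control back on every #DB and #AC.
XSA156_MASK = intercept_mask([DB_VECTOR, AC_VECTOR])
```

With both bits set the hypervisor sees every #DB and #AC and can re-inject them itself, so the guest never gets the chance to wedge the CPU in the delivery loop.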

lsc · on Dec 7, 2015

>I can't believe Xen basically just said "run PVM or get pwned"

They didn't say that. Scroll down to the "RESOLUTION" section: they include a patch that presumably solves the problem in HVM mode. They are just mentioning (as they should) that if you are running in PV mode, this particular problem doesn't apply.

joosters · on Dec 7, 2015

Well, the worst case is a hardware lockup, so it's a DoS rather than a data-theft exploit. It's still a massive issue, but cloud companies like Amazon could stamp out customers who abuse it.

OTOH, if malware starts to deliberately trigger this, everyone loses.

yuhong · on Dec 7, 2015

#DB is usually only triggered when debugging, and #AC is even rarer. Most likely it would be fixed by the microcode triggering another exception instead, with the last resort being a triple fault obviously.

nnx · on Dec 7, 2015

This looks bad.

Did AWS comment on this yet?

To my limited understanding of the advisory, Xen's recommended mitigation would be for AWS to "convert" all EC2 HVM instances to PVM?

Is that even possible?

yuhong · on Dec 7, 2015

They most likely already patched it.