
How does two-way isolation work? How do you prevent the host kernel (which presumably has full control of the hardware) from inspecting the guest VM?


It looks like the host kernel is not in full control – there is an EL2-level hypervisor, pKVM [1], that is actually the highest-privilege domain. This is pretty similar to the Xen architecture [2], where the dom0 Linux OS in charge of managing the machine runs as a guest of the hypervisor.

1. https://source.android.com/docs/core/virtualization/architec... 2. https://wiki.xenproject.org/wiki/Xen_Project_Software_Overvi...


Commonly known as the type 1 hypervisor architecture, as opposed to type 2 hypervisors, which run as OS services.

Ironically the revenge of microkernels, as most cloud workloads run on type 1 hypervisors.


No, KVM is also a type 1 hypervisor but it doesn't attempt (with the exception of pKVM and of hardware protection features like SEV, neither of which is routinely used by cloud workloads) to protect the guest from a malicious host.


KVM is a type 2 hypervisor, as the "Dom 0" kernel has full HW access. Other guests are obviously isolated as configured, and appear to userspace like special processes.

It gets a bit blurry on AArch64 with and without VHE (Virtualization Host Extensions): without VHE (< ARMv8.1) the kernel runs in EL1 ("kernel mode") most of the time and escalates to EL2 ("hypervisor mode") only when needed, but with VHE it runs at EL2 all the time. (ref. https://lwn.net/Articles/650524/)


No, "type 2" is defined by Goldberg's thesis as "The VMM runs on an extended host [53,75], under the host operating system", where:

* VMM is treated as synonymous with hypervisor

* "Extended host" is defined as "A pseudo-machine [99], also called an extended machine [53] or a user machine [75], is a composite machine produced through a combination of hardware and software, in which the machine's apparent architecture has been changed slightly to make the machine more convenient to use. Typically these architectural changes have taken the form of removing I/O channels and devices, and adding system calls to perform I/O and other operations"

In other words, type 1 ("bare machine hypervisor") runs in supervisor mode and type 2 runs in user mode. QEMU running in dynamic binary translation mode is a type 2 VMM.

KVM runs on a bare machine, but it delegates some services to a less privileged component such as QEMU or crosvm or Firecracker. This is not a type 2 hypervisor, it is a type 1 hypervisor that follows security principles such as privilege separation.


Where in my comment did I refer explicitly to KVM's feature set, or claim that it is used by cloud vendors?


KVM is pretty much the only hypervisor that cloud vendors use these days.

So it's true that "most cloud workloads run on type 1 hypervisors" (KVM is one) but not that most cloud vendors/workloads run on microkernel-like hypervisors, with the exception of Azure.


You definitely didn't understand my comment.


Then can you explain how cloud workloads are the revenge of the microkernel, given that there is exactly one major cloud provider that uses a microkernel-like hypervisor?


By not running monolithic kernels on top of bare metal but virtualized, or even better with nested virtualization, thus throwing out the door all the supposed performance advantages regarding context switching from the usual monolithic vs. microkernel flamewar discussions.

Additionally, taking it to the next level, they run endless amounts of container workloads on top.


I don't know about Android, but AMD CPUs support encrypting regions of physical memory with different keys, each accessible only to one particular running VM and not to the host:

AMD Secure Encrypted Virtualization (SEV)

https://www.amd.com/en/developer/sev.html


Does every memory read/write have to go through decryption/encryption or just the paging mechanism?


The architecture pattern is similar to Bromium/HP AX + Type 2 μXen on x86, https://www.youtube.com/watch?v=bNVe2y34dnM (2018), which ships on HP business PCs.

Bare metal runs a tiny L0 hypervisor making use of hardware support for nested virtualization. In turn, the L0 can run an L1 hypervisor, e.g. KVM or "host" OS, or minimal L1 VMs that are peers to the L1 "host"-guest of L0.

Google pKVM-for-Arm tech talk (2022), hopefully x86 will follow, https://www.youtube.com/watch?v=9npebeVFbFw


You can inspect their hypervisor code and verify that the host kernel cannot access the VM after creation, but if you are running as root then you can obviously inspect whatever process is under host/hypervisor control.


You make the various hardware modules security-context aware. You then give the host a separate security context from the guests. You need a trusted hypervisor to bootstrap it.


Protected KVM on Arm64: A Technical Deep Dive, Quentin Perret (KVM Forum 2022), https://www.youtube.com/watch?v=9npebeVFbFw (first 5m)


It must be relying on a TPM somehow, right? That isn't possible with any normal software VM.


This eschews hardware-based TEE (like TrustZone or TPM) in favor of hardware support for nested virtualization, plus open-source L0 hypervisor code.

In the best case future, this will offer security properties based on a small OSS attack surface, rather than black box TEE firmware.



