What I don't understand is: Why is this close coexistence with untrusted code necessary or even beneficial?
People run code from different security domains on the same L1/L2 cache. Why is that beneficial? Compared to, say, having the kernel flush the cache entirely whenever a context switch goes from one (uid/chroot/...) tuple to another, and only running processes/threads with the same tuple on the same cache at the same time?
User-mode processes call into privileged code often (kernel APIs, for example); if you flush the cache on every context switch, processing becomes orders of magnitude slower.
You can absolutely do this today. It is just slow as heck.
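To put a very rough number on "slow as heck", here is a minimal sketch that times walking a buffer with a warm cache versus after flushing every line. The buffer size and the use of clflush as a stand-in for a kernel-driven flush are assumptions; the ratio you see will vary by machine.

    /* cache_cold.c - rough cost of touching a buffer with a warm vs. flushed cache.
     * Build: gcc -O2 cache_cold.c -o cache_cold   (x86-64 with clflush assumed) */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <x86intrin.h>   /* __rdtsc, _mm_clflush, _mm_mfence */

    #define BUF_SIZE (256 * 1024)   /* roughly L2-sized working set (assumption) */
    #define LINE     64             /* typical cache line size */

    static uint64_t sum_buffer(volatile uint8_t *buf) {
        uint64_t start = __rdtsc();
        uint64_t acc = 0;
        for (size_t i = 0; i < BUF_SIZE; i += LINE)
            acc += buf[i];
        _mm_mfence();
        uint64_t cycles = __rdtsc() - start;
        if (acc == (uint64_t)-1) printf("impossible\n");  /* keep acc alive */
        return cycles;
    }

    int main(void) {
        volatile uint8_t *buf = malloc(BUF_SIZE);
        for (size_t i = 0; i < BUF_SIZE; i++) buf[i] = (uint8_t)i;

        sum_buffer(buf);                      /* warm the cache */
        uint64_t warm = sum_buffer(buf);      /* measure with warm cache */

        for (size_t i = 0; i < BUF_SIZE; i += LINE)
            _mm_clflush((const void *)&buf[i]);   /* simulate a full flush */
        _mm_mfence();
        uint64_t cold = sum_buffer(buf);      /* measure with cold cache */

        printf("warm: %llu cycles, cold: %llu cycles (~%.1fx slower)\n",
               (unsigned long long)warm, (unsigned long long)cold,
               (double)cold / (double)warm);
        return 0;
    }

Now imagine paying something like the cold-cache cost on every syscall return.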
> User-mode processes call into privileged code often (kernel APIs, for example); if you flush the cache on every context switch, processing becomes orders of magnitude slower.
Could there be a separate OS cache and a regular user-code cache? Would it also speed things up (if cheap enough)?
Sure, but you'd require additional hardware and there's the backwards compatibility question (i.e. it has to work on both the traditional shared cache model AND the new split cache model, or you'd wind up with two versions of x86-64 that were incompatible).
It could work in a backwards-compatible manner, like a lot of things do.
That could technically even be handled transparently by hardwiring separate caches for Ring 3 and Ring 0 use, and then introducing an additional opcode to explicitly allow certain addresses to be placed into the Ring 3 cache when they go into the Ring 0 cache.
PCID already does per-process caching, somewhat, though it's a bit neutered atm.
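For what it's worth, whether the hardware even advertises PCID (and the companion INVPCID instruction) is easy to check from user space. A small sketch using GCC/Clang's cpuid.h; whether the OS actually makes use of the feature is a separate question:

    /* pcid_check.c - detect whether the CPU advertises PCID and INVPCID.
     * Build: gcc -O2 pcid_check.c -o pcid_check   (GCC/Clang on x86-64 assumed) */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 1: ECX bit 17 = PCID support */
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            printf("CPUID leaf 1 not supported\n");
            return 1;
        }
        printf("PCID:    %s\n", (ecx & (1u << 17)) ? "yes" : "no");

        /* CPUID leaf 7, subleaf 0: EBX bit 10 = INVPCID instruction */
        if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            printf("INVPCID: %s\n", (ebx & (1u << 10)) ? "yes" : "no");

        return 0;
    }
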
The good thing is that it only needs to be backwards compatible in the sense that an OS that has no idea about this should at worst have a performance penalty.
The hard point will be the additional hardware. L2 cache is expensive, and L1 even more so.
I've heard that before, but it doesn't seem quite credible once you factor out the performance-critical code that tries to minimise context switches, and the API usage that involves a LAN.
It seems to boil down to performance-critical untrusted code that performs many kernel calls.
> I've heard that before, it doesn't seem quite credible
It doesn't seem credible that completely clearing cache and repopulating it every time a privileged context switch occurs would cause performance degradation?
> It seems to boil down to performance-critical untrusted code that performs many kernel calls.
You're conflating the current branch-prediction hotfixes with what the person above asked about, which was completely clearing the cache when the context switches into a privileged state. Those are two entirely different things.
The current hotfixes try to continue to utilize many performance benefits of speculative execution, while patching out the side-channel attacks from branch prediction. The person above was asking about a complete change in the underlying model we use today, which would promote security at the cost of performance.
Would cause much performance degradation where it matters most.
I run some heavy jobs today. Many CPU seconds each. I don't see that those have to suffer. They can stay married to a CPU core for seconds at a time, uninterrupted, just like today. They perform few system calls, and the ones they do perform never fail any in-kernel check.
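"Staying married to a core" is something you can already ask for explicitly today; a minimal Linux sketch (the core number is an arbitrary assumption, and sched_setaffinity is Linux-specific):

    /* pin_core.c - pin the current process to one core so a long batch job
     * keeps its cache to itself.
     * Build: gcc -O2 pin_core.c -o pin_core   (Linux assumed) */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(2, &mask);                 /* core 2 is an arbitrary example choice */

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        printf("pid %d pinned to core 2; long-running work would go here\n", getpid());
        return 0;
    }
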
The JavaScript that runs in the browser suffers, sure, but that JavaScript has other reasons not to need several seconds of CPU.
BTW, that person was me, and you misunderstand the question. Not into a privileged state; out of one. Such that an unprivileged/untrusted process would see cache contents produced by others.
It slows down everything by default. "Performance-critical" is relative: an app that can serve 100 req/s might not be optimized because that is sufficient, but if clearing the cache on every context switch reduces it to 70 req/s, it suddenly needs expensive man-hours of optimization.
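To make the 100 -> 70 req/s intuition concrete, here is a back-of-envelope sketch; the per-request syscall count and the cache-refill cost are purely illustrative assumptions:

    /* throughput_penalty.c - back-of-envelope: how per-syscall cache refills
     * eat into request throughput. All numbers are illustrative assumptions.
     * Build: gcc -O2 throughput_penalty.c -o throughput_penalty */
    #include <stdio.h>

    int main(void) {
        double base_req_per_s   = 100.0;    /* service time today: 10 ms per request      */
        double syscalls_per_req = 200.0;    /* assumed kernel crossings per request        */
        double refill_us        = 20.0;     /* assumed cost of refilling a flushed cache   */

        double base_ms    = 1000.0 / base_req_per_s;
        double penalty_ms = syscalls_per_req * refill_us / 1000.0;
        double new_req_per_s = 1000.0 / (base_ms + penalty_ms);

        printf("per-request time: %.1f ms -> %.1f ms\n", base_ms, base_ms + penalty_ms);
        printf("throughput:       %.0f req/s -> %.0f req/s\n", base_req_per_s, new_req_per_s);
        return 0;
    }

With those assumed numbers the overhead comes out to about 4 ms per request, taking 100 req/s down to roughly 71 req/s, without the app itself having changed at all.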
If you do that, then on almost every context switch the new process will resume with a cold cache. In the worst case, if you were to send lots of IPC requests to a privileged process then your process could have a cold cache far too often.
Many older generations of ARM (v5 and earlier) used a virtually indexed, virtually tagged (VIVT) cache to reduce latency, but this required an L1 flush on every context switch, and that was incredibly painful!
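To illustrate why: in a VIVT cache the lookup uses only the virtual address, so two processes that reuse the same virtual address alias each other unless the cache is flushed between them. A toy model of just the indexing logic (not real hardware):

    /* vivt_toy.c - toy model of why a VIVT cache must be flushed on context
     * switch: lookups are indexed and tagged by virtual address only, so two
     * processes using the same virtual address collide unless it is flushed.
     * Build: gcc -O2 vivt_toy.c -o vivt_toy */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define SETS 16

    struct line { bool valid; uint32_t vtag; int data; };
    static struct line cache[SETS];

    /* "Physical memory": one private value per process, both mapped at 0x1000 */
    static int phys_mem[2] = { 111, 222 };

    static int vivt_read(int pid, uint32_t vaddr) {
        struct line *l = &cache[(vaddr / 64) % SETS];
        if (l->valid && l->vtag == vaddr)
            return l->data;                 /* hit: note the pid plays no part */
        l->valid = true;
        l->vtag  = vaddr;
        l->data  = phys_mem[pid];           /* miss: fetch via this process's mapping */
        return l->data;
    }

    static void flush_cache(void) {
        for (int i = 0; i < SETS; i++) cache[i].valid = false;
    }

    int main(void) {
        printf("process A reads 0x1000: %d\n", vivt_read(0, 0x1000));
        /* context switch to process B without a flush: */
        printf("process B reads 0x1000: %d  (stale hit: A's data!)\n", vivt_read(1, 0x1000));
        flush_cache();                      /* what ARMv5-era kernels had to do */
        printf("after a flush, B reads:  %d\n", vivt_read(1, 0x1000));
        return 0;
    }
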