
Impact illustration:

> [...] the contents of the entire memory to be read over time, explains Rüegge. “We can trigger the error repeatedly and achieve a readout speed of over 5000 bytes per second.” In the event of an attack, therefore, it is only a matter of time before the information in the entire CPU memory falls into the wrong hands.



Prepare for another dive maneuver in the benchmarks department, I guess.


We need software and hardware to cooperate on this. Specifically, threads from different security contexts shouldn't get assigned to the same core. If we guarantee this, the fences/flushes/other clearing of shared state can be limited to kernel calls and process lifetime events, leaving all the benefits of caching and speculative execution on the table for things actually doing heavy lifting without worrying about side channel leaks.
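As a sketch of what the software half could look like today (my illustration, assuming Linux 5.14+ with CONFIG_SCHED_CORE; not something the article proposes), a process can install a core-scheduling "cookie" so the kernel never co-schedules its threads on an SMT core together with threads from a different security context:

  /* Minimal sketch, assuming Linux 5.14+ with core scheduling enabled. */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/prctl.h>
  #include <linux/prctl.h>

  int main(void) {
      /* Tag this whole thread group with a fresh core-scheduling cookie.
       * The scheduler will then refuse to run these threads on the same
       * SMT core as threads holding a different (or no) cookie. */
      if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
                PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) != 0) {
          perror("PR_SCHED_CORE_CREATE");
          return 1;
      }
      puts("core-scheduling cookie installed");
      return 0;
  }

That only covers the SMT-sibling half; time-sliced sharing of one core still needs the flush-on-kernel-call/context-switch part described above.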


I get you, but devs struggle to configure nginx to serve their overflowing cauldrons of 3rd-party npm modules of witches' incantations. Getting them to securely design and develop security-labelled, cgroup-based micro (nano?) compute services for inferencing text of various security levels is beyond even 95% of coders. I'd posit that it would be a herculean effort even for 1% devs.

Just fix the processors?


It's not a "just" if the fix cripples performance; it's a tradeoff. It is forced to hurt everything everywhere because the processor alone has no mechanism to determine when the mitigation is actually required and when it is not. It is 2025 and security is part of our world; we need to bake it right into how we think about processor/software interaction instead of attempting to bolt it on after the fact. We learned that lesson for internet facing software decades ago. It's about time we learned it here as well.


Is the juice worth the squeeze? Not everything needs Orange Book (DoD 5200.28-STD) Class B1 systems.


how will this prevent JavaScript from leaking my password manager database?


And if not, why did they introduce severe bugs for a tiny performance improvement?


It's not tiny. Speculative execution usually makes code run 10-50% faster, depending on how many branches there are.


Yeah… folks who think this is just some easy-to-avoid thing should go look around and find the processor without branch prediction that they want to use.

On the bright side, they will get to enjoy a much better music scene, because they’ll be visiting the 90’s.


> Does Branch Privilege Injection affect non-Intel CPUs?

> No. Our analysis has not found any issues on the evaluated AMD and ARM systems.


IBM Stretch had branch prediction. Pentium in the early 1990s had it. It's a huge win with any pipelining.


That's a vast underestimate. Putting in lfence before every branch is on the order of 10X slowdown.
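To picture what that looks like (my illustration, not a measurement): a blanket mitigation of that kind means something like an lfence in front of every conditional branch, so a hot loop stops speculating entirely:

  #include <emmintrin.h>   /* _mm_lfence, x86 SSE2 */

  /* Illustrative only: fencing before every branch, the way a blanket
   * "no speculation past branches" mitigation would. Each fence waits for
   * prior instructions to complete before the compare/branch can issue,
   * so the pipeline spends most of its time drained. */
  long sum_positive(const int *v, long n) {
      long sum = 0;
      for (long i = 0; i < n; i++) {
          _mm_lfence();          /* serialize before the data-dependent branch */
          if (v[i] > 0)
              sum += v[i];
      }
      return sum;
  }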


There is of course a slight chicken-and-egg thing here: if there were no (dynamic) branch prediction, we (as in compilers) would emit different code that is faster for non-predicting CPUs (and presumably slower for predicting CPUs). That would mitigate a bit of that 10x.
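A small illustration of what "different code" means here (my own example): on a CPU that can't predict, a compiler would lean on branch-free selects instead of conditional jumps wherever it can:

  /* Branchy version: typically compiles to a compare + conditional jump,
   * which stalls a non-predicting pipeline at every call. */
  int max_branchy(int a, int b) {
      return (a > b) ? a : b;
  }

  /* Branch-free version: the compare is materialized as a mask and used
   * for a select (cmov-style), so the front end has nothing to guess. */
  int max_branchless(int a, int b) {
      int mask = -(a > b);              /* all ones if a > b, else zero */
      return (a & mask) | (b & ~mask);
  }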


A bit. I think we've shown time and time again that letting the compiler do what the CPU is doing doesn't work out, most recently with Itanium.


The issue is with indirect branches. Most branches are direct ones.
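To make that distinction concrete (my example, not from the write-up): a direct branch has its target baked into the instruction, while an indirect branch takes its target from data at run time, and it is the predictor state for the latter that branch-target/privilege injection attacks poison:

  #include <stdio.h>

  static void handler_a(void) { puts("a"); }
  static void handler_b(void) { puts("b"); }

  int main(void) {
      int x = 1;

      /* Direct branch: the target address is fixed in the instruction stream. */
      if (x)
          handler_a();

      /* Indirect branch: the target comes from memory (function pointers,
       * vtables, switch jump tables), so the CPU must predict where it
       * will land. */
      void (*table[])(void) = { handler_a, handler_b };
      table[x]();

      return 0;
  }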


Of course I know that.

But if the fix for this bug (how many security holes have there been now in Intel CPUs? 10?) brings only a couple % performance loss, like most of them so far, how can you even justify that at all? Isn't there a fundamental issue in there?


How much improvement would there still be if we weren't so lazy when it comes to writing software? If we worked to get as much performance out of the machines as possible and avoided useless bloat, instead of just counting on the hardware being "good enough" to handle the slowness with some grace.


A modern processor pipeline is dozens of cycles deep. Without branch prediction, we would need to know the next instruction at all times before beginning to fetch it. So we couldn’t begin fetching anything until the current instruction is decoded and we know it’s not a branch or jump. Even more seriously, if it is a branch, we would need to stall the pipeline and not do anything until the instruction finishes executing and we know whether it’s taken or not (possibly dozens of cycles later, or hundreds if it depends on a memory access). Stalling for so many cycles on every branch is totally incompatible with any kind of modern performance. If you want a processor that works this way, buy a microcontroller.


But branch prediction doesn't necessarily need complicated logic. If I remember correctly (it's been 20 years since I read any papers on it), the simple heuristic "backward relative branches are taken, forward and absolute branches are not" could achieve 70-80% of the performance of the state-of-the-art implementations back then.
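As an illustration (my sketch) of why that static rule does so well: ordinary code layout already matches it, with loop back-edges as backward branches that are taken on almost every iteration and rare error paths as forward branches that are almost never taken:

  #include <stddef.h>

  /* The loop's back-edge is a backward branch: "backwards taken" is right
   * on every iteration except the last. The error check is a forward
   * branch: "forwards not taken" is right except in the rare failure case.
   * __builtin_expect (GCC/Clang) helps the compiler keep that layout. */
  long checksum(const unsigned char *buf, long n) {
      if (__builtin_expect(buf == NULL, 0))   /* forward branch, rarely taken */
          return -1;
      long sum = 0;
      for (long i = 0; i < n; i++)            /* backward branch, usually taken */
          sum += buf[i];
      return sum;
  }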


Do you mean overall or localized to branch prediction? Assuming all of that is true, you're talking about a 20-30% performance hit?


> If you want a processor that works this way, buy a microcontroller.

The ARM Cortex-R5F and Cortex-M7, to name a few, have branch predictors as well, for what it’s worth ;)


You can still have a static branch predictor. That has surprisingly good coverage. I'm not saying this is a great idea, just pointing it out.



