Heh, 10 years ago I gave a presentation about how easily folks used to x86 can trip up when dealing with ARM's weaker memory model. My demonstration then was with a naive implementation of Peterson's algorithm.[1]
I have a feeling that we will see a sharp rise of stories like this, now that ARM finds itself in more places which were previously mostly occupied by x86, and all the subtle race conditions that x86's memory model forgave actually start failing, in equally subtle ways.
[1] The conclusion for this particular audience was: Don't try to avoid synchronization primitives, or even invent your own. They were neither systems-level nor high-performance programmers, so they had that luxury.
You made me wonder, because I definitely remember using Peterson's Algorithm, so I went back to my slides and turns out: I first showed the problem with x86, then indeed added an MFENCE at the right place, and then showed how that was not enough for ARM. So the point back then was to show how weaker memory models can bite you with the example of x86, and then to show how it can still bite you on ARM with its even weaker model (ARMv7 at that time, and C11 atomics aren't mentioned yet either, but their old OS-specific support is).
Damn, my knowledge of ARM and x86 memory models was limited to "x86 provides a stronger model" but that was it. So much to learn, thanks for the slides.
> Don't try to avoid synchronization primitives, or even invent your own.
Makes me wonder if it's really a good idea in most cases to use, for example, the Rust parking_lot crate, which reimplements mutexes and RW locks. Besides the speed boost for uncontended RW locks, particularly on x64 Linux, what I really like about parking_lot is that a write lock can be downgraded to a read lock without letting go of the lock. But maybe I'd be better off sticking with the tried-and-true OS-provided lock implementations and finding another way to work around the lack of a downgrade option.
Unless you're the maintainer of the parking_lot crate, you're not "inventing your own". And since parking_lot is AFAIK the second most popular implementation of mutexes and RW locks in Rust (the most popular one being obviously the one in the Rust standard library, which wraps the OS-provided lock implementations), you can assume it's well tested.
I guess it's more about priorities: if you want to build a web app, don't reinvent the underlying protocols. If you want to build an embedded system for monitoring room temperature, don't invent your own locks.
But if you want to make a lock, by all means make a lock; just don't go and reinvent the chip architecture...
People who take on the task of writing such a library, and develop it to the point that it's a language-wide standard, usually know what they're doing (or learn on the job :)
The popularity of such a library helps test it thoroughly on many platforms in various conditions, so there's a high chance that bugs will be spotted and fixed.
It's more like "if you don't need to, don't invent your own [x]." People who like to invent [x] are usually smart enough to understand why that warning is there to begin with, and don't tend to argue with it.
not going to argue about core crypto primitives - i have no idea what it takes to be able to contribute meaningfully there.
but I have no formal education. and when I started there were no gatekeepers. I mean there were, the MIT grads got to work on much cooler stuff than everyone else.
since then I've collaborated on chip designs and written a metric shitton of low level software - including microcode and a lot of parallel synchronization primitives. and when I work with new grads (incl phds), I don't expect them to know any of that really - and we teach them. some of them get it and some of them don't.
not only do I find your emphasis on the 'proper training' misguided - I think it's counterproductive. unless you're working with a very rare professor, they aren't really going to have an adequate notion of what new designs are like at all. most of them have had only the most casual experience with industry. your utility to me as a future systems programmer has a lot more to do with your general level of talent and your interest in the topic than whether or not someone made you do dining philosophers.
I doubt that. The number of ARM processors in operation (as opposed to ever shipped) is far greater than the number of x86 processors. These stories will become more common, but we certainly won't see a "sharp increase".
This sort of bug only happens when running a multithreaded program (with shared memory) on a multicore processor.
You do need both for the problem to happen: Without shared memory, there’s nothing to exploit. And with a single core only, you get time-sliced multithreading, which orders all operations.
My point is, that combination was a lot rarer in ARM land before people started doing serious server or desktop computing with those chips.
Of course. Any such flaws in the Linux kernel or any library used by Android should have been found by now, for example. But the number of ARM processors running developer/server/desktop stacks has been tiny until recently. In my experience, quite a lot of desktop Linux software fails to even build on non-x86_64 machines.
Are you kidding? Arm computers are by far the most common over the past 10 years. Computers are everywhere and servers and home computers account for at most 10% of the market for cpus and microcontrollers.
The dominant Arm core in the world is a Cortex-M (or Cortex-R), which are single-core. They are, 99% of the time, on a die with less than 512K of SRAM, and run an RTOS or bare metal.
These outnumber x86+Cortex-A by probably a factor of 1,000.
That's a good question. I guess we would need to compare all the arm multi-core in smartphones & some laptops, vs all the intel laptops/desktops/servers. Hmmm... tough one.
Virtually all modern (last five years, for the low end) smartphones have multicore chips. It's estimated that there are around _7 billion_ smartphones worldwide, with another 1.5bn sold every year. So I don't think it's a very tough one.
Like quantum physics, memory ordering is deeply unintuitive (on platforms like ARM). Unlike quantum physics, which is an unfortunate immutable fact of the universe, we got ourselves into this mess and we have no one to blame but ourselves for it.
I'm only somewhat joking. People need to understand these memory models if they intend to write atomic operations in their software, even if they aren't currently targeting ARM platforms. In this era, it's absurdly easy to switch an LLVM-based compiler to target aarch64, and it will happen for plenty of software that was written without ever considering the differences in atomic behavior on this platform.
Memory ordering gets somewhat easier after you understand that flat memory shared by execution units is a leaky abstraction, desperately patched over decades by layers and layers of hardware and software. Memory ordering is one way to represent message passing and synchronization between different cores and RAM. This is why I think that "lock-free algorithms" is a misnomer: you still have synchronization, but you simply rely on hardware for it.
Hardware synchronization has better properties than software locks: it can't deadlock, is reentrant, won't get screwed up by a process holding a lock dying, and is (supposedly) guaranteed to complete in bounded time. I don't think it's unreasonable that the definition of lock-free ("guaranteed system-wide progress") focuses on the high-level behavior of typical software locks even if it ends up calling things that are still locks in some sense "lock-free".
> "lock-free algorithms" is a misnomer, you still have synchronization,
Lock-free doesn't mean that there is no synchronization. It is a way to synchronize memory access between threads from the start. It means that there is no additional locking to protect access to the shared resource - all read access is valid, and of any number of simultaneous write accesses at least one succeeds (which is not true for some other algorithms, like network exponential backoff).
Even on x86 the most common instruction you use is LOCK cmpxchg.
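To make that concrete, here's a minimal sketch of my own (not from the article) of a lock-free increment built on compare-exchange; on x86 the compare_exchange typically compiles down to LOCK CMPXCHG, and when several threads race, at least one CAS per round succeeds, so the system as a whole keeps making progress:

    #include <atomic>

    std::atomic<int> counter{0};

    // Lock-free increment via a CAS loop. No mutex is taken, but the
    // synchronization hasn't disappeared; it has moved into the hardware's
    // atomic compare-and-swap.
    void increment() {
        int expected = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(expected, expected + 1,
                                              std::memory_order_acq_rel,
                                              std::memory_order_relaxed)) {
            // On failure, `expected` is reloaded with the current value; retry.
        }
    }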
That's actually a common misconception. Memory ordering, on the majority of common CPUs, has nothing to do with interprocessor communication or processor-RAM communication. Common memory coherency protocols (i.e. MESI and derivatives) guarantee that all caches have a consistent view of memory.
Usually memory reordering is purely an artifact of the way CPUs access their private L1 cache.
For the record, this is false. It is conflating memory coherency with consistency.
Nearly everything in a modern processor is a source of reordering, from branch prediction to basically everything in the OoO backend. Any time you leave the core, there's reordering happening in the network. And yes, that includes caches, which involve a heavy amount of inter-core communication. When you issue two successive loads to different cache lines, which one is going to return first?
The OoO backend itself manages hazards and ensures that ld/st instructions are retired in the correct order to maintain the processor's memory consistency model. Software can build on top of that, e.g. with fences, to impose stricter consistency models.
Sorry, what is false? You seem to be agreeing with me?
edit: to clarify, I claim that the coherency layer (i.e. the L1 cache and beyond) does not introduce any reordering issues, at least for common CPUs using variants of MESI.
In practice you want memory reordering to be a thing because that's what allows you to reorder instructions that touch memory (both at compile time, and also at runtime by the processor), which is what enables a large part of the latency hiding that's going on.
There are systems like the PS3 SPEs and DSPs where you only have normal access to local on-chip memory and have to explicitly initiate DMA to access external memory.
But that's just bad for running general purpose software that can require more memory than available locally since it means you have to do memory cache management in software which is going to be much slower than letting the hardware do it.
The backend is very dataflow-like, yes. But the dependencies are only tracked within a tight instruction window, retirement (esp. of stores) is ordered to implement the memory ordering constraints of the processor, and, last but not least, the ISA itself is not dataflow.
One piece of friction that hurts here is that the C++/Rust and ARM memory models aren't the same, and the consequences of this are unintuitive - compilers and CPUs can both screw with execution ordering.
People who write in C++ should technically _only_ be concerned with the C++ memory model, but x86 has let them be very lax and undisciplined with std::memory_order_relaxed. ARM has some alluring constructs that don't quite fit with the C++ model, which can tempt you to mix and match memory models for performance. All of this means trouble with atomics.
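A tiny illustration of the kind of thing that bites (my own sketch, not from the article): publishing data behind a flag with memory_order_relaxed. On x86-64 it will almost always appear to work, because plain stores and loads already behave like release/acquire there, but formally it's a data race, and on ARM the consumer really can observe the flag before the payload:

    #include <atomic>

    int payload = 0;
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;
        // Should be memory_order_release: without it, nothing orders the
        // payload store before the flag store on a weakly ordered CPU.
        ready.store(true, std::memory_order_relaxed);
    }

    void consumer() {
        // Should be memory_order_acquire for the same reason.
        while (!ready.load(std::memory_order_relaxed)) { }
        // May legitimately read 0 on ARM; usually "works" on x86-64.
        int seen = payload;
        (void)seen;
    }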
I spent some time trying to figure out why the lock-free read/write implementation is correct under x86, assuming a multiprocessor environment.
My read of the situation was that there's already potential for a double-read / double-write between when the spinlock returns and when the head/tail index is updated.
Turns out that I was missing something: there's only one producer thread, and only one consumer thread. If there were multiple of either, then this code would be more fundamentally broken.
That said: IMO the use of `new` in modern C++ (as is the case in the writer queue) is often a code smell, especially when std::make_unique would work just as well. Using a unique_ptr would obviate the first concern [0] about the copy constructor not being deleted.
(If we used unique_ptr consistently here, we might fix the scary platform-dependent leak in exchange for a likely segfault following a nullptr dereference.)
One other comment: the explanation in [1] is slightly incorrect:
> we receive back Result* pointers from the results queue rq, then wrap them in a std::unique_ptr and jam them into a vector.
We actually receive unique_ptrs from the results queue, then because, um, reasons (probably that we forgot that we made this a unique_ptr), we're wrapping them in another unique_ptr, which works because we're passing a temporary (well, prvalue in C++17) to unique_ptr's constructor -- while that looks like it might invoke the deleted copy-constructor, it's actually an instance of guaranteed copy elision. Also a bit weird to see, but not an issue of correctness.
Whenever you use 'new', you have to decide what is going to 'own' the newly allocated thing. You'll also have to remember to call 'delete' somewhere.
Using unique_ptr/make_unique() or shared_ptr/make_shared() automates lifetime management (obviates the need for a manual 'delete') and makes the ownership policy explicit. They also have appropriately defined copying behavior. For example:
struct Foo {
    // lots of stuff here ...
};

struct A {
    Foo* f = new Foo;
    ~A() { delete f; }
};

struct B {
    std::unique_ptr<Foo> f = std::make_unique<Foo>();
    // no need to define a dtor; the default dtor is fine
};
For the destructor and the default constructor, compilers will generate basically identical code for both A and B above. If you try to copy a B, the compiler won't let you because unique_ptr isn't copyable. However it won't stop you from copying an A, even though as written (using the default copy ctor) that's almost certainly a mistake and will likely result in a double free in ~A().
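Continuing the example above (a hypothetical snippet, assuming <memory> and <utility> are included), the difference shows up as soon as a copy happens:

    void broken() {
        A a;
        A b = a;            // compiles: default copy ctor copies the raw pointer
    }                       // both dtors delete the same Foo: double free

    void caught_at_compile_time() {
        B a;
        // B b = a;         // error: unique_ptr's copy constructor is deleted
        B b = std::move(a); // ownership transfer has to be spelled out
    }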
Exactly as usefulcat points out: with modern C++, the bulk of object lifetimes should be handled with unique_ptr and/or directly on the stack, so your destructors are automatically called for you (reducing the risk of double frees or memory leaks).
unique_ptr forces you to think about your dependencies and when objects can / should be cleaned up.
It's widely accepted that you should almost always use some smart pointer like unique_ptr or shared_ptr, and even sometimes the unpopular weak_ptr, depending on your use case.
Either I'm not understanding something that I thought I understood very well, or TFA's authors don't understand something that they think they understand very well.
Their code is unsafe even on x86. You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.
Their attempt to use "volatile" instead of memory barriers is not appropriate. It could easily cause problems on x86 platforms in just the same way that it could on ARM. "volatile" does not mean what you think it means; if you're using it for anything other than interacting with hardware registers in a device driver, you're almost certainly using it incorrectly.
You must use the correct memory barriers to protect the read/write of what they call "head" and "tail". Without them, the code is just wrong, no matter what the platform.
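For the record, here is roughly what the fixed version has to look like (a minimal sketch of my own, not the article's code): the release store on each index pairs with the acquire load on the other side, so the payload access is ordered against the index update on every platform, x86 or ARM:

    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t N>
    class SpscQueue {
        T buf_[N];
        std::atomic<std::size_t> head_{0};  // only written by the consumer
        std::atomic<std::size_t> tail_{0};  // only written by the producer

    public:
        bool push(const T& v) {  // producer thread only
            std::size_t t = tail_.load(std::memory_order_relaxed);
            std::size_t next = (t + 1) % N;
            if (next == head_.load(std::memory_order_acquire))
                return false;                              // full
            buf_[t] = v;                                   // write payload first...
            tail_.store(next, std::memory_order_release);  // ...then publish it
            return true;
        }

        bool pop(T& out) {  // consumer thread only
            std::size_t h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire))
                return false;                              // empty
            out = buf_[h];                                 // read payload first...
            head_.store((h + 1) % N, std::memory_order_release);  // ...then free the slot
            return true;
        }
    };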
> "volatile" does not mean what you think it means; if you're using it for anything other than interacting with hardware registers in a device driver, you're almost certainly using it incorrectly.
Another "correct" use of volatile is a hack to prevent compilers from optimizing away certain code. It's pretty rare to need that and often you can just use a lower optimization level (like the usual -O2) but sometimes you need -O3 / -Ofast or something and a strategic volatile type def to keep everything working.
A classic example is Kahan summation algorithim. At -O2 it's fine. At -O3 or higher it silently defeats the algorithm while appearing to work (you get a sum but without the error compensation). Defining the working vars as volatile makes it work again. This is noted in the wikipedia pseudocode with the comment "// Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers!"
I can’t imagine it’s an O2 vs O3 thing unless a compiler enables “fast-math” optimization to allow associativity. Neither clang nor GCC do this (neither does MSVC I think) - optimization levels never silently turn off IEEE754 floating point. I don’t know about ICC but it sounds like they stupidly enable fast math by default to try to win at benchmarks.
Do you have anything to actually support this statement or did you just assume “overly aggressive optimizing compilers” and “O3” are somehow linked?
Generally optimization levels may find more opportunities to exploit UB, but they do not change the semantics of the language, and all languages I’m familiar with define floating point as a non-associative operation because it’s not when you’re working with finite precision.
TLDR: Don’t use volatile unless you really know what you’re doing, and unless you know C/C++ really well, you probably do not. If anyone tells you to throw in a volatile to “make things work”, it’s most likely cargo curling bad advice (not always, but probably).
There is some amount of truth in what the parent is saying. Ages ago, when x86 only had x87 FP, gcc would program the FPU to use 80-bit precision even when dealing with doubles. The excess precision meant that GCC could not implement IEEE math correctly even without fast-math. Forcing the storing of intermediate values into memory via volatile variables was a partial solution to this problem.
MSVC configures the FPU to use 64-bit precision, which means that doubles work fine, but it has no 80-bit long double and floats still suffer from excess precision.
Kind of, but that still shouldn't have impacted Kahan summation, which only cares about associativity, and extended precision doesn't impact that. They would just end up getting more numerically accurate results on x87.
I did tests on Kahan summation recently on my macbook pro and -O3 defeated the algorithm while -O2 did not. Declaring the below variables as volatile restored error compensation with -O3.
Sounds like a compiler bug to me. Can you file a bug to clang with a reduced standalone test (or I can do it for you if you share the standalone test).
Here is a complete simplified Kahan summation test and indeed it works with -O3 but fails with -Ofast. There must have been something else going on in my real program at -O3. However the original point that 'volatile' can be a workaround for some optimization problems is still valid (you may want the rest of your program to benefit from -Ofast without breaking certain parts).
Changing the three kahan_* variables to volatile makes this work (slowly) with -Ofast.
#include <stdio.h>

int main(int argc, char **argv) {
    int i;
    double sample, sum;
    double kahan_y, kahan_t, kahan_c;
    // initial values
    sum=0.0;
    sample=1.0; // start with "large" value
    for (i=0; i <= 1000000000; i++) { // add 1 large value plus 1 billion small values
        // Kahan summation algorithm
        kahan_y=sample - kahan_c;
        kahan_t=sum + kahan_y;
        kahan_c=(kahan_t - sum) - kahan_y;
        sum=kahan_t;
        // pre-load next small value
        sample=1.0E-20;
    }
    printf("sum: %.15f\n", sum);
}
Correct. `-Ofast`'s claim to fame is that it enables `-ffast-math`, which is why it has huge warning signs around it in the documentation. `-ffast-math` turns on reassociation, which is problematic for Kahan summation. Rather than sprinkling in volatiles, which pessimizes the compiler to no end, I would recommend annotating the problematic function to turn off associativity [1][2].
That way the compiler applies all the optimizations it can but only turns off associative math. This should work on Clang & GCC & be net faster in all cases.
This is what I mean by "If you're sprinkling volatile around, you probably aren't doing what you want" and are just cargo culting bad advice.
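For illustration, a sketch of that per-function annotation (my own example; kahan_sum is a made-up name, and exact pragma/attribute support varies by compiler version, so check your toolchain's documentation):

    #include <cstddef>

    double kahan_sum(const double* x, std::size_t n) {
    // Clang: turn reassociation back off inside this function even under
    // -Ofast/-ffast-math. GCC has a per-function mechanism via
    // __attribute__((optimize("-fno-associative-math"))).
    #if defined(__clang__)
    #pragma clang fp reassociate(off)
    #endif
        double sum = 0.0, c = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double y = x[i] - c;
            double t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        return sum;
    }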
I hope this isn't the actual "real" code, because you've got undefined behavior before you even have to worry about the associativity optimizations. There's an uninitialized read of 'kahan_c' on the first loop iteration.
I think their point is you only need compiler barriers, not actual barrier instructions, on x86. volatile in practice has been the de facto way to get the effect of a compiler memory barrier for a long time, even though it's not the best way to do it nowadays. The original purpose of it is literally preventing the compiler from getting rid of loads and stores and reordering them, which is exactly what is needed when implementing a lockless FIFO. As long as all the stores and loads (including the actual FIFO payload) are volatile, it will work (volatile loads and stores are guaranteed to not be reordered with each other). After that, the x86 guarantees about not reordering are very strong.
Really the best argument against volatile for this kind of thing is the opposite of your point: volatile is too strong. It prevents more reordering than you actually want. Acquire/release semantics are less strong and give the compiler more flexibility.
volatile does not work as a memory barrier, either in theory or in practice. On gcc, for example, you need an explicit additional compiler barrier before a volatile store and after a volatile load to implement the expected release/acquire semantics on x86.
See [1] for example the implementation of smp_store_release and smp_load_acquire in the linux kernel (barrier() is just a compiler barrier and {READ,WRITE}_ONCE are a cast to volatile).
Volatile only prevents reordering of volatile statements (and IO), not all load and stores.
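As a rough sketch of the shape being described (simplified, my own names, not the kernel's actual code; the asm statement is the GCC/Clang compiler-barrier idiom):

    // A pure compiler barrier: emits no instruction, but stops the compiler
    // from moving loads/stores across it.
    #define COMPILER_BARRIER() asm volatile("" ::: "memory")

    // x86-only release store / acquire load built from volatile accesses.
    // The hardware already provides the ordering; the barriers keep the
    // compiler from undoing it. On ARM this is NOT sufficient.
    inline void store_release_x86(volatile int* p, int v) {
        COMPILER_BARRIER();   // keep earlier stores above the publish
        *p = v;
    }

    inline int load_acquire_x86(const volatile int* p) {
        int v = *p;
        COMPILER_BARRIER();   // keep later loads below the read
        return v;
    }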
The point is that the bug is unexploitable on x86 because although the source code may have a bug, on x86 it gets compiled to machine code that does not. That's the thing with undefined behaviour, sometimes it does work exactly as you expect, which can make it so tricky to nail down.
Right. The challenge is written incorrectly on purpose, otherwise the code isn't vulnerable. The use of volatile is a bit of a misdirection for the CTF players, since you're right that it's a common misconception that volatile acts like a barrier.
> You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.
I am not sure about this. From my understanding, on x86, given the absence of compiler reordering, processor reordering should not cause a problem for a single-reader-single-writer FIFO. Normally I just use atomics but I think in this specific instance it should still be okay anyways. Obviously it will not work on ARM.
From my testing if you compile the code on x86 with clang or gcc, the resulting binary is not vulnerable.
Another place it's meaningful to use `volatile` is in benchmarking and testing: to either ensure that a block of code is run despite not having any side effects, or to ensure that a block of code that should not be run is still compiled and emitted to binary.
But yes, `volatile` for what should be atomics is a clear code smell. I made quite a loud noise when I read "the code quality looks excellent" in the article.
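For example (a made-up micro-benchmark sketch; `sink` is just a name I picked): without the volatile store at the end, the optimizer is entitled to delete the whole loop and report a time of roughly zero:

    #include <chrono>
    #include <cstdio>

    volatile unsigned long long sink;  // the one legitimate-ish volatile here

    int main() {
        auto t0 = std::chrono::steady_clock::now();
        unsigned long long acc = 0;
        for (unsigned long long i = 0; i < 100000000ULL; ++i)
            acc += i * i;              // work we want to time
        sink = acc;                    // volatile store: the result must be materialized
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%lld ns\n", (long long)
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    }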
Lock-free programming is really tough. There are really only a few patterns that work (e.g. Treiber stack). Trying to invent a new lock-free algorithm, as this vulnerable code demonstrates, almost always ends in tears.
IMO lock-free multi-producer (MP) or multi-consumer (MC) algorithms are harder to get right than SPSC structures (atomics for shared memory, queues for messaging, triple buffers for tear-free shared memory). But even SPSC algorithms can be tricky; I've found the same (theoretical) ordering error in three separate Rust implementations of triple buffering (one of them mine), written by people who had already learned the ordering rules (which I caught with Loom). And initially learning to reason about memory ordering is a major upfront challenge too.
I particularly like lock-free (wait-free?) SPSC queues because they're (relatively) easy to get right, and are extremely useful for buffering in embedded systems. I end up with something like this on almost every project:
One side of the queue is a peripheral like a serial port that needs to be fed/drained like clockwork to avoid losing data or glitching (e.g. via interrupts or DMA), and the other side is usually software running on the main thread, that wants to be able to work at its own pace and also go to sleep sometimes.
An SPSC queue fits this use-case nicely. James Munns has a fancy one written in Rust [1], and I have a ~100 line C template [2].
This isn't a new lock-free algorithm. Single-reader, single-writer FIFOs are one of the oldest approaches around.
They have to be tweaked (by using memory barriers) when ordering isn't guaranteed. TFA is about an exploit based on code that hasn't added the required memory barriers.
For those interested in memory ordering, I have a few posts on my blog where I build a simulator capable of understanding reorderings and analyze examples with it:
Anyways, https://github.com/tokio-rs/loom is used by any serious library doing atomic ops/synchronization and it blew me away with how fast it can catch most bugs like this.
Rust doesn't catch memory ordering errors, which can result in behavioral bugs in safe Rust and data races and memory unsafety in unsafe Rust. But Loom is an excellent tool for catching ordering errors, though its UnsafeCell API differs from std's (and worse yet, some people report Loom returns false positives/negatives in some cases: https://github.com/tokio-rs/loom/issues/180, possibly https://github.com/tokio-rs/loom/issues/166).
> which can result in behavioral bugs in safe Rust
For example, Rust doesn't have any way to know that your chosen lock-free algorithm relies on Acquire-release semantics to perform as intended, and so if you write safe Rust to implement it with Relaxed ordering, it will compile, and run, and on x86-64 it will even work just fine because the cheap behaviour on x86-64 has Acquire-release semantics anyway. But on ARM your program doesn't work because ARM really does have a Relaxed mode and without Acquire-release what you've got is not the clever lock-free algorithm you intended after all.
However, if you don't even understand what Ordering is, and just try to implement the naive algorithm in Rust without Atomic operations that take an Ordering, Rust won't compile your program at all because it could race. So this way you are at least confronted with the fact that it's time to learn about Ordering if you want to implement this algorithm and if you pick Relaxed you can keep the resulting (safe) mess you made.
It doesn't catch all of them. But data races on plain memory accesses are impossible in safe Rust.
And atomics force you to specify an ordering on every access, which helps both the writer (forced to think about which ordering they need) and reviewer (by communicating intent).
My first-gen Threadripper occasionally deadlocks in futex code within libgomp (the GNU implementation of OpenMP). Eventually I gave up and concluded it was either a hardware bug or a bug that incorrectly relies on atomic behaviour of Intel CPUs. I eventually switched to using clang with its own OpenMP implementation and the problem magically disappeared.
> Nowadays, high-performance processors, like those found in desktops, servers, and phones, are massively out-of-order to exploit instruction-level parallelism as much as possible. They perform all sorts of tricks to improve performance.
Relevant quote from Jim Keller: You run this program a hundred times, it never runs the same way twice. Ever.
A hundred times is not that much except for really cold code paths. It’s probably in the billions if not more and I have to imagine that software level effects typically swamp HW-level effects here. That’s why you see software typically having a performance deviation no greater than ~5-10% unless you’re running microbenchmarks.
The best part is that the original code is not safe even on x86, as the compiler can still reorder non-volatile accesses to the backing_buf around the volatile accesses to head and tail. Compiler barriers before the volatile stores and after the volatile reads are required [1]. It would still be very questionable code, but it would at least have a chance to work on its intended target.
tl;dr: just use std::atomic.
[1] it is of course possible they are actually present in the original code and just omitted from the explanation for brevity
The purpose of abolishing volatile isn't so much to reinforce that it's not intended for this sort of threading nonsense (indeed on Windows the MSVC guarantees mean it almost is intended for this sort of nonsense) but to make it explicit that "volatile variables" were never really a thing anyway by abolishing the volatile qualifier on variables.
And sure enough that's equivalent to what is proposed for C++ although not in just this one paper.
With "volatile variables" you can use compound assignment operators on the variable. What does that even mean? Nothing. It means nothing, it's gibberish, but you can do it and people do. They presumably thought it meant something and since it doesn't they were wrong. So, deprecate this and maybe they'll go read up on the subject.
You can also in C++ declare things that clearly aren't in the least bit volatile, as volatile anyway. C++ has volatile member variables, volatile member functions, volatile parameters... Any code that seems to rely on this probably doesn't do what the people who wrote it thought it does, run away.
>With "volatile variables" you can use compound assignment operators on the variable. What does that even mean? Nothing.
It means exactly the same thing as on a normal variable, and it boggles the mind that people somehow do not understand that. Given 'volatile int i', 'i++' means the exact same thing as 'i = i + 1'. Does that also not make any sense to you? If it does, can you explain why you believe they are different?
Volatile member functions and parameters make no sense, but volatile member variables most certainly do. And there is considerable pushback in the C++ community because this is a significant loss of compatibility with various C-headers used frequently in embedded applications. I wouldn't be surprised if the deprecated features will be reinstated in the language in the end.
> It means exactly the same thing as on a normal variable, and it boggles the mind that people somehow not understand that.
If you just want normal variables, write a normal variable; JF Bastien's change doesn't have any consequences for your normal variables.
> Given 'volatile int i', 'i++' means the exact same thing as 'i = i + 1'
No. That's the post-increment operator so i++ has the same value as i did before the increment, whereas i = i + 1 has the value of i after the increment.
And you might say "Oh, but I never use those values". Good. J.F. Bastien's change forbids that for volatiles too, we're on the same page.
What you apparently want is to do two volatile operations, one read and one write, sandwiching an arbitrary integer operation. You can do that explicitly of course, Bastien's change doesn't alter that - but when you write it out, doubt appears on the horizon. If these are two operations, can't I race with somebody else such that they modify this thing before I write my modified version back and then I overwrite their changes?
And that doubt should be there. But what we know from asking people is that the compound operators falsely reassure them. That's why they were deprecated and why it's sad that there's enthusiasm to bring them back.
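Spelled out (a tiny sketch of my own), the compound form and the explicit form compile to the same pair of accesses, and neither is atomic:

    volatile int flags;

    void bump_compound() {
        flags += 1;        // looks like one step, but it is a read + add + write
    }

    void bump_explicit() {
        int tmp = flags;   // volatile read
        tmp = tmp + 1;     // ordinary arithmetic on the temporary
        flags = tmp;       // volatile write: anything that changed `flags`
                           // in between is silently overwritten
    }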
It's true that the relevant committee is now likely to propose this be changed in C++23 (or before, as an erratum)
But the face saving is on the part of embedded C++ proponents who'd rather never fix the C++ language than risk that the mundane C code cheap vendors actually ship can't be counted as "C++ support" because it's no longer valid C++.
That's the core of the argument in P2327. Not that this isn't a bad idea (it's obviously a bad idea) or that we shouldn't do better (alternatives to C++ already do better) but that this is the status quo and C++ can't expect to improve upon that. A sad state of affairs.
P2327's dismissal of the problem caused by UART1->UCSR0B |= (1 << UCSZ01); comes down to the usual Real Programmer stance: surely every single embedded programmer will arrange "by construction" that the problem can't happen. No actual examples were examined to see if embedded developers reliably do that, which seems unlikely - we're just invited to imagine that they're all inerrant.
Not sure what will actually happen, but you could easily allow volatile variables if they have extern "C" linkage, for compatibility with C headers, while deprecating it elsewhere.
That's not really what extern "C" does though. It doesn't change the language rules or the parsing or anything, it only changes how symbols are represented in the object file. Extern "C" means they are mangled using C rules (that mostly involves adding an underscore); otherwise it's gonna be C++ rules (and loads more information needs to be encoded).
Some people argue for a similar mechanism, something like 'language "C++20" { .. }', that would allow a program to opt in to changes that would otherwise be breaking changes; mostly new keywords. However, changing actual language rules in such a block would be tricky, to say the least.
To draw together the two answers here to the original question.
1) Emulating an ISA includes emulating its memory model. As saagarjha says, this means that Rosetta 2 must (and does) correctly implement total store ordering.
2) There are various ways to implement this. For emulators that include a binary translation layer (that is, that translate x86 opcodes into a sequence of ARM opcodes), one route is to generate the appropriate ARM memory barriers as part of the translation. Even with optimization to reduce the number of necessary barriers, though, this is expensive. Instead, as mmwelt mentions, Apple took an unusual route here. The Apple Silicon MMU can be configured on a per-page basis to use either the relaxed ARM memory model or the TSO x86 memory model. There is a performance cost at the hardware level for using TSO, and there is a cost in silicon area for supporting both; but from the point of view of Rosetta 2, all it has to do is mark x86-accessed pages as TSO and the hardware takes care of the details, no software memory barriers needed.
Are you sure the translators don't insert code necessary to maintain ordering? I would be shocked if most threaded code works when you throw out the x86 memory model. Managed runtimes like .NET definitely generate code for each target designed to maintain the correct memory model.
> You can also select multi-core settings, as shown here... These settings change the number of memory barriers used to synchronize memory accesses between cores in apps during emulation. Fast is the default mode, but the strict and very strict options will increase the number of barriers. This slows down the app, but reduces the risk of app errors. The single-core option removes all barriers but forces all app threads to run on a single core.
zamadatix interprets this as Microsoft saying that by default, Windows on ARM runs x86 apps without x86 TSO, and turns on extra memory barriers using per-app compatibility settings. But if an app needs TSO and isn't in Windows's database, it will crash or silently corrupt data.
Sure, but Windows on ARM has to run on many ARM processors, not a specific one designed by MS. They could detect if the processor has non-standard TSO support and use that when running an x86 app, but they still have to do something to run the x86 app on a standard ARM processor.
An emulator would have zero issues; if it's a direct translation of the assembly (not an emulator), it'd need either hardware support - e.g. Apple's chips - or memory barriers.
The differences between ARM and x86 have been known for 15+ years; there is nothing new about this. Also, concurrency support is one of the major benefits of languages with a proper memory model - Java started it with the JMM [0]
The differences are very well known; it's just still an open problem how to run code written for a stronger memory model on systems with a weaker one at as high a performance as possible without explicit hardware support (like Apple's choice of a TSO config bit). Your compiled binary for TSO has erased any LoadLoad, LoadStore, and StoreStore barriers, and the emulator has to divine them. The heuristics there are still fraught with peril.
The JVM absolutely did some great work walking this path, both in defining a memory model in the first place, and supporting that model on weak and strong hardware memory models, but the JMM was specifically designed to be able to run cleanly on WMO platforms to begin with (early Sparc), so they don't face a lot of the same problems discussed here.
Any emulator that wants to be remotely performance competitive will do dynamic translation (i.e JIT). In fact ahead-of-time translation is not really feasible.
Memory models and JVM are not really relevant when discussing running binaries for a different architecture.
The memory models are relevant as the translation/JIT/whatever has to take them in consideration. TSO is well known and it's also well known how arm weaker memory model needs memory barriers to emulate TSO.
If there is a JIT, I'd expect to be able to add read barriers on memory locations allocated by another thread - including allocating bits in the pointers and masking them off on each dereference. If any block appears to be shared, the code that allocated it would need to be recompiled with store-store barriers; the reading part would need load-load, and so on. There are quite a few ways to deal with the case, aside from the obvious one - make the hardware compatible.
Even if it's not an easy feat to make up for the stronger memory model, correctness should be a prime goal of an 'emulator'.
It is relevant as an example of how to write a JIT for a specific memory model that will run on a different architecture with a different memory model. AKA it is a known issue that has been successfully dealt with for quite a while now.
Ah well, back in the day, prior to the JMM, there was no notion of memory models at all, and no language had one; hence, I referenced the original paper that started it all. The point was that it happened a long time ago, and there is nothing new about the current case.