Heh, 10 years ago I gave a presentation about how easily folks used to x86 can trip up when dealing with ARM's weaker memory model. My demonstration then was with a naive implementation of Peterson's algorithm.[1]
I have a feeling that we will see a sharp rise of stories like this, now that ARM finds itself in more places which were previously mostly occupied by x86, and all the subtle race conditions that x86's memory model forgave actually start failing, in equally subtle ways.
[1] The conclusion for this particular audience was: Don't try to avoid synchronization primitives, or even invent your own. They were neither systems-level nor high-performance programmers, so they had that luxury.
You made me wonder, because I definitely remember using Peterson's Algorithm, so I went back to my slides and turns out: I first showed the problem with x86, then indeed added an MFENCE at the right place, and then showed how that was not enough for ARM. So the point back then was to show how weaker memory models can bite you with the example of x86, and then to show how it can still bite you on ARM with its even weaker model (ARMv7 at that time, and C11 atomics aren't mentioned yet either, but their old OS-specific support is).
Damn, my knowledge of ARM and x86 memory models was limited to "x86 provides a stronger model" but that was it. So much to learn, thanks for the slides.
> Don't try to avoid synchronization primitives, or even invent your own.
Makes me wonder if it's really a good idea in most cases to use, for example, the Rust parking_lot crate, which reimplements mutexes and RW locks. Besides the speed boost for uncontended RW locks, particularly on x64 Linux, what I really like about parking_lot is that a write lock can be downgraded to a read lock without letting go of the lock. But maybe I'd be better off sticking with the tried-and-true OS-provided lock implementations and finding another way to work around the lack of a downgrade option.
Unless you're the maintainer of the parking_lot crate, you're not "inventing your own". And since parking_lot is AFAIK the second most popular implementation of mutexes and RW locks in Rust (the most popular one being obviously the one in the Rust standard library, which wraps the OS-provided lock implementations), you can assume it's well tested.
I guess it's more about priorities: if you want to build a web app, don't reinvent the underlying protocols. If you want to build an embedded system for monitoring room temperature, don't invent your own locks.
But if you want to make a lock, by all means make a lock; just don't go and reinvent the chip architecture...
People who take on the task of writing such a library, and develop it to the point that it's a language-wide standard, usually know what they're doing (or learn on the job :)
The popularity of such a library helps test it thoroughly on many platforms in various conditions, so there's a high chance that bugs will be spotted and fixed.
It's more like "if you don't need to, don't invent your own [x]." People who like to invent [x] are usually smart enough to understand why that warning is there to begin with, and don't tend to argue with it.
not going to argue about core crypto primitives - i have no idea what it takes to be able to contribute meaningfully there.
but I have no formal education. and when I started there were no gatekeepers. I mean there were, the MIT grads got to work on much cooler stuff than everyone else.
since then I've collaborated on chip designs and written a metric shitton of low level software - including microcode and a lot of parallel synchronization primitives. and when I work with new grads (incl phds), I don't expect them to know any of that really - and we teach them. some of them get it and some of them don't.
not only do I find your emphasis on the 'proper training' misguided - I think it's counterproductive. unless you're working with a very rare professor, they aren't really going to have an adequate notion of what new designs are like at all. most of them have had only the most casual experience with industry. your utility to me as a future systems programmer has a lot more to do with your general level of talent and your interest in the topic than whether or not someone made you do dining philosophers.
I doubt that. The number of ARM processors in operation (as opposed to ever shipped) is far greater than the number of x86 processors. These stories will become more common, but we certainly won't see a "sharp increase".
This sort of bug only happens when running a multithreaded program (with shared memory) on a multicore processor.
You do need both for the problem to happen: Without shared memory, there’s nothing to exploit. And with a single core only, you get time-sliced multithreading, which orders all operations.
My point is, that combination was a lot rarer in ARM land before people started doing serious server or desktop computing with those chips.
Of course. Any such flaws in the Linux kernel or any library used by Android should have been found by now, for example. But the number of ARM processors running developer/server/desktop stacks has been tiny until recently. In my experience, quite a lot of desktop Linux software fails to even build on non-x86_64 machines.
Are you kidding? Arm computers are by far the most common over the past 10 years. Computers are everywhere and servers and home computers account for at most 10% of the market for cpus and microcontrollers.
The dominant Arm core in the world is a Cortex-M (or Cortex-R), which are single-core. They are, 99% of the time, on a die with less than 512K of SRAM, and run an RTOS or bare metal.
These outnumber x86+Cortex-A by probably a factor of 1,000.
That's a good question. I guess we would need to compare all the arm multi-core in smartphones & some laptops, vs all the intel laptops/desktops/servers. Hmmm... tough one.
Virtually all modern (last five years, for the low end) smartphones have multicore chips. It's estimated that there are around _7 billion_ smartphones worldwide, with another 1.5bn sold every year. So I don't think it's a very tough one.
Like quantum physics, memory ordering is deeply unintuitive (on platforms like ARM). Unlike quantum physics, which is an unfortunate immutable fact of the universe, we got ourselves into this mess and we have no one to blame but ourselves for it.
I'm only somewhat joking. People need to understand these memory models if they intend to write atomic operations in their software, even if they aren't currently targeting ARM platforms. In this era, it's absurdly easy to switch an LLVM-based compiler to target aarch64, and it will happen for plenty of software that was written without ever considering the differences in atomic behavior on this platform.
Memory ordering gets somewhat easier after you understand that flat memory shared by execution units is a leaky abstraction, desperately patched over decades by layers and layers of hardware and software. Memory ordering is one way to represent message passing and synchronization between different cores and RAM. This is why I think that "lock-free algorithms" is a misnomer: you still have synchronization, but you simply rely on hardware for it.
Hardware synchronization has better properties than software locks: it can't deadlock, is reentrant, won't get screwed up by a process holding a lock dying, and is (supposedly) guaranteed to complete in bounded time. I don't think it's unreasonable that the definition of lock-free ("guaranteed system-wide progress") focuses on the high-level behavior of typical software locks even if it ends up calling things that are still locks in some sense "lock-free".
> "lock-free algorithms" is a misnomer, you still have synchronization,
Lock-free doesn't mean that there is no synchronization. It is a way to synchronize memory access between threads from the start. It means that there is no additional locking to protect access to the shared resource - all read access is valid, and of any number of simultaneous write accesses at least one succeeds (which is not true for some other algorithms, like network exponential backoff).
Even on x86 the most common instruction you use is LOCK cmpxchg.
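To make that concrete, here's a minimal sketch of my own (not from the article) of a lock-free increment built on compare-exchange; on x86 the compare_exchange typically compiles down to LOCK CMPXCHG, and when several threads race, at least one CAS per round succeeds, so the system as a whole keeps making progress:

    #include <atomic>

    std::atomic<int> counter{0};

    // Lock-free increment via a CAS loop. No mutex is taken, but the
    // synchronization hasn't disappeared; it has moved into the hardware's
    // atomic compare-and-swap.
    void increment() {
        int expected = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(expected, expected + 1,
                                              std::memory_order_acq_rel,
                                              std::memory_order_relaxed)) {
            // On failure, `expected` is reloaded with the current value; retry.
        }
    }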
That's actually a common misconception. Memory ordering, on the majority of common CPUs, has nothing to do with interprocessor communication or processor-RAM communication. Common memory coherency protocols (i.e. MESI and derivatives) guarantee that all caches have a consistent view of memory.
Usually memory reordering is purely an artifact of the way CPUs access their private L1 cache.
For the record, this is false. It is conflating memory coherency with consistency.
Nearly everything in a modern processor is a source of reordering, from branch prediction to basically everything in the OoO backend. Any time you leave the core, there's reordering happening in the network. And yes, that includes caches, which involve a heavy amount of inter-core communication. When you issue two successive loads to different cache lines, which one is going to return first?
The OoO backend itself manages hazards and ensures that ld/st instructions are retired in the correct order to maintain the processor's memory consistency model. Software can build on top of that, e.g. with fences, to impose stricter consistency models.
Sorry, what is false? You seem to be agreeing with me?
edit: to clarify, I claim that the coherency layer (i.e. the L1 cache and beyond) does not introduce any reordering issues, at least for common CPUs using variants of MESI.
In practice you want memory reordering to be a thing because that's what allows you to reorder instructions that touch memory (both at compile time, and also at runtime by the processor), which is what enables a large part of the latency hiding that's going on.
There are systems like the PS3 SPEs and DSPs where you only have normal access to local on-chip memory and have to explicitly initiate DMA to access external memory.
But that's just bad for running general purpose software that can require more memory than available locally since it means you have to do memory cache management in software which is going to be much slower than letting the hardware do it.
The backend is very dataflow-like, yes. But the dependencies are only tracked within a tight instruction window, retirement (esp. of stores) is ordered to implement the memory ordering constraints of the processor, and, last but not least, the ISA itself is not dataflow.
One piece of friction that hurts here is that the C++/Rust and ARM memory models aren't the same, and the consequences of this are unintuitive - compilers and CPUs can both screw with execution ordering.
People who write in C++ should technically _only_ be concerned with the C++ memory model, but x86 has let them be very lax and undisciplined with std::memory_order_relaxed. ARM has some alluring constructs that don't quite fit with the C++ model, which can tempt you to mix and match memory models for performance. All of this means trouble with atomics.
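A tiny illustration of the kind of thing that bites (my own sketch, not from the article): publishing data behind a flag with memory_order_relaxed. On x86-64 it will almost always appear to work, because plain stores and loads already behave like release/acquire there, but formally it's a data race, and on ARM the consumer really can observe the flag before the payload:

    #include <atomic>

    int payload = 0;
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;
        // Should be memory_order_release: without it, nothing orders the
        // payload store before the flag store on a weakly ordered CPU.
        ready.store(true, std::memory_order_relaxed);
    }

    void consumer() {
        // Should be memory_order_acquire for the same reason.
        while (!ready.load(std::memory_order_relaxed)) { }
        // May legitimately read 0 on ARM; usually "works" on x86-64.
        int seen = payload;
        (void)seen;
    }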
I spent some time trying to figure out why the lock-free read/write implementation is correct under x86, assuming a multiprocessor environment.
My read of the situation was that there's already potential for a double-read / double-write between when the spinlock returns and when the head/tail index is updated.
Turns out that I was missing something: there's only one producer thread, and only one consumer thread. If there were multiple of either, then this code would be more fundamentally broken.
That said: IMO the use of `new` in modern C++ (as is the case in the writer queue) is often a code smell, especially when std::make_unique would work just as well. Using a unique_ptr would obviate the first concern [0] about the copy constructor not being deleted.
(If we used unique_ptr consistently here, we might fix the scary platform-dependent leak in exchange for a likely segfault following a nullptr dereference.)
One other comment: the explanation in [1] is slightly incorrect:
> we receive back Result* pointers from the results queue rq, then wrap them in a std::unique_ptr and jam them into a vector.
We actually receive unique_ptrs from the results queue, then because, um, reasons (probably that we forgot that we made this a unique_ptr), we're wrapping them in another unique_ptr, which works because we're passing a temporary (well, prvalue in C++17) to unique_ptr's constructor -- while that looks like it might invoke the deleted copy-constructor, it's actually an instance of guaranteed copy elision. Also a bit weird to see, but not an issue of correctness.
Whenever you use 'new', you have to decide what is going to 'own' the newly allocated thing. You'll also have to remember to call 'delete' somewhere.
Using unique_ptr/make_unique() or shared_ptr/make_shared() automates lifetime management (obviates the need for a manual 'delete') and makes the ownership policy explicit. They also have appropriately defined copying behavior. For example:
struct Foo {
    // lots of stuff here ...
};

struct A {
    Foo* f = new Foo;
    ~A() { delete f; }
};

struct B {
    std::unique_ptr<Foo> f = std::make_unique<Foo>();
    // no need to define a dtor; the default dtor is fine
};
For the destructor and the default constructor, compilers will generate basically identical code for both A and B above. If you try to copy a B, the compiler won't let you because unique_ptr isn't copyable. However it won't stop you from copying an A, even though as written (using the default copy ctor) that's almost certainly a mistake and will likely result in a double free in ~A().
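Continuing the example above (a hypothetical snippet, assuming <memory> and <utility> are included), the difference shows up as soon as a copy happens:

    void broken() {
        A a;
        A b = a;            // compiles: default copy ctor copies the raw pointer
    }                       // both dtors delete the same Foo: double free

    void caught_at_compile_time() {
        B a;
        // B b = a;         // error: unique_ptr's copy constructor is deleted
        B b = std::move(a); // ownership transfer has to be spelled out
    }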
Exactly as usefulcat points out: with modern C++, the bulk of object lifetimes should be handled with unique_ptr and/or directly on the stack, so your destructors are automatically called for you (reducing the risk of double frees or memory leaks).
unique_ptr forces you to think about your dependencies and when objects can / should be cleaned up.
It's widely accepted that you should almost always use some smart pointer like unique_ptr or shared_ptr, and even sometimes the unpopular weak_ptr, depending on your use case.
Either I'm not understanding something that I thought I understood very well, or TFA's authors don't understand something that they think they understand very well.
Their code is unsafe even on x86. You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.
Their attempt to use "volatile" instead of memory barriers is not appropriate. It could easily cause problems on x86 platforms in just the same way that it could on ARM. "volatile" does not mean what you think it means; if you're using it for anything other than interacting with hardware registers in a device driver, you're almost certainly using it incorrectly.
You must use the correct memory barriers to protect the read/write of what they call "head" and "tail". Without them, the code is just wrong, no matter what the platform.
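For the record, here is roughly what the fixed version has to look like (a minimal sketch of my own, not the article's code): the release store on each index pairs with the acquire load on the other side, so the payload access is ordered against the index update on every platform, x86 or ARM:

    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t N>
    class SpscQueue {
        T buf_[N];
        std::atomic<std::size_t> head_{0};  // only written by the consumer
        std::atomic<std::size_t> tail_{0};  // only written by the producer

    public:
        bool push(const T& v) {  // producer thread only
            std::size_t t = tail_.load(std::memory_order_relaxed);
            std::size_t next = (t + 1) % N;
            if (next == head_.load(std::memory_order_acquire))
                return false;                              // full
            buf_[t] = v;                                   // write payload first...
            tail_.store(next, std::memory_order_release);  // ...then publish it
            return true;
        }

        bool pop(T& out) {  // consumer thread only
            std::size_t h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire))
                return false;                              // empty
            out = buf_[h];                                 // read payload first...
            head_.store((h + 1) % N, std::memory_order_release);  // ...then free the slot
            return true;
        }
    };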
> "volatile" does not mean what you think it means; if you're using it for anything other than interacting with hardware registers in a device driver, you're almost certainly using it incorrectly.
Another "correct" use of volatile is a hack to prevent compilers from optimizing away certain code. It's pretty rare to need that and often you can just use a lower optimization level (like the usual -O2) but sometimes you need -O3 / -Ofast or something and a strategic volatile type def to keep everything working.
A classic example is Kahan summation algorithim. At -O2 it's fine. At -O3 or higher it silently defeats the algorithm while appearing to work (you get a sum but without the error compensation). Defining the working vars as volatile makes it work again. This is noted in the wikipedia pseudocode with the comment "// Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers!"
I can’t imagine it’s an O2 vs O3 thing unless a compiler enables “fast-math” optimization to allow associativity. Neither clang nor GCC do this (neither does MSVC I think) - optimization levels never silently turn off IEEE754 floating point. I don’t know about ICC but it sounds like they stupidly enable fast math by default to try to win at benchmarks.
Do you have anything to actually support this statement or did you just assume “overly aggressive optimizing compilers” and “O3” are somehow linked?
Generally optimization levels may find more opportunities to exploit UB, but they do not change the semantics of the language, and all languages I’m familiar with define floating point as a non-associative operation because it’s not when you’re working with finite precision.
TLDR: Don’t use volatile unless you really know what you’re doing, and unless you know C/C++ really well, you probably do not. If anyone tells you to throw in a volatile to “make things work”, it’s most likely cargo curling bad advice (not always, but probably).
There is some amount of truth in what the parent is saying. Ages ago, when x86 only had x87 FP, gcc would program the FPU to use 80-bit precision even when dealing with doubles. The excess precision meant that GCC could not implement IEEE math correctly even without fast-math. Forcing the storing of intermediate values into memory via volatile variables was a partial solution to this problem.
MSVC configures the FPU to use 64-bit precision, which means that doubles work fine, but it has no 80-bit long double and floats still suffer from excess precision.
Kind of, but that still shouldn't have impacted Kahan summation, which only cares about associativity, and extended precision doesn't impact that. They would just end up getting more numerically accurate results on x87.
I did tests on Kahan summation recently on my macbook pro and -O3 defeated the algorithm while -O2 did not. Declaring the below variables as volatile restored error compensation with -O3.
Sounds like a compiler bug to me. Can you file a bug to clang with a reduced standalone test (or I can do it for you if you share the standalone test).
Here is a complete simplified Kahan summation test and indeed it works with -O3 but fails with -Ofast. There must have been something else going on in my real program at -O3. However the original point that 'volatile' can be a workaround for some optimization problems is still valid (you may want the rest of your program to benefit from -Ofast without breaking certain parts).
Changing the three kahan_* variables to volatile makes this work (slowly) with -Ofast.
#include <stdio.h>

int main(int argc, char **argv) {
    int i;
    double sample, sum;
    double kahan_y, kahan_t, kahan_c;
    // initial values
    sum=0.0;
    sample=1.0; // start with "large" value
    for (i=0; i <= 1000000000; i++) { // add 1 large value plus 1 billion small values
        // Kahan summation algorithm
        kahan_y=sample - kahan_c;
        kahan_t=sum + kahan_y;
        kahan_c=(kahan_t - sum) - kahan_y;
        sum=kahan_t;
        // pre-load next small value
        sample=1.0E-20;
    }
    printf("sum: %.15f\n", sum);
}
Correct. `-Ofast`'s claim to fame is that it enables `-ffast-math`, which is why it has huge warning signs around it in the documentation. `-ffast-math` turns on reassociation, which is problematic for Kahan summation. Rather than sprinkling in volatiles, which pessimizes the compiler to no end, I would recommend annotating the problematic function to turn off associativity [1][2].
That way the compiler applies all the optimizations it can but only turns off associative math. This should work on Clang & GCC & be net faster in all cases.
This is what I mean by "If you're sprinkling volatile around, you probably aren't doing what you want" and are just cargo culting bad advice.
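For illustration, a sketch of that per-function annotation (my own example; kahan_sum is a made-up name, and exact pragma/attribute support varies by compiler version, so check your toolchain's documentation):

    #include <cstddef>

    double kahan_sum(const double* x, std::size_t n) {
    // Clang: turn reassociation back off inside this function even under
    // -Ofast/-ffast-math. GCC has a per-function mechanism via
    // __attribute__((optimize("-fno-associative-math"))).
    #if defined(__clang__)
    #pragma clang fp reassociate(off)
    #endif
        double sum = 0.0, c = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double y = x[i] - c;
            double t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        return sum;
    }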
I hope this isn't the actual "real" code, because you've got undefined behavior before you even have to worry about the associativity optimizations. There's an uninitialized read of 'kahan_c' on the first loop iteration.
I think their point is you only need compiler barriers, not actual barrier instructions, on x86. volatile in practice has been the de facto way to get the effect of a compiler memory barrier for a long time, even though it's not the best way to do it nowadays. The original purpose of it is literally preventing the compiler from getting rid of loads and stores and reordering them, which is exactly what is needed when implementing a lockless FIFO. As long as all the stores and loads (including the actual FIFO payload) are volatile, it will work (volatile loads and stores are guaranteed to not be reordered with each other). After that, the x86 guarantees about not reordering are very strong.
Really the best argument against volatile for this kind of thing is the opposite of your point: volatile is too strong. It prevents more reordering than you actually want. Acquire/release semantics are less strong and give the compiler more flexibility.
volatile does not work as a memory barrier, either in theory or in practice. On gcc, for example, you need an explicit additional compiler barrier before a volatile store and after a volatile load to implement the expected release/acquire semantics on x86.
See [1] for example the implementation of smp_store_release and smp_load_acquire in the linux kernel (barrier() is just a compiler barrier and {READ,WRITE}_ONCE are a cast to volatile).
Volatile only prevents reordering of volatile statements (and IO), not all load and stores.
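As a rough sketch of the shape being described (simplified, my own names, not the kernel's actual code; the asm statement is the GCC/Clang compiler-barrier idiom):

    // A pure compiler barrier: emits no instruction, but stops the compiler
    // from moving loads/stores across it.
    #define COMPILER_BARRIER() asm volatile("" ::: "memory")

    // x86-only release store / acquire load built from volatile accesses.
    // The hardware already provides the ordering; the barriers keep the
    // compiler from undoing it. On ARM this is NOT sufficient.
    inline void store_release_x86(volatile int* p, int v) {
        COMPILER_BARRIER();   // keep earlier stores above the publish
        *p = v;
    }

    inline int load_acquire_x86(const volatile int* p) {
        int v = *p;
        COMPILER_BARRIER();   // keep later loads below the read
        return v;
    }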
The point is that the bug is unexploitable on x86 because although the source code may have a bug, on x86 it gets compiled to machine code that does not. That's the thing with undefined behaviour, sometimes it does work exactly as you expect, which can make it so tricky to nail down.
Right. The challenge is written incorrectly on purpose, otherwise the code isn't vulnerable. The use of volatile is a bit of a misdirection for the CTF players, since you're right that it's a common misconception that volatile acts like a barrier.
> You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.
I am not sure about this. From my understanding, on x86, given the absence of compiler reordering, processor reordering should not cause a problem for a single-reader-single-writer FIFO. Normally I just use atomics but I think in this specific instance it should still be okay anyways. Obviously it will not work on ARM.
From my testing if you compile the code on x86 with clang or gcc, the resulting binary is not vulnerable.
Another place it's meaningful to use `volatile` is in benchmarking and testing: to either ensure that a block of code is run despite not having any side effects, or to ensure that a block of code that should not be run is still compiled and emitted to binary.
But yes, `volatile` for what should be atomics is a clear code smell. I made quite a loud noise when I read "the code quality looks excellent" in the article.
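For example (a made-up micro-benchmark sketch; `sink` is just a name I picked): without the volatile store at the end, the optimizer is entitled to delete the whole loop and report a time of roughly zero:

    #include <chrono>
    #include <cstdio>

    volatile unsigned long long sink;  // the one legitimate-ish volatile here

    int main() {
        auto t0 = std::chrono::steady_clock::now();
        unsigned long long acc = 0;
        for (unsigned long long i = 0; i < 100000000ULL; ++i)
            acc += i * i;              // work we want to time
        sink = acc;                    // volatile store: the result must be materialized
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%lld ns\n", (long long)
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    }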
Lock-free programming is really tough. There are really only a few patterns that work (e.g. Treiber stack). Trying to invent a new lock-free algorithm, as this vulnerable code demonstrates, almost always ends in tears.
IMO lock-free multi-producer (MP) or multi-consumer (MC) algorithms are harder to get right than SPSC structures (atomics for shared memory, queues for messaging, triple buffers for tear-free shared memory). But even SPSC algorithms can be tricky; I've found the same (theoretical) ordering error in three separate Rust implementations of triple buffering (one of them mine), written by people who had already learned the ordering rules (which I caught with Loom). And initially learning to reason about memory ordering is a major upfront challenge too.
I particularly like lock-free (wait-free?) SPSC queues because they're (relatively) easy to get right, and are extremely useful for buffering in embedded systems. I end up with something like this on almost every project:
One side of the queue is a peripheral like a serial port that needs to be fed/drained like clockwork to avoid losing data or glitching (e.g. via interrupts or DMA), and the other side is usually software running on the main thread, that wants to be able to work at its own pace and also go to sleep sometimes.
An SPSC queue fits this use-case nicely. James Munns has a fancy one written in Rust [1], and I have a ~100 line C template [2].
This isn't a new lock-free algorithm. Single-reader, single-writer FIFOs are one of the oldest approaches around.
They have to be tweaked (by using memory barriers) when ordering isn't guaranteed. TFA is about an exploit based on code that hasn't added the required memory barriers.
For those interested in memory ordering, I have a few posts on my blog where I build a simulator capable of understanding reorderings and analyze examples with it:
Anyways, https://github.com/tokio-rs/loom is used by any serious library doing atomic ops/synchronization and it blew me away with how fast it can catch most bugs like this.
Rust doesn't catch memory ordering errors, which can result in behavioral bugs in safe Rust and data races and memory unsafety in unsafe Rust. But Loom is an excellent tool for catching ordering errors, though its UnsafeCell API differs from std's (and worse yet, some people report Loom returns false positives/negatives in some cases: https://github.com/tokio-rs/loom/issues/180, possibly https://github.com/tokio-rs/loom/issues/166).
> which can result in behavioral bugs in safe Rust
For example, Rust doesn't have any way to know that your chosen lock-free algorithm relies on Acquire-release semantics to perform as intended, and so if you write safe Rust to implement it with Relaxed ordering, it will compile, and run, and on x86-64 it will even work just fine because the cheap behaviour on x86-64 has Acquire-release semantics anyway. But on ARM your program doesn't work because ARM really does have a Relaxed mode and without Acquire-release what you've got is not the clever lock-free algorithm you intended after all.
However, if you don't even understand what Ordering is, and just try to implement the naive algorithm in Rust without Atomic operations that take an Ordering, Rust won't compile your program at all because it could race. So this way you are at least confronted with the fact that it's time to learn about Ordering if you want to implement this algorithm and if you pick Relaxed you can keep the resulting (safe) mess you made.
It doesn't catch all of them. But data races on plain memory accesses are impossible in safe Rust.
And atomics force you to specify an ordering on every access, which helps both the writer (forced to think about which ordering they need) and reviewer (by communicating intent).
My first-gen Threadripper occasionally deadlocks in futex code within libgomp (the GNU implementation of OpenMP). Eventually I gave up and concluded it was either a hardware bug or a bug that incorrectly relies on atomic behaviour of Intel CPUs. I eventually switched to using clang with its own OpenMP implementation and the problem magically disappeared.
> Nowadays, high-performance processors, like those found in desktops, servers, and phones, are massively out-of-order to exploit instruction-level parallelism as much as possible. They perform all sorts of tricks to improve performance.
Relevant quote from Jim Keller: You run this program a hundred times, it never runs the same way twice. Ever.
A hundred times is not that much except for really cold code paths. It’s probably in the billions if not more and I have to imagine that software level effects typically swamp HW-level effects here. That’s why you see software typically having a performance deviation no greater than ~5-10% unless you’re running microbenchmarks.
The best part is that the original code is not safe even on x86, as the compiler can still reorder non-volatile accesses to the backing_buf around the volatile accesses to head and tail. Compiler barriers before the volatile stores and after the volatile reads are required [1]. It would still be very questionable code, but it would at least have a chance to work on its intended target.
tl;dr: just use std::atomic.
[1] it is of course possible they are actually present in the original code and just omitted from the explanation for brevity
The purpose of abolishing volatile isn't so much to reinforce that it's not intended for this sort of threading nonsense (indeed on Windows the MSVC guarantees mean it almost is intended for this sort of nonsense) but to make it explicit that "volatile variables" were never really a thing anyway by abolishing the volatile qualifier on variables.
And sure enough that's equivalent to what is proposed for C++ although not in just this one paper.
With "volatile variables" you can use compound assignment operators on the variable. What does that even mean? Nothing. It means nothing, it's gibberish, but you can do it and people do. They presumably thought it meant something and since it doesn't they were wrong. So, deprecate this and maybe they'll go read up on the subject.
You can also in C++ declare things that clearly aren't in the least bit volatile, as volatile anyway. C++ has volatile member variables, volatile member functions, volatile parameters... Any code that seems to rely on this probably doesn't do what the people who wrote it thought it does, run away.
>With "volatile variables" you can use compound assignment operators on the variable. What does that even mean? Nothing.
It means exactly the same thing as on a normal variable, and it boggles the mind that people somehow do not understand that. Given 'volatile int i', 'i++' means the exact same thing as 'i = i + 1'. Does that also not make any sense to you? If it does, can you explain why you believe they are different?
Volatile member functions and parameters make no sense, but volatile member variables most certainly do. And there is considerable pushback in the C++ community because this is a significant loss of compatibility with various C-headers used frequently in embedded applications. I wouldn't be surprised if the deprecated features will be reinstated in the language in the end.
> It means exactly the same thing as on a normal variable, and it boggles the mind that people somehow not understand that.
If you just want normal variables, write a normal variable; JF Bastien's change doesn't have any consequences for your normal variables.
> Given 'volatile int i', 'i++' means the exact same thing as 'i = i + 1'
No. That's the post-increment operator so i++ has the same value as i did before the increment, whereas i = i + 1 has the value of i after the increment.
And you might say "Oh, but I never use those values". Good. J.F. Bastien's change forbids that for volatiles too, we're on the same page.
What you apparently want is to do two volatile operations, one read and one write, sandwiching an arbitrary integer operation. You can do that explicitly of course, Bastien's change doesn't alter that - but when you write it out, doubt appears on the horizon. If these are two operations, can't I race with somebody else such that they modify this thing before I write my modified version back and then I overwrite their changes?
And that doubt should be there. But what we know from asking people is that the compound operators falsely reassure them. That's why they were deprecated and why it's sad that there's enthusiasm to bring them back.
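Spelled out (a tiny sketch of my own), the compound form and the explicit form compile to the same pair of accesses, and neither is atomic:

    volatile int flags;

    void bump_compound() {
        flags += 1;        // looks like one step, but it is a read + add + write
    }

    void bump_explicit() {
        int tmp = flags;   // volatile read
        tmp = tmp + 1;     // ordinary arithmetic on the temporary
        flags = tmp;       // volatile write: anything that changed `flags`
                           // in between is silently overwritten
    }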
It's true that the relevant committee is now likely to propose this be changed in C++23 (or before, as an erratum)
But the face saving is on the part of embedded C++ proponents who'd rather never fix the C++ language than risk that the mundane C code cheap vendors actually ship can't be counted as "C++ support" because it's no longer valid C++.
That's the core of the argument in P2327. Not that this isn't a bad idea (it's obviously a bad idea) or that we shouldn't do better (alternatives to C++ already do better) but that this is the status quo and C++ can't expect to improve upon that. A sad state of affairs.
P2327's dismissal of the problem caused by UART1->UCSR0B |= (1 << UCSZ01); comes down to the usual Real Programmer stance: surely every single embedded programmer will arrange "by construction" that the problem can't happen. No actual examples were examined to see if embedded developers reliably do that, which seems unlikely - we're just invited to imagine that they're all inerrant.
Not sure what will actually happen, but you could easily allow volatile variables if they have extern "C" linkage, for compatibility with C headers, while deprecating it elsewhere.
That's not really what extern "C" does though. It doesn't change the language rules or the parsing or anything, it only changes how symbols are represented in the object file. Extern "C" means they are mangled using C rules (that mostly involves adding an underscore); otherwise it's gonna be C++ rules (and loads more information needs to be encoded).
Some people argue for a similar mechanism, something like 'language "C++20" { .. }', that would allow a program to opt in to changes that would otherwise be breaking changes; mostly new keywords. However, changing actual language rules in such a block would be tricky, to say the least.
To draw together the two answers here to the original question.
1) Emulating an ISA includes emulating its memory model. As saagarjha says, this means that Rosetta 2 must (and does) correctly implement total store ordering.
2) There are various ways to implement this. For emulators that include a binary translation layer (that is, that translate x86 opcodes into a sequence of ARM opcodes), one route is to generate the appropriate ARM memory barriers as part of the translation. Even with optimization to reduce the number of necessary barriers, though, this is expensive. Instead, as mmwelt mentions, Apple took an unusual route here. The Apple Silicon MMU can be configured on a per-page basis to use either the relaxed ARM memory model or the TSO x86 memory model. There is a performance cost at the hardware level for using TSO, and there is a cost in silicon area for supporting both; but from the point of view of Rosetta 2, all it has to do is mark x86-accessed pages as TSO and the hardware takes care of the details, no software memory barriers needed.
Are you sure the translators don't insert code necessary to maintain ordering? I would be shocked if most threaded code works when you throw out the x86 memory model. Managed runtimes like .NET definitely generate code for each target designed to maintain the correct memory model.
> You can also select multi-core settings, as shown here... These settings change the number of memory barriers used to synchronize memory accesses between cores in apps during emulation. Fast is the default mode, but the strict and very strict options will increase the number of barriers. This slows down the app, but reduces the risk of app errors. The single-core option removes all barriers but forces all app threads to run on a single core.
zamadatix interprets this as Microsoft saying that by default, Windows on ARM runs x86 apps without x86 TSO, and turns on extra memory barriers using per-app compatibility settings. But if an app needs TSO and isn't in Windows's database, it will crash or silently corrupt data.
Sure, but Windows on ARM has to run on many ARM processors, not a specific one designed by MS. They could detect if the processor has non-standard TSO support and use that when running an x86 app, but they still have to do something to run the x86 app on a standard ARM processor.
An emulator would have zero issues; if it's a direct translation of the assembly (not an emulator), it'd need either hardware support - e.g. Apple's chips - or memory barriers.
The differences between ARM and x86 have been known for 15+ years; there is nothing new about this. Also, concurrency support is one of the major benefits of languages with a proper memory model - Java started it with the JMM [0]
The differences are very well known; it's just still an open problem how to run code written for a stronger memory model on systems with a weaker one at as high a performance as possible without explicit hardware support (like Apple's choice of a TSO config bit). Your compiled binary for TSO has erased any LoadLoad, LoadStore, and StoreStore barriers, and the emulator has to divine them. The heuristics there are still fraught with peril.
The JVM absolutely did some great work walking this path, both in defining a memory model in the first place, and supporting that model on weak and strong hardware memory models, but the JMM was specifically designed to be able to run cleanly on WMO platforms to begin with (early Sparc), so they don't face a lot of the same problems discussed here.
Any emulator that wants to be remotely performance competitive will do dynamic translation (i.e JIT). In fact ahead-of-time translation is not really feasible.
Memory models and JVM are not really relevant when discussing running binaries for a different architecture.
The memory models are relevant as the translation/JIT/whatever has to take them in consideration. TSO is well known and it's also well known how arm weaker memory model needs memory barriers to emulate TSO.
If there is a JIT, I'd expect to be able to add read barriers on memory locations allocated by another thread - including allocating bits in the pointers and masking them off on each dereference. If any block appears to be shared, the code that allocated it would need to be recompiled with store-store barriers; the reading part would need load-load, and so on. There are quite a few ways to deal with the case, aside from the obvious one - make the hardware compatible.
Even if it's not an easy feat to make up for the stronger memory model, correctness should be a prime goal of an 'emulator'.
It is relevant as an example of how to write a JIT for a specific memory model that will run on a different architecture with a different memory model. AKA it is a known issue that has been successfully dealt with for quite a while now.
Ah well, back in the day, prior to the JMM, there was no notion of memory models at all, and no language had one; hence, I referenced the original paper that started it all. The point was that it happened a long time ago, and there is nothing new about the current case.