Like quantum physics, memory ordering is deeply unintuitive (on platforms like ARM). Unlike quantum physics, which is an unfortunate immutable fact of the universe, we got ourselves into this mess and we have no one to blame but ourselves for it.

I'm only somewhat joking. People need to understand these memory models if they intend to write atomic operations in their software, even if they aren't currently targeting ARM platforms. In this era, it's absurdly easy to retarget an LLVM-based compiler to aarch64, and that will happen to plenty of software that was written without ever considering the differences in atomic behavior on that platform.
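
A minimal sketch of the failure mode (the names and values here are mine, purely illustrative): under the C++ model this is formally a data race, yet x86's strong ordering tends to make it work in practice, while on aarch64 the reader can observe ready == true and still read a stale data.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;
    std::atomic<bool> ready{false};

    void writer() {
        data = 42;                                    // plain store, no release
        ready.store(true, std::memory_order_relaxed); // relaxed: imposes no ordering
    }

    void reader() {
        while (!ready.load(std::memory_order_relaxed)) {} // relaxed: no acquire
        // Formally a data race (UB) in C++. On x86 this assert happens to
        // hold in practice; under ARM's weaker model it can fire.
        assert(data == 42);
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
    }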



Memory ordering gets somewhat easier after you understand that flat memory shared by execution units is a leaky abstraction, desperately patched over decades by layers and layers of hardware and software. Memory ordering is one way to represent message passing and synchronization between different cores and RAM. This is why I think that "lock-free algorithms" is a misnomer: you still have synchronization, but you simply rely on hardware for it.


Hardware synchronization has better properties than software locks: it can't deadlock, is reentrant, won't get screwed up by a process holding a lock dying, and is (supposedly) guaranteed to complete in bounded time. I don't think it's unreasonable that the definition of lock-free ("guaranteed system-wide progress") focuses on the high-level behavior of typical software locks even if it ends up calling things that are still locks in some sense "lock-free".
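
As a concrete example (my own sketch, not the parent's), a single atomic increment has all of those properties: it is wait-free, there is nothing for a dying thread to "hold", and the hardware arbitrates contention.

    #include <atomic>

    std::atomic<long> counter{0};

    // Wait-free: completes in a bounded number of steps regardless of what
    // other threads do. There is no lock to leak if a thread dies; the
    // hardware RMW (e.g. LOCK XADD on x86, LDADD on ARMv8.1+) arbitrates.
    long next_id() {
        return counter.fetch_add(1, std::memory_order_relaxed);
    }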


That’s the benefit you get from having code that defaults to a race condition instead of defaulting to deadlock.

(Which is better? I don’t know.)


> "lock-free algorithms" is a misnomer, you still have synchronization,

Lock-free doesn't mean that there is no synchronization; it is a way to synchronize memory access between threads from the start. It means that there is no additional locking to protect access to the shared resource: every read access is valid, and out of any number of simultaneous write accesses at least one succeeds (which is not true for some other schemes, like network exponential backoff).

Even on x86, the most common instruction you use is LOCK CMPXCHG.
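
Concretely, the usual shape is a compare-exchange loop. A sketch (my own example) of why "at least one succeeds" is the progress guarantee: a thread's CAS can only fail because some other thread's CAS succeeded.

    #include <atomic>

    std::atomic<int> max_seen{0};

    // Lock-free max update: if several threads race, every failed
    // compare_exchange implies another thread's succeeded, so the system
    // as a whole always makes progress. On x86 this compiles down to the
    // LOCK CMPXCHG mentioned above.
    void record(int v) {
        int cur = max_seen.load(std::memory_order_relaxed);
        while (cur < v &&
               !max_seen.compare_exchange_weak(cur, v,
                                               std::memory_order_relaxed)) {
            // On failure, cur is reloaded with the observed value; retry.
        }
    }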


That's actually a common misconception. Memory ordering, on the majority of common CPUs, has nothing to do with interprocessor communication or processor-RAM communication. Common memory coherency protocols (i.e. MESI and derivatives) guarantee that all caches have a consistent view of memory.

Usually memory reordering is purely an artifact of the way CPUs access their private L1 cache.


For the record, this is false. It is conflating memory coherency with consistency.

Nearly everything in a modern processor is a source of reordering, from branch prediction to basically everything in the OoO backend. Any time you leave the core, there's reordering happening in the network. And yes, that includes caches, which involve a heavy amount of inter-core communication. When you issue two successive loads to different cache lines, which one is going to return first?

The OoO backend itself manages hazards and ensures that ld/st instructions are retired in the correct order to maintain the processor's memory consistency model. Software can build on top of that, e.g. with fences, to impose stricter consistency models.
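
For instance (my sketch, not the parent's code), a pair of C++ fences around otherwise relaxed operations imposes exactly that stricter ordering on top of what the backend retires:

    #include <atomic>

    int payload = 0;
    std::atomic<bool> flag{false};

    void producer() {
        payload = 1;
        std::atomic_thread_fence(std::memory_order_release); // orders prior writes
        flag.store(true, std::memory_order_relaxed);
    }

    void consumer() {
        while (!flag.load(std::memory_order_relaxed)) {}
        std::atomic_thread_fence(std::memory_order_acquire); // orders later reads
        // The release/acquire fence pairing guarantees payload == 1 here.
    }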


Sorry, what is false? You seem to be agreeing with me?

edit: to clarify, I claim that the coherency layer (i.e. the L1 cache and beyond) does not introduce any reordering issues, at least for common CPUs using variants of MESI.


Is there anything out there that exposes a better or tighter abstraction? Something not flat?


In practice you want memory reordering to be a thing, because that's what allows instructions that touch memory to be reordered (both at compile time and at runtime by the processor), which enables a large part of the latency hiding that's going on.


There are systems like the PS3 SPEs and DSPs where you only have normal access to local on-chip memory and have to explicitly initiate DMA to access external memory.

But that's just bad for running general-purpose software that can require more memory than is available locally, since it means you have to do cache management in software, which is going to be much slower than letting the hardware do it.


A Dataflow architecture ISA would. It's been tried before. But working out the entire software stack from scratch is a moonshot.


High-performance processors are data flow processors, which infer the data flow graph from the instruction stream using Tomasulo's algorithm.


The backend is very dataflow-like, yes. But the dependencies are only tracked within a tight instruction window, retirement (esp. of stores) is ordered to implement the memory ordering constraints of the processor, and, last but not least, the ISA itself is not dataflow.


One piece of friction that hurts here is that the C++/Rust and ARM memory models aren't the same, and the consequences of this are unintuitive: compilers and CPUs can both screw with execution ordering.

People who write in C++ should technically _only_ be concerned with the C++ memory model, but x86 has let them be very lax and undisciplined with std::memory_order_relaxed. ARM has some alluring constructs that don't quite fit with the C++ model, which can tempt you to mix and match memory models for performance. All of this means trouble with atomics.
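
The portable discipline (a sketch with my own variable names) is to state the ordering you actually need. Acquire/release costs nothing extra on x86, where plain loads and stores already have those semantics, and compiles to the right instructions on ARM:

    #include <atomic>

    int data = 0;
    std::atomic<bool> ready{false};

    void writer() {
        data = 42;
        ready.store(true, std::memory_order_release); // plain MOV on x86, typically STLR on ARMv8
    }

    void reader() {
        while (!ready.load(std::memory_order_acquire)) {} // plain MOV on x86, typically LDAR on ARMv8
        // data == 42 is now guaranteed on both architectures.
    }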


ARMv8 quite closely mirrors the C++ memory model as used without any explicit memory orderings (the default on atomics being sequentially consistent).


It also added load-acquire and store-release instructions that map to the corresponding C++ primitives, which are very common (as they are fast on x86).



