Like quantum physics, memory ordering is deeply unintuitive (on platforms like ARM). Unlike quantum physics, which is an unfortunate immutable fact of the universe, we got ourselves into this mess and we have no one to blame but ourselves for it.

I'm only somewhat joking. People need to understand these memory models if they intend to write atomic operations in their software, even if they aren't currently targeting ARM platforms. In this era, it's absurdly easy to retarget an LLVM-based compiler to aarch64, and that will happen to plenty of software that was written without ever considering the differences in atomic behavior on that platform.
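
A minimal sketch of the failure mode (the names and values here are mine, purely illustrative): under the C++ model this is formally a data race, yet x86's strong ordering tends to make it work in practice, while on aarch64 the reader can observe ready == true and still read a stale data.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;
    std::atomic<bool> ready{false};

    void writer() {
        data = 42;                                    // plain store, no release
        ready.store(true, std::memory_order_relaxed); // relaxed: imposes no ordering
    }

    void reader() {
        while (!ready.load(std::memory_order_relaxed)) {} // relaxed: no acquire
        // Formally a data race (UB) in C++. On x86 this assert happens to
        // hold in practice; under ARM's weaker model it can fire.
        assert(data == 42);
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
    }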



Memory ordering gets somewhat easier after you understand that flat memory shared by execution units is a leaky abstraction, desperately patched over decades by layers and layers of hardware and software. Memory ordering is one way to represent message passing and synchronization between different cores and RAM. This is why I think that "lock-free algorithms" is a misnomer: you still have synchronization, but you simply rely on hardware for it.


Hardware synchronization has better properties than software locks: it can't deadlock, is reentrant, won't get screwed up by a process holding a lock dying, and is (supposedly) guaranteed to complete in bounded time. I don't think it's unreasonable that the definition of lock-free ("guaranteed system-wide progress") focuses on the high-level behavior of typical software locks even if it ends up calling things that are still locks in some sense "lock-free".
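
As a concrete example (my own sketch, not the parent's), a single atomic increment has all of those properties: it is wait-free, there is nothing for a dying thread to "hold", and the hardware arbitrates contention.

    #include <atomic>

    std::atomic<long> counter{0};

    // Wait-free: completes in a bounded number of steps regardless of what
    // other threads do. There is no lock to leak if a thread dies; the
    // hardware RMW (e.g. LOCK XADD on x86, LDADD on ARMv8.1+) arbitrates.
    long next_id() {
        return counter.fetch_add(1, std::memory_order_relaxed);
    }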


That’s the benefit you get from having code that defaults to a race condition instead of defaulting to deadlock.

(Which is better? I don’t know.)


> "lock-free algorithms" is a misnomer, you still have synchronization,

Lock-free doesn't mean that there is no synchronization; it is a way to synchronize memory access between threads from the start. It means that there is no additional locking to protect access to the shared resource: every read access is valid, and out of any number of simultaneous write accesses at least one succeeds (which is not true for some other schemes, like network exponential backoff).

Even on x86, the most common instruction you use is LOCK CMPXCHG.
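
Concretely, the usual shape is a compare-exchange loop. A sketch (my own example) of why "at least one succeeds" is the progress guarantee: a thread's CAS can only fail because some other thread's CAS succeeded.

    #include <atomic>

    std::atomic<int> max_seen{0};

    // Lock-free max update: if several threads race, every failed
    // compare_exchange implies another thread's succeeded, so the system
    // as a whole always makes progress. On x86 this compiles down to the
    // LOCK CMPXCHG mentioned above.
    void record(int v) {
        int cur = max_seen.load(std::memory_order_relaxed);
        while (cur < v &&
               !max_seen.compare_exchange_weak(cur, v,
                                               std::memory_order_relaxed)) {
            // On failure, cur is reloaded with the observed value; retry.
        }
    }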


That's actually a common misconception. Memory ordering, on the majority of common CPUs, has nothing to do with interprocessor communication or processor-RAM communication. Common memory coherency protocols (i.e. MESI and derivatives) guarantee that all caches have a consistent view of memory.

Usually memory reordering is purely an artifact of the way CPUs access their private L1 cache.


For the record, this is false. It is conflating memory coherency with consistency.

Nearly everything in a modern processor is a source of reordering, from branch prediction to basically everything in the OoO backend. Any time you leave the core, there's reordering happening in the network. And yes, that includes caches, which involve a heavy amount of inter-core communication. When you issue two successive loads to different cache lines, which one is going to return first?

The OoO backend itself manages hazards and ensures that ld/st instructions are retired in the correct order to maintain the processor's memory consistency model. Software can build on top of that, e.g. with fences, to impose stricter consistency models.
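
For instance (my sketch, not the parent's code), a pair of C++ fences around otherwise relaxed operations imposes exactly that stricter ordering on top of what the backend retires:

    #include <atomic>

    int payload = 0;
    std::atomic<bool> flag{false};

    void producer() {
        payload = 1;
        std::atomic_thread_fence(std::memory_order_release); // orders prior writes
        flag.store(true, std::memory_order_relaxed);
    }

    void consumer() {
        while (!flag.load(std::memory_order_relaxed)) {}
        std::atomic_thread_fence(std::memory_order_acquire); // orders later reads
        // The release/acquire fence pairing guarantees payload == 1 here.
    }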


Sorry, what is false? You seem to be agreeing with me?

edit: to clarify, I claim that the coherency layer (i.e. the L1 cache and beyond) does not introduce any reordering issues, at least for common CPUs using variants of MESI.


Is there anything out there that exposes a better or tighter abstraction? Something not flat?


In practice you want memory reordering to be a thing, because that's what allows instructions that touch memory to be reordered (both at compile time and at runtime by the processor), which enables a large part of the latency hiding that's going on.


There are systems like the PS3 SPEs and DSPs where you only have normal access to local on-chip memory and have to explicitly initiate DMA to access external memory.

But that's just bad for running general-purpose software that can require more memory than is available locally, since it means you have to do cache management in software, which is going to be much slower than letting the hardware do it.


A Dataflow architecture ISA would. It's been tried before. But working out the entire software stack from scratch is a moonshot.


High-performance processors are data flow processors, which infer the data flow graph from the instruction stream using Tomasulo's algorithm.


The backend is very dataflow-like, yes. But the dependencies are only tracked within a tight instruction window, retirement (esp. of stores) is ordered to implement the memory ordering constraints of the processor, and, last but not least, the ISA itself is not dataflow.


One piece of friction that hurts here is that the C++/Rust and ARM memory models aren't the same, and the consequences of this are unintuitive: compilers and CPUs can both screw with execution ordering.

People who write in C++ should technically _only_ be concerned with the C++ memory model, but x86 has let them be very lax and undisciplined with std::memory_order_relaxed. ARM has some alluring constructs that don't quite fit with the C++ model, which can tempt you to mix and match memory models for performance. All of this means trouble with atomics.
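
The portable discipline (a sketch with my own variable names) is to state the ordering you actually need. Acquire/release costs nothing extra on x86, where plain loads and stores already have those semantics, and compiles to the right instructions on ARM:

    #include <atomic>

    int data = 0;
    std::atomic<bool> ready{false};

    void writer() {
        data = 42;
        ready.store(true, std::memory_order_release); // plain MOV on x86, typically STLR on ARMv8
    }

    void reader() {
        while (!ready.load(std::memory_order_acquire)) {} // plain MOV on x86, typically LDAR on ARMv8
        // data == 42 is now guaranteed on both architectures.
    }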


ARMv8 quite closely mirrors the C++ memory model as used without any explicit memory orderings (the default on atomics being sequentially consistent).


It also added load-acquire and store-release instructions that map to the corresponding C++ primitives, which are very common (as they are fast on x86).



