
> Reason: locks that have the ability to put the thread to sleep on a queue must do compare-and-swap (or at least an atomic RMW) on `unlock`. But spinlocks can get away with just doing a store-release (or just a store with a compiler fence on X86) to `unlock`.

This is something I've been thinking about a lot over time: the CAS is only there to atomically determine, on unlock, whether there are any sleeping waiters that you have to futex_wake. I would really want some way to get away with non-fenced operations (at least on x86), but I don't know if it's just that nobody has figured out how, or if there is a fundamental reason why that's not possible.
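
For concreteness, this is roughly the difference on Linux/x86 (a minimal sketch, assuming a futex-based mutex whose word is 0 = free, 1 = held, 2 = held with waiters; all names are illustrative, not any library's actual code):

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static void spinlock_unlock(atomic_int *word) {
        // Nobody can be asleep on a spinlock, so a store-release is enough;
        // on x86 this compiles to a plain MOV plus a compiler barrier.
        atomic_store_explicit(word, 0, memory_order_release);
    }

    static void mutex_unlock(atomic_int *word) {
        // Must learn, atomically with the release, whether anyone is
        // sleeping -- hence an atomic RMW, which on x86 is a full fence.
        if (atomic_exchange_explicit(word, 0, memory_order_release) == 2)
            syscall(SYS_futex, word, FUTEX_WAKE, 1, NULL, NULL, 0);
    }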




You do need a fence in the unlock path though (at least a release fence).

I think the issue is that if you ask the CPU to just store something (like in a spin lock), whether or not there’s a fence, it’s an operation with limited data flow dependencies so it’s easy for the CPU to execute. Even the fence can be handled using wacky speculation tricks.

But if you want to do something like, “store this value but only if the old value satisfies some predicate”, then there’s a load and the whole thing is dependent on the load. So you’re asking the CPU to load, then run a predicate, then store, and for that to be fenced, and atomic.

Strictly more work. I don’t think there’s any trick to make it faster than just the store release.


> You do need a fence in the unlock path though (at least a release fence).

Well, yes, but on x86 that comes for free. The overhead of the full fence brought in by lock cmpxchg or lock xchg is on the order of ~10ns, which for an uncontended lock means that a mutex is almost 2x as slow as a spinlock.

A load-acquire + store-release pair would be a couple of ns (assuming everything is in L1, etc.).


As far as I know it is a fundamental limitation. You need to release the mutex, then check that there were no waiters, in this order, so as not to miss wakeups. Since the mutex release must be globally visible before the waiter check, release ordering on the mutex is not sufficient: the load could be reordered before the unlock store. Hence you need a StoreLoad fence, which is always the most expensive barrier.

Consider the implementation of Dekker's algorithm for example.
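
To make the race concrete, here's a sketch of what goes wrong if unlock tries to get by with a release store plus an ordinary load (the names and the futex calls are illustrative):

    #include <stdatomic.h>

    atomic_int lock;     // 0 = free, 1 = held
    atomic_int waiters;  // count of threads about to futex_wait

    void broken_unlock(void) {
        atomic_store_explicit(&lock, 0, memory_order_release);         // (1)
        if (atomic_load_explicit(&waiters, memory_order_relaxed) > 0)  // (2)
            /* futex_wake(&lock, 1) */;
    }

    void waiter_slow_path(void) {
        atomic_fetch_add_explicit(&waiters, 1, memory_order_relaxed);  // (a)
        if (atomic_load_explicit(&lock, memory_order_acquire) != 0)    // (b)
            /* futex_wait(&lock, 1) */;                                // (c)
    }

    // Release ordering only orders (1) against *earlier* accesses; nothing
    // stops the CPU from satisfying load (2) before store (1) is visible to
    // other cores.  So (2) can read waiters == 0 before (a) reaches memory,
    // while (b) still observes lock == 1: the waiter goes to sleep at (c)
    // and the wakeup is lost.  Forbidding that reordering takes a StoreLoad
    // fence, i.e. exactly the full barrier the locked RMW provides.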


As a more practical argument, let's suppose x86 had an atomic CAS that guarantees the store is a release but the load is relaxed (i.e., unlike normal x86 loads but like non-temporal ones, it implies no LoadLoad or LoadStore ordering).

This relaxed-load CAS, coupled with some form of control or data dependency, would be sufficient to implement your mutex. But would such a CAS be significantly cheaper than the existing one? If it were, you could approximate the strong CAS semantics by adding lfence+sfence after the relaxed CAS. Those fences are cheap, so if the strong CAS could be implemented that way with a significant improvement, Intel would have already done it.

Finally, it is practically possible to implement the sequence store(A); #StoreLoad; load(B) without an explicit fence by using the colocation trick: have A and B be adjacent in memory (on the same cache line), store to A, then do a wider load covering both A and B. Intel does not give multi-threaded guarantees for this, but my understanding is that it works: the wider load fails to be store-forwarded and stalls the pipeline waiting for the previous store to be flushed. In practice this costs about as much as a fence, so in addition to being undefined, it is not cheaper.
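
Roughly, the trick would look like this (purely a sketch; as said above, nothing in the architecture actually guarantees it, and the names are made up):

    #include <stdatomic.h>
    #include <stdint.h>

    union slot {
        struct {
            _Atomic uint32_t a;  // e.g. the lock word, stored on unlock
            _Atomic uint32_t b;  // e.g. the waiter flag, loaded right after
        } parts;
        _Atomic uint64_t whole;  // wide view covering both
    };

    uint32_t store_a_then_load_b(union slot *s, uint32_t a_val) {
        atomic_store_explicit(&s->parts.a, a_val, memory_order_release);
        // The 8-byte load overlaps the 4-byte store above, so store-to-load
        // forwarding fails and the load stalls until the store drains --
        // behaving, in practice, like the #StoreLoad the sequence needs,
        // at roughly the cost of an explicit fence.
        uint64_t both = atomic_load_explicit(&s->whole, memory_order_relaxed);
        return (uint32_t)(both >> 32);   // B's bytes, on little-endian x86
    }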


Also this paper might be relevant: Laws of Order: Expensive Synchronization in Concurrent Algorithms Cannot be Eliminated [1]

[1] https://www.cs.bgu.ac.il/~hendlerd/papers/p168-expensiveSync...


There are two ways I've tried to do this:

- On the wait side, do the CAS to set the "waiter present" bit. On unlock, do a (relaxed) read of the lock word, and if "waiter present" isn't set, just do a release store to unlock (and go down some slow CAS-y wake path if a waiter is present). On the wait side, never do an untimed futex wait; just do a series of timed waits with increasing wait times (so that you still eventually fix things if you hit the unlucky race with the previous holder's check+store sequence). (You can also do some tricks with counters to let waiters do an unconditional sleep once they get their wait acknowledged.) There's a rough sketch of this below.

- Split out the "waiter present" bit into its own byte, do a store-load sequence (with just a compiler reordering fence) to check for waiters, and have waiters either do a membarrier() syscall or wait "long enough" that they're sure they've gotten the same effect. (This gets tricky w.r.t. mutex lifetime though; you either need out of band lifetime knowledge or to use RCU or whatever and indirect through pointers).

Practically, neither was ever "better enough" for anything but microbenchmarks to be worth the complexity.
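
Here's that rough sketch of the first scheme (the bit layout, names, and timeouts are made up for illustration, not taken from a real implementation):

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    #define LOCKED 1u
    #define WAITER 2u   // the "waiter present" bit

    void unlock(atomic_uint *lock) {
        // Relaxed read first: if nobody has advertised themselves, a plain
        // release store suffices.  A waiter setting the bit right after this
        // read can be missed, which is why waiters only sleep with timeouts.
        if ((atomic_load_explicit(lock, memory_order_relaxed) & WAITER) == 0) {
            atomic_store_explicit(lock, 0, memory_order_release);
            return;
        }
        // Slow, CAS-y path: clear the word atomically and wake someone.
        atomic_exchange_explicit(lock, 0, memory_order_release);
        syscall(SYS_futex, lock, FUTEX_WAKE, 1, NULL, NULL, 0);
    }

    void lock_slow(atomic_uint *lock) {
        struct timespec to = { .tv_sec = 0, .tv_nsec = 50 * 1000 };  // 50us
        for (;;) {
            unsigned v = atomic_load_explicit(lock, memory_order_relaxed);
            if ((v & LOCKED) == 0) {
                if (atomic_compare_exchange_weak_explicit(
                        lock, &v, v | LOCKED,
                        memory_order_acquire, memory_order_relaxed))
                    return;
                continue;
            }
            // Advertise, then sleep -- but never without a timeout, so a
            // wakeup lost to the unlock race only costs some latency.
            atomic_fetch_or_explicit(lock, WAITER, memory_order_relaxed);
            syscall(SYS_futex, lock, FUTEX_WAIT, v | WAITER, &to, NULL, 0);
            if (to.tv_nsec < 500000000)   // back off: grow the timeout
                to.tv_nsec *= 2;
        }
    }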


> Split out the "waiter present" bit into its own byte, do a store-load sequence (with just a compiler reordering fence) to check for waiters, and have waiters either do a membarrier() syscall or wait "long enough" that they're sure they've gotten the same effect. (This gets tricky w.r.t. mutex lifetime though; you either need out of band lifetime knowledge or to use RCU or whatever and indirect through pointers).

If you are doing this, given the cost of membarrier, you are optimizing for almost always uncontended locks. At this point you can make your lock default-owned by the first thread to lock it and have the owner lock and unlock be basically free until it is contended. This is basically the biased locking optimization that Java implements (or used to).


It kinda depends; you only do the membarrier when you're about to sleep anyways, and the non-expedited membarrier() call is just a synchronize_rcu(), so it's not that drastically more expensive than a futex wait.

You don't necessarily want a biased lock for all this kind of stuff, because "sparsely contended" doesn't necessarily imply thread-associated. E.g., one place I was looking at this for was locks for pages of virtual memory in a heap; no thread "owns" any given heap page, but it was very uncommon to get unlucky and have two threads touching adjacent pages at the exact same time. These kinds of "sloppy mutexes" get half the fast-path speedup of biased locks but without the heavily asymmetric performance costs. (At least, that was the theory; like I said, it didn't really pan out to be that useful in practice.)
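
For what it's worth, here's a rough sketch of what that split-byte/membarrier shape looks like (field names and the exact membarrier usage are guesses, not a real implementation): the unlock side gets away with a compiler-only fence, and the waiter pays for the heavy barrier via membarrier(2) before it sleeps.

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <linux/membarrier.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct sloppy_mutex {
        atomic_uint  lock;            // 0 = free, 1 = held
        atomic_uchar waiter_present;  // separate byte, only waiters write it
    };

    void sloppy_unlock(struct sloppy_mutex *m) {
        atomic_store_explicit(&m->lock, 0, memory_order_release);
        atomic_signal_fence(memory_order_seq_cst);  // compiler-only fence
        if (atomic_load_explicit(&m->waiter_present, memory_order_relaxed))
            syscall(SYS_futex, &m->lock, FUTEX_WAKE, 1, NULL, NULL, 0);
    }

    void sloppy_wait(struct sloppy_mutex *m) {
        atomic_store_explicit(&m->waiter_present, 1, memory_order_relaxed);
        // Non-expedited membarrier is roughly synchronize_rcu(): once it
        // returns, every running thread has passed through a full barrier,
        // which closes the Dekker-style store/load race against the
        // unlocker without an explicit fence on the unlock side.
        syscall(SYS_membarrier, MEMBARRIER_CMD_GLOBAL, 0, 0);
        if (atomic_load_explicit(&m->lock, memory_order_acquire) != 0)
            syscall(SYS_futex, &m->lock, FUTEX_WAIT, 1, NULL, NULL, 0);
    }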



