Fine-grained locking with two-bit mutexes (probablydance.com)
65 points by ibobev on Dec 7, 2022 | 39 comments


We used simple bit spinlocks early in the multiprocessor scalability work done on the Linux kernel. While nearly "optimal" in terms of memory usage and simplicity, they tend to fall apart rather dramatically on systems with higher core counts. But they're a great step in the process of learning how to walk before running on bigger MP systems...


Note that the mutex discussed in the article is not a spin lock; it actually sleeps on contention.


Sleep as in configure a timer and then wait for an interrupt?


Sleep as in do a syscall to tell the kernel to unschedule it.


Can you elaborate, or point to references, on how they "fall apart" and on the approaches used for higher core counts?


An efficient spinlock relies on a good kernel scheduler. I imagine that's the biggest hindrance to performance in a multicore system.


No, the scheduler has nothing to do with the performance of spinlocks. By definition, a spinlock "spins" until the lock is acquired. When you have multiple CPUs performing atomic operations on a single cache line, that cache line has to bounce around between CPUs. Cache lines bouncing around degrades performance. This is the main motivation behind other spinlock implementations like ticket spinlocks which upon release grant the lock to a specific waiting CPU.

This really only matters on contended spinlocks, which is not normally the case in a scalable system. But there are always pathological workloads that stress performance in challenging ways.
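
For a concrete picture of why the line bounces, here's a minimal test-and-set spinlock sketch (C++, not the article's code): every failed exchange is still an exclusive read-modify-write on the lock's cache line.

    #include <atomic>

    // Minimal test-and-set spinlock: each acquisition attempt is an
    // exclusive RMW on the lock's cache line, so contended locking
    // keeps yanking the line between CPUs.
    struct TasSpinlock {
        std::atomic<bool> locked{false};

        void lock() {
            while (locked.exchange(true, std::memory_order_acquire)) {
                // spin; every iteration re-acquires the line exclusively
            }
        }
        void unlock() {
            locked.store(false, std::memory_order_release);
        }
    };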


This isn't directed at you, but to add some more flavor - bit spinlocks require an atomic unlock, whereas byte/word ones can be non-atomic. Which isn't so much of a problem these days, but it used to be a big one against bit spinlocks when locked operations took hundreds of cycles and even went on the bus and stopped everyone else accessing the memory.

But even byte/word spin locks that could avoid the atomic unlock, and even in modern systems where atomic operations are quite cheap relative to a cache miss, still have two major problems: the "thundering herd" effect and unfairness/starvation.

The basic spin lock is just a bit or byte or word in memory where waiters just keep reading the lock value until it becomes unlocked, and then they attempt to lock it. If that fails, continue waiting.

The thundering herd problem is at the cache coherency level. If you have a bunch of CPUs waiting on the lock then when the lock is released at the cache coherency level they all have that line invalidated, and then they will each request the line again, see the new (unlocked) value, and they will all try to take the lock, and each of those lock operations is an exclusive access to the line which also invalidates all others, and in the end only one succeeds, so all the others were wasted and go back to waiting, pulling the line back into shared state. So N CPUs taking a contended spin lock can become an O(N^2) operation all up. Coherency protocols might be smart enough to mitigate this somewhat so it may not be quite so bad, but it's always going to be costly and those hardware heuristics could break down in extreme cases. So the effective critical section including the lock overhead can grow quickly as you increase CPU count, which can cause scalability to not just hit a wall but go off a cliff and get worse as you get more CPUs contending for the lock. Which is obviously bad, but the feedback spiral is particularly bad (workload slows down -> more CPUs recruited to handle workload -> workload gets slower) and means you can get stuck in a degraded state beyond the overload condition.

The fairness problem is difficult as well. A cache coherency protocol itself can be fair and anti-starvation, in that a CPU will get access to a line in a relatively equal and timely manner, and that could be mathematically proven and verified in the hardware. But when you start building on top of that, those properties don't automatically translate to the higher level primitive. In the above situation where it's a free-for-all on the lock word, any one waiter may get access to the line fairly and never be starved at the hardware level, but each time it performs the atomic operation to acquire the lock it might find that it has been taken by another CPU -- the hardware doesn't know what the lock is or that the value 0x1234 means it's free. So to truly have fair and non-starving locks, you need to build them with those guarantees in software, and only rely on the hardware for atomic operations.

The ticket locks that the parent poster refers to are one such solution for fairness, as they're FIFO. They were brought into Linux due to unfairness/starvation problems. They have since been superseded by "queued" spinlocks, which are also FIFO but instead of spinning on the lock word each CPU spins on its own area (using the "MCS" algorithm), so unlock becomes closer to a 1:1 operation at the coherency level rather than 1:N, which also tackles the thundering herd problem described above.
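
As a rough illustration (my sketch, not the kernel's actual code), a ticket lock looks something like this:

    #include <atomic>

    // Ticket spinlock sketch: FIFO by construction, since each waiter
    // takes the next ticket and waits for its number to come up.
    struct TicketSpinlock {
        std::atomic<unsigned> next_ticket{0};
        std::atomic<unsigned> now_serving{0};

        void lock() {
            unsigned my = next_ticket.fetch_add(1, std::memory_order_relaxed);
            while (now_serving.load(std::memory_order_acquire) != my) {
                // spin; note everyone still watches the same now_serving
                // word, which is why MCS-style queued locks go further and
                // give each waiter its own location to spin on
            }
        }
        void unlock() {
            now_serving.fetch_add(1, std::memory_order_release);
        }
    };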


An interesting discussion I recently found on this topic was the Rust discussion on improving mutex and related types in their standard library: https://github.com/rust-lang/rust/issues/93740 A few posts in, someone surveyed various lock implementations (including musl libc, glibc, Wine, Windows, Boost, Darwin, FreeBSD) and kept notes on them. I learned a lot from those discussions.


JavaScriptCore has used this technique for ~5-6 years now.

Reason: some rare cases of interaction between the thread running JS and the JIT or GC threads require locking some of the object's state. Objects need to be small, since adding even one word costs multiple percent of perf. And there happened to be two bits available in the indexingType byte in the header.


Thanks for mentioning this, I couldn't think of where I'd read about this technique before. I was remembering https://webkit.org/blog/6161/locking-in-webkit/


I might be missing something, but I think that the handling of the sleeper state is wrong.

   mutex is unlocked (state: unlocked)
   A locks mutex (state: locked -> locked)
   B attempts to lock mutex (state: locked -> sleeper)
   C attempts to lock mutex (no changes)
   A unlock mutex (state: sleeper -> unlocked), state was sleeper so notify_one B
   B locks the mutex (state: unlocked -> locked)
   B unlocks mutex  (state: locked -> unlocked), state was locked, so do not notify
C hangs forever. To use notify_one you need to either guarantee at most one waiter (making notify_one pointless) or you need a wait count. If you use only one bit (or when your count saturates, which is the same thing), you either need to notify_all (and deal with the thundering herd) or unconditionally call notify_one (and do away with the fast path). IIRC the original futex paper had a discussion of this.

Incidentally, I love the atomic::wait in C++20, although it is not without issues (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p26...).


When B wakes up, the first thing it does is state.exchange(sleeper, …), which returns unlocked and sets the state to sleeper. I think it works.

I had some concerns about the second one, but I think it works too…
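
For anyone following along, here's how I read the scheme being discussed (a sketch using a whole byte rather than two packed bits; the names and details are mine, not necessarily the article's):

    #include <atomic>
    #include <cstdint>

    struct TwoBitMutex {
        static constexpr uint8_t unlocked = 0, locked = 1, sleeper = 2;
        std::atomic<uint8_t> state{unlocked};

        void lock() {
            if (state.exchange(locked, std::memory_order_acquire) == unlocked)
                return;  // uncontended fast path
            // Slow path: always mark the state as 'sleeper' so whoever
            // unlocks knows to notify; if the exchange returned unlocked,
            // we now own the lock (conservatively left in the sleeper state).
            while (state.exchange(sleeper, std::memory_order_acquire) != unlocked)
                state.wait(sleeper);  // C++20: block while value == sleeper
        }
        void unlock() {
            if (state.exchange(unlocked, std::memory_order_release) == sleeper)
                state.notify_one();
        }
    };

In the trace above, B's wake-up exchange leaves the state at sleeper, so when B later unlocks it sees sleeper and notifies C, which is why C doesn't hang.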


Oh, now I see it, very nice. I had missed that there is at least one additional exchange after wait returns.


An interesting thing to try with the two-bit mutex is to add exponential back off. I suspect it’ll dramatically improve the latency under contention.

Another interesting thing to point out is that there seems to be a crossover point between where an array of std::mutex is better and where an array of these pointer-locks is. Imagine you have an array of pointers to objects that you want to lock. With the pointer-locks, you'll have 8 mutexes per cache line, which for small arrays will probably mean a high degree of false sharing. With an array of std::mutex, you'd probably pad them out to 64 bytes and there would be no false sharing. But for larger arrays (assuming a uniform distribution of accesses), the probability of false sharing would be low enough that the pointer-mutex would be faster.
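
To make the two layouts concrete (a hypothetical sketch, assuming 64-byte cache lines and arbitrary array sizes):

    #include <cstdint>
    #include <mutex>

    // One lock per cache line: no false sharing, 64 bytes per lock.
    struct alignas(64) PaddedMutex { std::mutex m; };
    PaddedMutex    padded_locks[256];   // 16 KiB

    // Pointer-sized locks: 8 per cache line, so neighbouring hot slots
    // will false-share, but the whole array is only 2 KiB.
    std::uintptr_t pointer_locks[256];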

The last thing I have to say is that I've been reading a lot about memory models lately, and I've become terrified of trying to mix TLA+ and non-seq_cst memory orders. Since TLA+ produces a global total order of actions, you still haven't quite properly model checked your program if you haven't modeled C++'s memory model in your spec. That said, you can probably get away with it most of the time (especially on x86).


Just putting an x86 pause in your spin loop will dramatically improve the mean latency under contention, but it also leads to huge tail latency explosions because it has no fairness mechanism.
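
For reference, a spin loop with pause (and, per the parent's suggestion, capped exponential backoff) might look like this sketch; the backoff cap is arbitrary:

    #include <atomic>
    #include <immintrin.h>  // _mm_pause on x86

    // Spin with pause and capped exponential backoff. This reduces
    // coherency traffic and mean latency, but there is still no
    // fairness: an unlucky thread can keep losing the race.
    inline void backoff_lock(std::atomic<bool>& locked) {
        unsigned delay = 1;
        while (locked.exchange(true, std::memory_order_acquire)) {
            for (unsigned i = 0; i < delay; ++i)
                _mm_pause();
            if (delay < 1024)
                delay *= 2;       // arbitrary cap
        }
    }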


You can also dramatically reduce the latency by testing before attempting the CAS, which will switch a lot of cache line requests into Shares instead of Owns. I didn’t see it in this implementation.

But the point of the wait/notify in the post is to sleep, if I understand correctly. My guess is there is some amount of spurious waking up happening, or multiple threads are woken up by the notify_one (it only guarantees that at least one is woken up!). An exponential backoff would dramatically reduce the probability of these spurious wakeups colliding.

On a separate note, fairness isn’t always required, just throughput. If you need predictable latency then this lock won’t work for you, but in this case optimizing for mean latency seems more worthwhile.
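
The test-before-RMW idea looks roughly like this (my sketch, not the article's code):

    #include <atomic>

    // Test-and-test-and-set: spin on a plain load (the line stays in
    // the Shared state) and only attempt the atomic RMW once the lock
    // looks free, so exclusive/Owned requests happen only on real
    // acquisition attempts.
    inline void ttas_lock(std::atomic<bool>& locked) {
        for (;;) {
            if (!locked.exchange(true, std::memory_order_acquire))
                return;                                  // got it
            while (locked.load(std::memory_order_relaxed))
                ;                                        // read-only spin
        }
    }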


You can do better than CAS for spinlocks on bits. The representation needs to handle locked, unlocked, and try-lock. Let unlocked be 0 and locked be 1. With CAS, try-lock has to be done with a compare-and-swap where the lock bit is set and the rest of the word is a guess at its current contents, retried in a loop when the guess is wrong.

Instead of that, use fetch_or with only the corresponding bit set. That leaves the rest of the word unchanged and the locking position set to one.

Then check the return value. You know the lock is taken now. If the bit is set in the return, it was locked before your fetch_or. So try-lock failed and changed nothing. If the bit is clear in the return, try-lock succeeded and you now hold the lock.

This eliminates the guess-the-word part of using CAS, and it eliminates the retry loop you get because other bits in the word might have changed while you were trying to lock it.
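
In code, that looks something like this (a sketch; the word size and bit index are arbitrary):

    #include <atomic>
    #include <cstdint>

    // Try-lock on bit `bit` of a shared word using fetch_or: no guess
    // about the other bits, no retry loop caused by unrelated bits changing.
    inline bool try_lock_bit(std::atomic<uint64_t>& word, unsigned bit) {
        const uint64_t mask = uint64_t(1) << bit;
        uint64_t prev = word.fetch_or(mask, std::memory_order_acquire);
        return (prev & mask) == 0;   // bit was clear => we just took the lock
    }

    inline void unlock_bit(std::atomic<uint64_t>& word, unsigned bit) {
        word.fetch_and(~(uint64_t(1) << bit), std::memory_order_release);
    }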


You can implement a 1-bit mutex, by keeping the contended state in a side structure. The side structure need only scale with about the number of contending threads, or approximately the number of total threads, rather than the number of locks.


It is complicated. The side structure for a futex is kernel-side, so you want a userspace fast path to avoid the syscall.

On the other hand, atomic::wait doesn't necessarily map directly to a futex_wait (nor can it in all cases; futexes are always exactly 32 bits). IIRC, at least in libstdc++, atomic::{wait,notify_one} are fast-pathed in user space, so they already keep some side structure; that means you could get away with calling notify_one unconditionally and using only one bit.

I think that is a mistake on libstdc++'s part though. I would really like atomic to be a thin wrapper around futex, and since you can't rely on the fast path being available everywhere, you need either platform-specific code or suboptimal duplicate code.

In fact libstdc++ does even more: it adds spin-waiting and other optimizations which are not necessarily wanted.


I'm talking about how, in general, you can implement a 1-bit mutex. Regardless of how you sleep and wake, you make a side structure which contains the contended state of the mutex.

In the kernel of course you could just reuse your wait queue structure for that, which is exactly what bit mutexes do in Linux, but that's an implementation detail to my point.


Owner tracking and priority inheritance, anyone?


To prevent priority inversion?


You can't have that with futexes. That would require a system call for every lock and unlock, not just for contended ones.

It's a trade-off.


That’s not true. Just create a 32- or 64-bit futex and have threads write their thread ID into it.

https://www.kernel.org/doc/Documentation/pi-futex.txt
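
As I understand the protocol in that doc (a rough, untested Linux-only sketch; the futex word itself is 32 bits), both lock and unlock stay in userspace when there's no contention:

    #include <atomic>
    #include <cstdint>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // PI futex sketch: the word holds 0 when free and the owner's TID
    // when held; the kernel only gets involved (and applies priority
    // inheritance) on the contended paths.
    static std::atomic<uint32_t> word{0};

    void pi_lock() {
        uint32_t expected = 0;
        uint32_t tid = (uint32_t)syscall(SYS_gettid);
        if (word.compare_exchange_strong(expected, tid, std::memory_order_acquire))
            return;  // uncontended: no syscall
        syscall(SYS_futex, &word, FUTEX_LOCK_PI, 0, nullptr, nullptr, 0);
    }

    void pi_unlock() {
        uint32_t expected = (uint32_t)syscall(SYS_gettid);
        if (word.compare_exchange_strong(expected, 0, std::memory_order_release))
            return;  // no waiters recorded: no syscall
        syscall(SYS_futex, &word, FUTEX_UNLOCK_PI, 0, nullptr, nullptr, 0);
    }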


But the whole point here is to have teeny tiny futexes.


Slightly off topic, but did anyone else do a double-take at the "make $1000 donation" form at the bottom?


The author justifies it immediately above:

> Also I’m trying out a donation button for the first time. If open source work like this is valuable for you, consider paying for it. The recommended donation is $20 (or your local cost for an item on a restaurant menu) for individuals, or $1000 for organizations. (or your local cost of hiring a contractor for a day) But any amount is appreciated.

But also, you miss 100% of the shots you don't take.


Readable font-sizes don't come cheap!


One thing to be careful of is that mutexes are fair while spinlocks are not. It is possible for a worker to wait indefinitely behind workers that acquired the lock at a later time.


Mutexes aren't necessarily fair, and there are some good reasons you may not want true fairness in your lock [1] (that said, most implementations try to be as fair as they can be without introducing too much overhead).

[1]: http://joeduffyblog.com/2006/12/14/anticonvoy-locks-in-windo...


Right, but spinlocks have no concept of fairness at all. With a mutex one thread might experience unfairness for a few ms; with a spinlock there is no bound.


It's not mutexes that are fair, it's the scheduling that is fair. Scheduling tries to prevent starvation and schedules the preempted mutex thread at least periodically.


That's unnecessary pedantry, in context. I am explaining an important con of spinlocks. Whether it is because mutexes are fair or because they behave fairly is completely irrelevant. The point is that if you use spinlocks (which the article incorrectly calls mutexes), then you could see unintuitive behavior under load.


You can have fair spinlocks if you handle queuing yourself.

Also the article does indeed implement a real mutex (although a buggy one), not a spinlock. std::atomic::wait will use futexes (or whatever) underneath to genuinely put the process to sleep until it receives a wakeup.


Are the locks presented in the article guaranteed to be acquired in approximate FIFO order?


I don't think atomic::wait/notify does guarantee any fairness. It would need to be built on top.


Btw, I retract my "buggy" comment.


This isn't a spinlock. This is a futex.



