
This isn't directed at you, but to add some more flavor - bit spinlocks require an atomic unlock, whereas byte/word ones can be non-atomic. Which isn't so much of a problem these days, but it used to be a big strike against bit spinlocks back when locked operations took hundreds of cycles and even went out on the bus, stopping everyone else from accessing memory.

But even byte/word spinlocks that avoid the atomic unlock, and even on modern systems where atomic operations are quite cheap relative to a cache miss, still have two major problems: the "thundering herd" effect, and unfairness/starvation.

The basic spinlock is just a bit or byte or word in memory. Waiters keep reading the lock value until it becomes unlocked, then attempt to lock it; if that fails, they go back to waiting.

The thundering herd problem is at the cache coherency level. If you have a bunch of CPUs waiting on the lock, then when the lock is released they all have that line invalidated. They each request the line again, see the new (unlocked) value, and all try to take the lock. Each of those lock attempts is an exclusive access to the line, which again invalidates all the others, and in the end only one succeeds, so all the other attempts were wasted, and those CPUs go back to waiting, pulling the line back into shared state. So N CPUs taking a contended spinlock can become an O(N^2) operation all up. Coherency protocols might be smart enough to mitigate this somewhat, so it may not be quite so bad in practice, but it's always going to be costly, and those hardware heuristics can break down in extreme cases. So the effective critical section, including the lock overhead, grows quickly as you increase CPU count, which can cause scalability to not just hit a wall but go off a cliff and get worse as you add CPUs contending for the lock. That's obviously bad on its own, but the negative feedback spiral is particularly bad (workload slows down -> more CPUs recruited to handle workload -> workload gets slower), and it means you can stay stuck in a degraded state even after the overload condition has passed.

The fairness problem is difficult as well. A cache coherency protocol itself can be fair and starvation-free, in that a CPU will get access to a line in a relatively equal and timely manner, and that can be mathematically proven and verified in the hardware. But when you start building on top of it, those properties don't automatically translate to the higher-level primitive. In the above situation, where it's a free-for-all on the lock word, any one waiter may get access to the line fairly and never be starved at the hardware level, but each time it performs the atomic operation to acquire the lock it might find the lock has already been taken by another CPU -- the hardware doesn't know what the lock is, or that the value 0x1234 means it's free. So to truly have fair, non-starving locks, you need to build those guarantees in software, relying on the hardware only for the atomic operations.

The ticket locks the parent poster refers to are one such solution for fairness, as they're FIFO. They were brought into Linux because of unfairness/starvation problems. They have since been superseded by "queued" spinlocks (using the "MCS" algorithm), which are also FIFO, but instead of spinning on the lock word each CPU spins on its own area, so unlock becomes closer to a 1:1 operation at the coherency level rather than 1:N, which tackles the thundering herd problem as well.



