Just putting an x86 pause in your spin loop will dramatically improve the mean latency under contention, but it also leads to huge tail latency explosions because it has no fairness mechanism.
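For anyone who hasn't seen it, this is roughly what "putting a pause in the spin loop" looks like. It's only a sketch: `cpu_relax` and `PauseSpinLock` are names I made up, not anything from the post, and it has exactly the fairness problem described above, since whichever thread wins the exchange gets the lock.

```cpp
#include <atomic>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define HAVE_X86_PAUSE 1
#endif

// Hypothetical helper: emit the x86 PAUSE hint where available, do nothing elsewhere.
inline void cpu_relax() {
#ifdef HAVE_X86_PAUSE
    _mm_pause();
#endif
}

// Hypothetical minimal spin lock (not the implementation from the post).
class PauseSpinLock {
    std::atomic<bool> locked_{false};

public:
    // Returns true if the lock was free and is now held by the caller.
    bool try_lock() {
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void lock() {
        // exchange returns the previous value: true means someone else holds the lock.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            cpu_relax();  // hint to the core that this is a spin-wait
        }
    }

    void unlock() {
        locked_.store(false, std::memory_order_release);
    }
};
```

Nothing here orders the waiters, so one unlucky thread can keep losing the race, which is where the tail latency comes from.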
You can also dramatically reduce the latency by testing the lock with a plain load before attempting the CAS, which turns a lot of cache line requests into Shares (read-only copies) instead of Owns (exclusive ownership). I didn’t see it in this implementation.
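A minimal sketch of the test-before-CAS idea (usually called test-and-test-and-set), assuming a simple boolean lock. `TtasSpinLock` is a hypothetical name, not the lock from the post:

```cpp
#include <atomic>

// Hypothetical test-and-test-and-set lock (not the implementation from the post).
class TtasSpinLock {
    std::atomic<bool> locked_{false};

public:
    bool try_lock() {
        // Cheap read first: a load only needs the cache line in shared state.
        if (locked_.load(std::memory_order_relaxed)) {
            return false;
        }
        bool expected = false;
        return locked_.compare_exchange_strong(expected, true,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed);
    }

    void lock() {
        for (;;) {
            // Spin on loads only; no ownership request while the lock looks taken.
            while (locked_.load(std::memory_order_relaxed)) {
            }
            // The lock looked free, so now pay for the CAS.
            bool expected = false;
            if (locked_.compare_exchange_weak(expected, true,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
                return;
            }
        }
    }

    void unlock() {
        locked_.store(false, std::memory_order_release);
    }
};
```

The waiters all read the same shared copy of the line until the holder's store invalidates it, instead of bouncing ownership between cores on every failed CAS.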
But the point of the wait/notify in the post is to sleep, if I understand correctly. My guess is that there is some amount of spurious waking up happening, or that multiple threads are woken up by the notify_one (it only guarantees at least one is woken up!). An exponential backoff would dramatically reduce the probability of these spurious wakeups colliding.
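Here is a rough sketch of what I mean by exponential backoff. To keep it short it busy-waits instead of sleeping, so it is not the wait/notify lock from the post; in that lock the same delay would go between being woken up and retrying the CAS. The names (`next_backoff`, `BackoffSpinLock`) and the constants are made up for illustration, and a real version would usually add some random jitter so the waiters don't retry in lockstep.

```cpp
#include <atomic>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define HAVE_X86_PAUSE 1
#endif

// Hypothetical helper: x86 PAUSE where available, no-op elsewhere.
inline void cpu_relax() {
#ifdef HAVE_X86_PAUSE
    _mm_pause();
#endif
}

// Illustrative bounds on the delay, counted in cpu_relax() calls (assumed values).
constexpr std::uint32_t kMinSpins = 1;
constexpr std::uint32_t kMaxSpins = 1024;

// Double the delay after each failed attempt, saturating at kMaxSpins.
inline std::uint32_t next_backoff(std::uint32_t spins) {
    return spins >= kMaxSpins / 2 ? kMaxSpins : spins * 2;
}

// Hypothetical lock with exponential backoff between acquisition attempts.
class BackoffSpinLock {
    std::atomic<bool> locked_{false};

public:
    bool try_lock() {
        bool expected = false;
        return locked_.compare_exchange_strong(expected, true,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed);
    }

    void lock() {
        std::uint32_t spins = kMinSpins;
        while (!try_lock()) {
            // Failed attempt: wait, then wait twice as long next time.
            for (std::uint32_t i = 0; i < spins; ++i) {
                cpu_relax();
            }
            spins = next_backoff(spins);
        }
    }

    void unlock() {
        locked_.store(false, std::memory_order_release);
    }
};
```

Threads that lose a race end up retrying at increasingly different times, so a burst of wakeups is less likely to turn into a burst of colliding CAS attempts.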
On a separate note, fairness isn’t always required, just throughput. If you need predictable latency then this lock won’t work for you, but in this case optimizing for mean latency seems more worthwhile.