I had the pleasure of reverse-engineering Win32 SRWLOCKs, and based on the author's description of nsync, it is very close to how SRWLOCK works internally. I'm kind of surprised how much faster nsync is compared to SRWLOCK.
The post doesn't include any benchmarks for the uncontended case, where I've found SRWLOCK is also very fast. I'm curious why this wasn't included.
At least for what I use locks for, the uncontended case is like 10000x more common. I actually don't think I have any super heavy contention such as the case shown in the post, as this is simply something to be avoided: no matter how good the mutex, this won't play well with the memory system.
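
For what it's worth, here is roughly the kind of uncontended measurement I have in mind. This is only a minimal sketch: it uses `std::mutex` as a portable stand-in rather than SRWLOCK or nsync, and `uncontended_ns_per_op` is a name I made up for illustration. A single thread takes and releases the lock in a loop, so there is never any contention, and the result is the average cost of one lock/unlock pair.

```cpp
#include <chrono>
#include <cstdint>
#include <mutex>

// Hypothetical helper (not from the post): average nanoseconds per
// uncontended lock/unlock pair on a std::mutex. Only one thread ever
// touches the mutex, so every acquisition takes the fast path.
double uncontended_ns_per_op(std::uint64_t iterations, std::uint64_t &counter) {
    std::mutex m;
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> guard(m);  // lock here, unlock at end of scope
        ++counter;                             // trivial critical section
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling it from `main` with something like ten million iterations and printing the return value gives a rough per-operation number. Swapping the `std::mutex` for the lock under test would be the obvious next step, and the absolute figures will vary a lot by platform and compiler.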