I think tialaramex is overselling the complexity of concurrent memory models in practice, at least for end users. In reality, all modern memory models are based on the data-race-free theorem, which states that in the absence of data races--if your program is correctly synchronized--you can't tell that the hardware isn't sequentially consistent (i.e., what you naïvely expected it to do).
Correct synchronization is based on the happens-before relation; a data race is defined as a write and a conflicting read or write such that neither happens-before the other. Within a thread, happens-before is just regular program order. Across a thread, the main happens-before that is relevant is that an release-store on a memory location happens-before an acquire-load on that memory location (this can be generalized to any memory location if they're both sequentially-consistent, but that's usually not necessary).
The real cardinal rule of concurrent programming is to express your semantics in the highest-possible level of what you're trying to do, and find some library that does all the nitty-grityy of the implementation. Can you express it with fork-join parallelism? Cool, use your standard library's implementation of fork-join and just don't care about it otherwise.
Correct synchronization is based on the happens-before relation; a data race is defined as a write and a conflicting read or write such that neither happens-before the other. Within a thread, happens-before is just regular program order. Across a thread, the main happens-before that is relevant is that an release-store on a memory location happens-before an acquire-load on that memory location (this can be generalized to any memory location if they're both sequentially-consistent, but that's usually not necessary).
The real cardinal rule of concurrent programming is to express your semantics in the highest-possible level of what you're trying to do, and find some library that does all the nitty-grityy of the implementation. Can you express it with fork-join parallelism? Cool, use your standard library's implementation of fork-join and just don't care about it otherwise.