RMSNorm is pretty insignificant in terms of the overall compute in a transformer, though; usually the reduction work can be fused with earlier or later operations.
RMSNorm acts like a barrier: no compute in the next network layer can start before all compute in the previous layer is done.
When a network is split across multiple GPUs, this means you must wait for the slowest node and the longest latency.
As soon as you can remove most of these barriers, compute over networks without latency guarantees becomes more practical, as does non-homogeneous compute (i.e. mixing different GPU models).
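
To make the barrier point concrete, here is a rough sketch in plain Python. It assumes the hidden vector is sharded across devices (simulated here as lists), leaves out the learned gain, and uses a made-up function name (rmsnorm_sharded) and made-up latency numbers, so treat it as an illustration rather than how any real framework does it. The normalization scale depends on the sum of squares over every shard, so no device can emit output until the slowest one has reported its partial sum.

```python
import math

def rmsnorm_sharded(shards, shard_latencies_ms, eps=1e-6):
    """RMSNorm (no learned gain) over a hidden vector split across devices.

    `shards` and `shard_latencies_ms` are hypothetical stand-ins for
    per-device slices and per-device response times.
    """
    # Step 1: each device reduces its own slice to a partial sum of squares.
    partial_sums = [sum(x * x for x in shard) for shard in shards]

    # Step 2 (the barrier): the scale needs the sum over ALL slices, so the
    # wait is set by the slowest device, not the average one.
    wait_ms = max(shard_latencies_ms)
    total = sum(partial_sums)
    n = sum(len(shard) for shard in shards)
    scale = 1.0 / math.sqrt(total / n + eps)

    # Step 3: only now can each device scale its own slice.
    normalized = [[x * scale for x in shard] for shard in shards]
    return normalized, wait_ms

# Two simulated devices; the second one is slow.
shards = [[3.0, 4.0], [0.0, 0.0]]
out, wait_ms = rmsnorm_sharded(shards, shard_latencies_ms=[2.0, 35.0])
```

In this toy run the fast device finishes its partial sum after 2 ms but still waits 35 ms for the slow one, which is the effect described above.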