RMSNorm is pretty insignificant in terms of the overall compute in a transformer ...

londons_explore · 2025-03-15T07:33:55 1742024035

RMSNorm acts like a barrier. No compute on the next network layer can start before all compute in the previous layer is done.

When a network is split across multiple GPUs, this means you must wait for the slowest node and the longest latency.

As soon as you can remove most of these barriers, compute over non-latency-guaranteed networks becomes more practical, as does non-homogeneous compute (i.e. mixing different GPU models).
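
A rough sketch of the barrier argument, assuming the hidden dimension is sharded across devices: RMSNorm needs a reduction over the whole vector, while an elementwise replacement such as DyT (the tanh-based layer from the paper discussed in this thread) can be computed on each shard with no communication. The helper names, the `alpha` value, and the toy input are illustrative only, and the learnable scale and shift parameters are left out.

```python
import math

def rms_norm(x, eps=1e-6):
    # Needs the mean of squares over the WHOLE vector: a reduction.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def dyt(x, alpha=0.5):
    # Elementwise: each output depends only on its own input.
    # (Learnable scale/shift omitted; alpha=0.5 is an arbitrary choice.)
    return [math.tanh(alpha * v) for v in x]

def sharded(fn, x, n_shards):
    # Hypothetical helper: apply fn to each shard independently,
    # as if every shard lived on a device that never talks to the others.
    size = len(x) // n_shards
    out = []
    for i in range(n_shards):
        out.extend(fn(x[i * size:(i + 1) * size]))
    return out

x = [1.0, -2.0, 3.0, -4.0]

# The elementwise op gives the same answer with or without sharding.
dyt_matches = sharded(dyt, x, 2) == dyt(x)

# RMSNorm does not: each shard sees a different RMS unless the
# sum of squares is exchanged first, which is the synchronization point.
rms_matches = sharded(rms_norm, x, 2) == rms_norm(x)

print(dyt_matches, rms_matches)
```

This says nothing about other sync points (attention, tensor-parallel matmuls); it only isolates the one the normalization layer adds.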

elcritch · 2025-03-15T08:31:43 1742027503

What are other barriers in transformers? Or is the normalization layer the primary one?

woadwarrior01 · 2025-03-15T09:13:29 1742030009

Dot-product attention is the biggest barrier. This is why there are so many attempts to linearize it.

amitport · 2025-03-15T15:07:55 1742051275

...that fail. Linearization is a bad idea. But plenty of other optimizations are being done.

atgctg · 2025-03-15T10:40:15 1742035215

The paper's Table 7 shows DyT reducing overall LLaMA 7B inference time by 7.8% and training time by 8.2%. That is not insignificant.

Herring · 2025-03-15T16:46:32 1742057192

But LLM performance scales according to the log of compute, so yeah it’s pretty insignificant. I think we’ve reached a bit of a plateau.
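
A back-of-the-envelope version of that point, taking the 8.2% training-time figure quoted upthread and assuming (as this comment does) that quality grows with the log of compute. The variable names are made up for the illustration, and nothing here comes from the paper beyond the 8.2% number.

```python
import math

training_time_reduction = 0.082  # figure quoted from the paper's Table 7

# Same wall-clock budget now buys this much more effective compute.
compute_gain = 1.0 / (1.0 - training_time_reduction)

# Under a log-of-compute assumption, what matters is the gain
# measured in doublings of compute.
doublings = math.log2(compute_gain)

print(f"{compute_gain:.3f}x compute, {doublings:.2f} doublings")
```

So the saving is worth roughly an eighth of a compute doubling under that assumption; whether that counts as significant depends on what a doubling costs you.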