
RMSNorm is pretty insignificant in terms of the overall compute in a transformer, though -- usually the reduction work can be fused with earlier or later operations.
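For reference, a minimal NumPy sketch of what RMSNorm computes (illustrative, not any particular library's kernel):

    import numpy as np

    def rmsnorm(x, gain, eps=1e-6):
        # Row-wise reduction over the hidden dimension: every element
        # of a token's activation feeds into its normalizer.
        rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
        return x / rms * gain

    x = np.random.randn(4, 512)   # (tokens, hidden)
    y = rmsnorm(x, np.ones(512))

The reduction is O(d) per token versus O(d^2) for the surrounding matmuls, which is why it's cheap enough to fuse.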


RMSNorm acts like a barrier: no compute in the next network layer can start before all compute in the previous layer is done.

When splitting a network across multiple GPUs, this means you must wait for the slowest node and the longest latency.

As soon as you can remove most of these barriers, compute over networks without latency guarantees becomes more practical, as does non-homogeneous compute (i.e., mixing different GPU models).


What are other barriers in transformers? Or is the normalization layer the primary one?


Dot-product attention is the biggest barrier. This is why there are so many attempts to linearize it.
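Concretely, the barrier is the row-wise softmax: each output depends on a reduction over the scores against every key. A minimal sketch (illustrative only):

    import numpy as np

    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        # The softmax needs the max and the sum over the whole row,
        # i.e. a reduction across all positions -- the synchronization
        # point that linear-attention variants try to remove.
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V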


Attempts that fail... linearization is a bad idea. But plenty of other optimizations are done.


The paper's Table 7 shows DyT reducing overall LLaMA 7B inference time by 7.8% and training time by 8.2%. That is not insignificant.
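It's also relevant to the barrier discussion above: DyT is purely elementwise. If I'm reading the paper right, it's tanh(alpha * x) with a learnable alpha, plus the usual scale and shift. Rough sketch:

    import numpy as np

    def dyt(x, alpha, gamma, beta):
        # Elementwise only: each activation is transformed
        # independently, so there's no reduction and hence no
        # synchronization barrier.
        return gamma * np.tanh(alpha * x) + beta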


But LLM performance scales with the log of compute, so yeah, it's pretty insignificant. I think we've reached a bit of a plateau.
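Back-of-the-envelope, assuming a power-law loss curve (the exponent here is made up for illustration):

    import math

    # Hypothetical scaling: loss ~ C^(-0.05).
    # An 8% compute saving is only ~0.11 of a doubling:
    print(math.log(1.08, 2))   # ~0.111
    print(1.08 ** -0.05)       # ~0.996, i.e. ~0.4% change in loss

So an 8% speedup barely moves the quality needle under that kind of curve, even though it's a real cost saving.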



