Well-done post. I'd like to read more of their work, and it's exciting to see these new ideas. Though, as other people have said, the one set of empirical results they present is a bit... confusing? I'd think they'd have more compelling examples to present, given all the pretty math.

Their modular norm paper (https://arxiv.org/abs/2405.14813) has several more examples (see Appendix D in particular), but these are also mystifying. Yes, they're interested in how things scale, but am I the only one who thinks the training losses they report are just not competitive with what is currently being used?