
[Layer] normalization constrains the huge vectors representing tokens (input fragments) to positions on a hypersphere (I think), and the attention mechanism then rotates them based on their angles relative to all the others.
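
Rough sketch of that geometric picture in numpy (my reading of it, not the paper's code; the dimensions and the omission of learned scale/shift are simplifications):

    import numpy as np

    rng = np.random.default_rng(0)
    n_tokens, d_model = 4, 8
    x = rng.normal(size=(n_tokens, d_model))        # raw token vectors

    # Layer norm (without learned scale/shift): zero mean, unit variance per
    # token, which puts every vector at radius sqrt(d_model), i.e. on a sphere.
    x_norm = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
    print(np.linalg.norm(x_norm, axis=1))           # all ~ sqrt(8)

    # Dot products between the normalized vectors are just cosines of the
    # angles between them (times the constant radius), so attention weights
    # end up being a softmax over those angles.
    scores = x_norm @ x_norm.T / d_model            # cosine-like similarities
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    print(weights)                                  # each row sums to 1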

I only skimmed the paper but believe the point here is that there are relatively simple functions hiding in or recoverable from the bigger network which specifically address certain categories of relationships between concepts.

Since it would, in theory, be possible to optimize such functions more directly if they can be isolated, could this enable advances in how such models are trained? Absolutely.
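
As a toy illustration of what "optimizing such a function directly" could mean (entirely hypothetical, not anything from the paper): if an isolated sub-circuit turns out to behave like a simple linear map, you can fit that map on its own with least squares instead of training it end to end.

    import numpy as np

    rng = np.random.default_rng(1)
    d = 8
    w_circuit = np.eye(d) + 0.1 * rng.normal(size=(d, d))   # stand-in for a recovered sub-circuit

    def subcircuit(v):
        """Hypothetical component extracted from the larger model."""
        return v @ w_circuit

    inputs = rng.normal(size=(256, d))
    targets = subcircuit(inputs)

    # Fit a simple, interpretable replacement for the sub-circuit directly,
    # without touching the rest of the network's training loop.
    w_simple, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    print(np.abs(inputs @ w_simple - targets).max())         # ~0 for a linear circuit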

After all, one of the best criticisms of “modern” AI is the notion that we’re just mixing around a soup of linear algebra. Allowing some sense of modularity (reductionism) could make these models less of a black box and more of a component-driven approach (in the lagging concept space, not just the leading layer space).


