
TL;DR: The OP notes that we currently use all sorts of tricks of the trade, including normalization layers, to keep unit values in DNNs from getting too large or too small during training. Keeping unit values in a reasonable range prevents numerical underflow/overflow, and also speeds up learning by keeping the magnitudes of updates small relative to the weights. The OP proposes that we instead constrain the weights at each layer to lie in sub-manifolds with unit condition number[a], and that we modify/design SGD algorithms to work well within those manifolds.
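
To make the proposal concrete, here is a minimal sketch (mine, not the OP's actual algorithm) of what "SGD constrained to a unit-condition-number manifold" could look like for a single weight matrix: take an ordinary SGD step, then retract back onto the manifold by replacing every singular value with their mean, so that sigma_max/sigma_min = 1. The function names and the choice of retraction are assumptions on my part.

  import numpy as np

  def retract_to_unit_condition_number(W):
      # Replace every singular value with the mean singular value, so that
      # sigma_max / sigma_min == 1 (condition number 1). This is one simple
      # retraction; the OP's manifold and retraction may be defined differently.
      U, s, Vt = np.linalg.svd(W, full_matrices=False)
      return s.mean() * (U @ Vt)

  def manifold_sgd_step(W, grad, lr=1e-2):
      # Plain SGD step in the ambient space, then project back onto the manifold.
      return retract_to_unit_condition_number(W - lr * grad)

  # Toy usage on a random "layer" and a made-up gradient.
  rng = np.random.default_rng(0)
  W = rng.standard_normal((64, 32))
  g = rng.standard_normal((64, 32))
  W = manifold_sgd_step(W, g)
  s = np.linalg.svd(W, compute_uv=False)
  print(s.max() / s.min())  # ~1.0, up to floating-point error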

I find the idea compelling, but it's too early to know whether it will work well at scale, i.e., with large models in the real world.

--

[a] https://en.wikipedia.org/wiki/Condition_number

--

EDIT: On the other hand, yesterday I saw a paper that does basically the opposite: it lets unit values in DNNs get as large or as small as they need to by mapping them to complex logarithms and keeping them in that domain: https://openreview.net/forum?id=SUuzb0SOGu . I find this opposing idea oddly compelling too, but I don't know how well it works either, because it hasn't been tested at scale.
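
For flavor, here is a toy illustration (again mine, not the paper's actual formulation) of why staying in the complex-log domain sidesteps overflow/underflow: magnitudes become the real parts of logs, signs become phases in the imaginary parts, multiplication becomes addition, and products whose magnitudes would underflow in float64 stay representable.

  import numpy as np

  def to_clog(x):
      # Complex log of a real value: real part = log|x|, imaginary part = phase
      # (0 for positive values, pi for negative ones). Illustrative only.
      x = np.asarray(x, dtype=float)
      return np.log(np.abs(x)) + 1j * np.pi * (x < 0)

  a, b = 1e-300, 1e-250
  print(a * b)                           # 0.0 -- underflows in float64
  print((to_clog(a) + to_clog(b)).real)  # ~ -1266.4, i.e. log(1e-550), still representable

Converting back with exp() would of course underflow again, which is presumably why the values are kept in that domain rather than round-tripped.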


