Dropout is roughly equivalent to layer-specific L2 regularization, and it's not hard to see why: in expectation, dropping out random neurons adds a penalty term that shrinks each weight towards zero in proportion to its magnitude, which is exactly the effect of penalizing the (squared) weight magnitudes.
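If you want to see the linear-regression case concretely, here's a minimal numerical sketch (the data, weights, and keep-probability below are all made up for the demo): the expected squared loss under input dropout works out to the loss of a scaled predictor plus a feature-weighted L2 penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 200, 5, 0.8  # samples, features, keep-probability (arbitrary choices)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

# Monte Carlo estimate of the expected squared loss when inputs are randomly dropped.
n_masks = 5000
mc = np.mean([
    np.sum((y - (X * rng.binomial(1, p, size=(n, d))) @ w) ** 2)
    for _ in range(n_masks)
])

# Closed form: loss on the scaled predictor plus a feature-weighted L2 penalty,
# i.e. dropout behaves like ridge with a per-feature scale.
closed = np.sum((y - p * X @ w) ** 2) + p * (1 - p) * np.sum((X ** 2) @ (w ** 2))

print(mc, closed)  # the two agree up to Monte Carlo error
```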
Trevor Hastie's Elements of Statistical Learning has a nice derivation showing that (for linear models) L2 regularization is also a soft form of dimensionality reduction: the fit gets shrunk along the principal components of the inputs, with the low-variance directions shrunk the most. You could use that to motivate a "simplicity prior" idea in deep learning.
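Here's a quick sketch of that connection on synthetic data with an arbitrary penalty: written in the SVD basis of X, the ridge fit shrinks each principal direction by d_j^2 / (d_j^2 + lambda), so the low-variance directions are effectively discarded.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 100, 6, 5.0  # sample size, features, ridge penalty (arbitrary)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Ridge fit computed directly.
beta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
fit_direct = X @ beta

# Same fit in the SVD basis: each principal direction u_j is shrunk by
# d_j^2 / (d_j^2 + lam), so low-variance directions are (almost) dropped,
# a soft version of principal-components regression.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
fit_svd = U @ np.diag(s**2 / (s**2 + lam)) @ (U.T @ y)

print(np.allclose(fit_direct, fit_svd))  # True
```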
Yet another way of thinking about it, in the context of ReLU units, is that a layer of ReLUs forms a truncated hyper-plane basis in feature space (like splines, but in higher dimensions), and regularization induces smoothness in this N-dimensional basis by shrinking it towards a flat hyper-plane.
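In one dimension you can see the "elbows" directly. Here's a tiny sketch with made-up weights: a 1-D ReLU layer is just a linear spline whose slope changes by a_k * |w_k| at each unit's knot, so an L2 penalty on the weights directly shrinks those kinks towards a flat line.

```python
import numpy as np

# 1-D ReLU "layer": f(x) = sum_k a_k * max(0, w_k * x + b_k),
# i.e. a linear combination of hinge (truncated-line) basis functions.
w = np.array([1.0, -2.0, 0.5, 3.0])  # hypothetical first-layer weights
b = np.array([0.5, 1.0, -1.5, 2.0])  # hypothetical biases
a = np.array([0.7, -1.2, 2.0, 0.3])  # hypothetical output weights

def f(x):
    return float(np.maximum(0.0, w * x + b) @ a)

eps = 1e-6
for k, knot in enumerate(-b / w):  # unit k switches on/off at x = -b_k / w_k
    slope_left = (f(knot) - f(knot - eps)) / eps
    slope_right = (f(knot + eps) - f(knot)) / eps
    # The change of slope (the "elbow") at each knot has size a_k * |w_k|, so
    # shrinking the weights with L2 shrinks the kinks towards a flat function.
    print(round(slope_right - slope_left, 4), round(a[k] * abs(w[k]), 4))
```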
Wow! I think I dimly intuited your first paragraph already; I directionally get why your second might be true (although I'd have thought L1 was even more so, since it encourages zeros which is kind of like choosing a subspace).
Your third paragraph took me ages to get an intuition for - is the idea that regularisation penalises having "sharp elbows" at the join points of your hyper-spline thing? That's mind-blowing and such an interesting way to think about what a ReLU layer is doing.
Thanks so much for a thought provoking comment, that's incredibly cool.