
Thanks for the response. Hmm, it's still pretty mysterious to me. Why should a deep network with the same number of parameters as a wide network represent a wider variety of functions? In some sense they represent the "same" number of functions, in that the manifolds of functions given by two network architectures with the same number of parameters have the same dimension, even if one is wide while the other is deep.
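
For concreteness, here's a minimal sketch of that setup (assuming PyTorch; the specific widths are arbitrary, chosen only so the totals roughly match): a wide one-hidden-layer net and a deep net with nearly the same parameter count.

    import torch.nn as nn

    def count_params(model):
        return sum(p.numel() for p in model.parameters())

    # Wide: a single hidden layer of 512 units on a 100-d input.
    wide = nn.Sequential(nn.Linear(100, 512), nn.ReLU(),
                         nn.Linear(512, 1))

    # Deep: three hidden layers of 136 units, sized so the totals
    # roughly match the wide net's.
    deep = nn.Sequential(nn.Linear(100, 136), nn.ReLU(),
                         nn.Linear(136, 136), nn.ReLU(),
                         nn.Linear(136, 136), nn.ReLU(),
                         nn.Linear(136, 1))

    print(count_params(wide))  # 52225
    print(count_params(deep))  # 51137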


I think a deeper network has fewer degrees of freedom in which to move, or rather, in which to move usefully, because its parameters are more interdependent. That means that in order to produce a useful function, it has to learn more abstract features than a shallow, wide network. This is because any adjustment towards irrelevant features that are unique to a small number of examples has a larger negative impact on the rest of the data than it would in a shallower net (except in the later layers). Over time, abstract changes are stochastically rewarded and overly specific changes are penalised, at least in the earlier layers, and the later layers then have to work with this more abstract information, so they simply can't overfit that much.
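
To make "more interdependent" concrete, here's a rough sketch (assuming PyTorch; the tanh activations and the particular weight indices are arbitrary illustration choices). It takes the gradient of the output with respect to one first-layer weight, then nudges a later-layer weight on a different path. In the wide net that gradient is untouched; in the deep net it shifts, because the chain rule threads every early-layer gradient through all the later layers.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(1, 100)

    def grad_of_first_weight(model):
        # Gradient of the scalar output w.r.t. one first-layer weight.
        model.zero_grad()
        model(x).sum().backward()
        return model[0].weight.grad[0, 0].item()

    wide = nn.Sequential(nn.Linear(100, 512), nn.Tanh(),
                         nn.Linear(512, 1))
    deep = nn.Sequential(nn.Linear(100, 136), nn.Tanh(),
                         nn.Linear(136, 136), nn.Tanh(),
                         nn.Linear(136, 136), nn.Tanh(),
                         nn.Linear(136, 1))

    # Nudge a weight that is not on the tracked gradient's own path:
    # for the wide net, an output weight of a *different* hidden unit;
    # for the deep net, an arbitrary second-layer weight.
    for name, net, idx in [("wide", wide, (0, 1)), ("deep", deep, (5, 7))]:
        before = grad_of_first_weight(net)
        with torch.no_grad():
            net[2].weight[idx] += 1.0
        after = grad_of_first_weight(net)
        print(name, "gradient changed:", before != after)
        # wide gradient changed: False; deep gradient changed: True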

Would be interested in OP's take on this though.



