
Thanks for the response. Hmm, it's still pretty mysterious to me. Why should a deep network with the same number of parameters as a wide network represent a wider variety of functions? In some sense they represent the "same" number of functions, in that the manifolds of functions given by two network architectures with the same number of parameters have the same dimension, even if one is wide while the other is deep.
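
For concreteness, here's a minimal sketch of that setup (assuming PyTorch; the specific widths are arbitrary, chosen only so the totals roughly match): a wide one-hidden-layer net and a deep net with nearly the same parameter count.

    import torch.nn as nn

    def count_params(model):
        return sum(p.numel() for p in model.parameters())

    # Wide: a single hidden layer of 512 units on a 100-d input.
    wide = nn.Sequential(nn.Linear(100, 512), nn.ReLU(),
                         nn.Linear(512, 1))

    # Deep: three hidden layers of 136 units, sized so the totals
    # roughly match the wide net's.
    deep = nn.Sequential(nn.Linear(100, 136), nn.ReLU(),
                         nn.Linear(136, 136), nn.ReLU(),
                         nn.Linear(136, 136), nn.ReLU(),
                         nn.Linear(136, 1))

    print(count_params(wide))  # 52225
    print(count_params(deep))  # 51137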


I think a deeper network has fewer degrees of freedom in which to move, or rather, in which to move usefully, because its parameters are more interdependent. That means that in order to produce a useful function, it has to learn more abstract features than a shallow, wide network. This is because any adjustment towards irrelevant features that are unique to a small number of examples has a larger negative impact on the rest of the data than it would in a shallower net (except in the later layers). Over time, abstract changes are stochastically rewarded and overly specific changes are penalised, at least in the earlier layers, and the later layers then have to work with this more abstract information, so they simply can't overfit that much.
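
To make "more interdependent" concrete, here's a rough sketch (assuming PyTorch; the tanh activations and the particular weight indices are arbitrary illustration choices). It takes the gradient of the output with respect to one first-layer weight, then nudges a later-layer weight on a different path. In the wide net that gradient is untouched; in the deep net it shifts, because the chain rule threads every early-layer gradient through all the later layers.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(1, 100)

    def grad_of_first_weight(model):
        # Gradient of the scalar output w.r.t. one first-layer weight.
        model.zero_grad()
        model(x).sum().backward()
        return model[0].weight.grad[0, 0].item()

    wide = nn.Sequential(nn.Linear(100, 512), nn.Tanh(),
                         nn.Linear(512, 1))
    deep = nn.Sequential(nn.Linear(100, 136), nn.Tanh(),
                         nn.Linear(136, 136), nn.Tanh(),
                         nn.Linear(136, 136), nn.Tanh(),
                         nn.Linear(136, 1))

    # Nudge a weight that is not on the tracked gradient's own path:
    # for the wide net, an output weight of a *different* hidden unit;
    # for the deep net, an arbitrary second-layer weight.
    for name, net, idx in [("wide", wide, (0, 1)), ("deep", deep, (5, 7))]:
        before = grad_of_first_weight(net)
        with torch.no_grad():
            net[2].weight[idx] += 1.0
        after = grad_of_first_weight(net)
        print(name, "gradient changed:", before != after)
        # wide gradient changed: False; deep gradient changed: True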

Would be interested in OP's take on this though.



