
Proper initialization is more important.
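For concreteness, a minimal sketch of one common meaning of "proper initialization" (He/Kaiming init for ReLU networks; the layer sizes here are arbitrary placeholders, not from the comment):

    import torch
    import torch.nn as nn

    # He/Kaiming initialization: weight std scales with 1/sqrt(fan_in),
    # with a sqrt(2) gain to compensate for ReLU zeroing half the inputs.
    layer = nn.Linear(512, 256)
    nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
    nn.init.zeros_(layer.bias)

    # Fan-in scaling keeps pre-activation variance roughly constant from
    # layer to layer, so gradients neither vanish nor explode at the start.
    x = torch.randn(1024, 512)
    print(x.std().item(), layer(x).std().item())  # same order of magnitude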

Batch norm and similar normalization layers speed up convergence by forcing the model to focus on creating second- and higher-order nonlinearities: a simple shift in the mean or standard deviation of a layer's output distribution is normalized away, so the gradient never points in a direction that would only change those properties of the output distribution.
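A toy check of that claim (my own illustration, not from the comment): batch norm's output is invariant to shifting or rescaling an entire batch per feature, so the gradient it passes back has no component along those mean/std directions.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    bn = nn.BatchNorm1d(8, affine=False)  # training mode: uses batch stats
    x = torch.randn(32, 8, requires_grad=True)

    loss = bn(x).pow(3).sum()  # any nonlinear loss works
    loss.backward()
    g = x.grad

    # Per-feature gradient sums to ~0: shifting the batch mean of any
    # feature leaves the normalized output unchanged.
    print(g.sum(dim=0))        # ~zeros
    # Gradient is also orthogonal to x per feature: rescaling a whole
    # feature column changes nothing (up to the eps in the denominator).
    print((g * x).sum(dim=0))  # ~zeros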


