> for some reason, the structure of the network means that the estimated output learns realistic features first, and then overfits to the noise afterwords.
Why does this happen? What characteristics does the network structure have that cause this effect?
Why does this happen? What characteristics does the network structure have that cause this effect?