Yes, and that's the problem. What Zhang et al. [2] showed convincingly in the Rethinking paper is that focusing on the hypothesis space alone cannot be enough, since the same hypothesis space fits both real and random data, so it's already too large. Therefore, methods that focus on the hypothesis space have to invoke a bias in practice towards a better subspace, and that already requires studying the specific optimization algorithm in order to understand why it picks certain hypotheses over others in the space.
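
To make the "fits random data just as well" point concrete, here's a rough numpy sketch of my own (not the experiment from [2], which uses real image datasets and deep nets): an overparameterized random-features linear model interpolates structured labels and pure-noise labels equally well, so the capacity of the hypothesis space by itself can't tell the two cases apart.

    # Toy sketch (not from [2]): an overparameterized random-features model
    # fits random labels exactly, so hypothesis-space capacity alone can't
    # separate "real" fits from memorized ones.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, p = 50, 10, 200                      # 50 samples, 10 raw features, 200 random features

    X = rng.normal(size=(n, d))
    y_real = np.sign(X[:, 0])                  # labels with structure
    y_rand = rng.choice([-1.0, 1.0], size=n)   # pure-noise labels

    W = rng.normal(size=(d, p))
    Phi = np.tanh(X @ W)                       # random-feature map, p >> n

    for name, y in [("real labels", y_real), ("random labels", y_rand)]:
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # an interpolating solution
        train_err = np.mean(np.sign(Phi @ w) != y)
        print(name, "train error:", train_err)        # ~0 in both cases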

But once you are ready to do that, algorithmic stability is enough. You then don't need to think about Bayesian ensembles or other proxies/simplifications; you can focus on just the specific learning setup you have. BTW, algorithmic stability is not a new idea: an early version showed up within a few years of VC theory, back in the 80s, to explain why nearest neighbors generalizes (it wasn't called algorithmic stability then, though).

If you are interested in this, I also recommend [3].

[2] https://arxiv.org/abs/1611.03530

[3] https://arxiv.org/abs/1902.04742




But it's not a problem; it's actually a good thing that OP's explanation is more general. One of the main points of the OP paper is that you do not, in fact, need proxies or simplifications: you can derive generalization bounds that do explain this behavior without relying on optimization dynamics. That responds exactly to the tests set forth in Zhang et al. The OP does not "rely on Bayesian ensembles, or other proxies/simplifications"; that seems to be a misunderstanding of the paper. It analyzes the solutions that neural networks actually reach, which differentiates it from a lot of other work. It also shows how other simple model classes reproduce the same behavior, and those reproductions do not depend on optimization.

"and that already requires studying the specific optimization algorithm in order to understand why it picks certain hypothesis over others in the space." But the OP paper explains how even "guess and check" can generalize similarly to SGD. It's becoming more well understood that the role of the optimizer may have been historically overstated for understanding DL generalization. It seems to be more about loss landscapes.

Don't get me wrong, the references you're linking are super interesting. But they don't take away from the OP paper, which adds something quite valuable to the discussion.


Thank you for the great discussion. You've put your finger on the right thing, I think. We can now dispense with the old VC-type thinking (i.e., that we get generalization because the hypothesis space is not too complex). Instead, the real question is this: is it the loss landscape itself, or the particular way in which the landscape is searched, that leads to good generalization in deep learning?

One can imagine an "exhaustive" search of the loss landscape with, say, God's computer, picking an arbitrary point among all the points that minimize the training loss (or come close to the minimum). Or, with our computers, we can merely sample. But in both cases it's hard to see how one would avoid picking "memorization" solutions in the loss landscape. Recall that in an over-parameterized setting there will be many solutions with the same low training loss but very different test losses. The reference in my original post [1] gives a nice example with a toy overparameterized linear model (Section 3), where multiple linear models fit the training data but generalize very differently. (It also shows why GD ends up picking the better-generalizing solution.)
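
Here's a rough numpy version of that phenomenon (my own toy setup, not the exact Section 3 example from [1]): with more parameters than samples, many weight vectors fit the training data exactly but have wildly different test error, and gradient descent started from zero lands on the minimum-norm one.

    # Overparameterized linear regression: many exact interpolators,
    # very different test losses. GD from w = 0 converges to the
    # minimum-norm interpolator for linear least squares.
    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 20, 100                               # d > n: overparameterized
    w_true = np.zeros(d); w_true[:5] = 1.0       # sparse "ground truth"

    X_tr = rng.normal(size=(n, d)); y_tr = X_tr @ w_true
    X_te = rng.normal(size=(500, d)); y_te = X_te @ w_true

    # Minimum-norm interpolator (what GD from zero init converges to here).
    w_min = np.linalg.pinv(X_tr) @ y_tr

    # Another exact interpolator: add a random null-space component of X_tr.
    _, _, Vt = np.linalg.svd(X_tr)
    null_basis = Vt[n:]                          # directions with X_tr @ v = 0
    w_other = w_min + null_basis.T @ rng.normal(size=d - n) * 5.0

    for name, w in [("min-norm (GD)", w_min), ("other interpolator", w_other)]:
        print(name,
              "train MSE:", np.mean((X_tr @ w - y_tr) ** 2),
              "test MSE:",  np.mean((X_te @ w - y_te) ** 2))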

Now, people have argued that the curvature around the solution is what distinguishes well-generalizing solutions from the rest. With that we are already moving into the territory of how the space is sampled, i.e., the specifics of the search algorithm (a direction you may not like), but even if we press ahead, it's not a satisfactory explanation: in a linear model with L2 loss the curvature is the same everywhere, as Zhang et al. pointed out. So curvature theories already fail for the simplest case, unless one believes that linear models are somehow fundamentally different from deeper, non-linear models.
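
A quick numerical check of that point (my own toy, not from [1] or Zhang et al.): for a linear model with squared loss L(w) = (1/n)||Xw - y||^2 the Hessian is (2/n) X^T X, which does not depend on w at all, so every point of the loss surface is exactly as "flat" as every other.

    # Finite-difference check that the Hessian of the squared loss of a
    # linear model is the same at every w: H = (2/n) * X^T X.
    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 30, 8
    X = rng.normal(size=(n, d)); y = rng.normal(size=n)

    def grad(w):
        return (2.0 / n) * X.T @ (X @ w - y)

    def hessian_at(w, eps=1e-5):
        # Central finite differences of the gradient, column by column.
        H = np.zeros((d, d))
        for i in range(d):
            e = np.zeros(d); e[i] = eps
            H[:, i] = (grad(w + e) - grad(w - e)) / (2 * eps)
        return H

    H1 = hessian_at(rng.normal(size=d))
    H2 = hessian_at(100.0 * rng.normal(size=d))          # a very different point
    print(np.allclose(H1, H2), np.allclose(H1, (2.0 / n) * X.T @ X))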

[1] points out other troubling facts about the curvature explanation (Section 12), but the one I like most is the following: according to curvature theories, the reason for good generalization at the start of training is fundamentally different from the reason for good generalization at the end of training. (As always, generalization is just the gap between test and training performance, so good generalization means that gap is small, not necessarily that the test loss is small.) At the start of GD training, curvature theories don't apply (we just picked a random point, after all), so they would hold that we get good (in fact, perfect) generalization because we haven't looked at the training data. At the end of training, however, they say we get good generalization because we found a flat minimum. This lack of continuity is disconcerting. In contrast, stability-based arguments provide a continuous explanation: the longer you run SGD the less stable it is, so don't run it too long and you'll be fine, since you'll strike an acceptable tradeoff between lowering the training loss and overfitting.
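
If it helps, here's a toy illustration of what the stability argument actually measures (my own sketch, not a proof; in this simple convex setting the gap eventually saturates): run SGD with the same sample order on two training sets that differ in a single example, and watch how far apart the learned parameters drift as training runs longer.

    # Empirical stability of SGD: two training sets differing in one example,
    # same step sizes and same sample order; "less stable" = larger gap
    # between the two learned parameter vectors.
    import numpy as np

    rng = np.random.default_rng(4)
    n, d, lr = 50, 20, 0.01

    X = rng.normal(size=(n, d)); y = rng.normal(size=n)
    X2, y2 = X.copy(), y.copy()
    X2[0], y2[0] = rng.normal(size=d), rng.normal()      # replace one training example

    def sgd(X, y, steps, seed=0):
        r = np.random.default_rng(seed)                  # shared seed -> same index sequence
        w = np.zeros(d)
        for _ in range(steps):
            i = r.integers(n)
            w -= lr * 2 * (X[i] @ w - y[i]) * X[i]       # SGD step on squared loss
        return w

    for steps in (50, 200, 800, 3200):
        gap = np.linalg.norm(sgd(X, y, steps) - sgd(X2, y2, steps))
        print(f"{steps:5d} steps -> parameter gap {gap:.3f}")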

[1]: https://arxiv.org/abs/2203.10036



