
A decade ago the paper "Understanding deep learning requires rethinking generalization" [0] was published. The submission is a response to that paper and subsequent literature.

Deep neural nets are notable for their strong generalization performance: despite being highly overparametrized, they do not seem to overfit the training data in a harmful way. They still perform well on hold-out data, and very often on out-of-distribution data "in the wild". The paper [0] noted a particularly odd feature of neural net training: one can train neural nets on standard datasets to fit completely random labels. There does not seem to be an inductive bias strong enough to rule out bad overfitting: it is in principle possible to train a model that performs perfectly on the training data but gives nonsense on the test data. Yet this doesn't seem to happen in practice.
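
For concreteness, here is a minimal sketch of the random-label experiment in the spirit of [0] (PyTorch; the architecture, optimizer settings, and epoch count are illustrative, not the paper's exact setup):

    # Shuffle CIFAR-10's labels, then train a plain network with SGD and no
    # explicit regularization. Training accuracy climbs toward 100% even
    # though the labels carry no information about the images; accuracy on
    # the (correctly labeled) test set stays near the 10% chance level.
    import random
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    train_set = datasets.CIFAR10("data", train=True, download=True,
                                 transform=transforms.ToTensor())
    random.shuffle(train_set.targets)      # destroy the image-label relationship

    model = nn.Sequential(                 # small MLP; any sufficiently large net works
        nn.Flatten(),
        nn.Linear(3 * 32 * 32, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, 10),
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(train_set, batch_size=128, shuffle=True)

    for epoch in range(200):               # plain SGD: no dropout, no weight decay
        correct = 0
        for x, y in loader:
            opt.zero_grad()
            out = model(x)
            loss_fn(out, y).backward()
            opt.step()
            correct += (out.argmax(1) == y).sum().item()
        print(epoch, correct / len(train_set))   # memorization: heads toward 1.0

With enough capacity and training time the network memorizes the shuffled labels, while its accuracy on the real test labels stays at chance.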

The submission argues that this is unsurprising and fits within standard theoretical frameworks for machine learning. In Section 4 it is claimed that this kind of behavior ("benign overfitting") is common to any learning algorithm with "a flexible hypothesis space, combined with a loss function that demands we fit the data, and a simplicity bias: amongst solutions that are consistent with the data (i.e., fit the data perfectly), the simpler ones are preferred".
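
As a toy instance of those three ingredients (my example, not the submission's): overparametrized linear regression. With more features than samples, infinitely many weight vectors fit the training data exactly; the pseudoinverse picks the one with minimum L2 norm, which plays the role of the simplicity bias in that setting:

    # Flexible hypothesis space + exact fit + simplicity bias, in the
    # simplest possible setting: underdetermined least squares.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 20, 200                            # 20 samples, 200 features
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)

    w_min = np.linalg.pinv(X) @ y             # minimum-L2-norm interpolant
    print(np.abs(X @ w_min - y).max())        # ~1e-14: fits the data exactly

    # Any other interpolant differs by a null-space direction and has larger norm.
    _, _, Vt = np.linalg.svd(X)
    w_other = w_min + 5.0 * Vt[-1]            # Vt[-1] satisfies X @ Vt[-1] ~ 0
    print(np.abs(X @ w_other - y).max())      # also fits the data exactly
    print(np.linalg.norm(w_min), np.linalg.norm(w_other))   # w_min is the smaller one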

The fact that the third of these conditions is satisfied, however, is non-trivial, and in my opinion is still not well understood. The results of [0] are reproducible with a wide variety of architectures, with or without any form of explicit regularization. If there is an inductive bias toward "simpler solutions" in fitting deep neural nets, it has to come either from SGD itself or from some bias that is generic across architectures. It's not something like "CNNs generalize well on image data because of an inductive bias toward translation-invariant features." While there is some work on implicit regularization by SGD, for example, in my opinion it is not sufficient to explain the phenomena observed in [0]. What I would find satisfying is a reproducible ablation study of neural net training that removed benign overfitting (+), so that it was clear exactly which conditions are necessary and sufficient for this behavior in the context of neural nets. As far as I know this has never been done, because it is not known what it would even entail.

(+) To be clear, I think this would not look like "the fit model still generalizes, but we can no longer fit random labels" but rather "the fit model now gives nonsense on holdout data".
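
On the point that the bias would have to "come from SGD itself": a classical toy example (linear, so only suggestive about deep nets) is that gradient descent on an underdetermined least-squares problem, started at zero, converges to the minimum-norm interpolant with no explicit regularizer in sight:

    # Gradient descent on squared loss, initialized at zero, lands on the
    # minimum-norm solution: an optimizer-induced simplicity bias.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 20, 200
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)

    w = np.zeros(d)
    lr = 1e-3                                 # small enough for convergence here
    for _ in range(5000):                     # plain full-batch gradient descent
        w -= lr * X.T @ (X @ w - y)

    w_min = np.linalg.pinv(X) @ y             # minimum-norm interpolant, for comparison
    print(np.linalg.norm(w - w_min))          # ~0: GD found the min-norm solution

Whether and how an analogous implicit bias operates for SGD on deep nonlinear networks is exactly the part I don't think is settled.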

[0] https://arxiv.org/abs/1611.03530




Doesn't the simplicity bias come explicitly from regularization techniques such as dropout or an L2 norm penalty?


Those are not necessary to reproduce benign overfitting; [0] observes the same behavior with all explicit regularization turned off.



