> rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem.
How does deep learning do this? The last time I was deeply involved in machine learning, we used a penalized likelihood approach. To find a good model for data, you would optimize a cost function over model space, and the cost function was the sum of two terms: one quantifying the difference between model predictions and data, and the other quantifying the model's complexity. This framework encodes exactly a "soft preference for simpler solutions that are consistent with the data", but is that how deep learning works? I had the impression that the way complexity is penalized in deep learning was more complex, less straightforward.
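Concretely, the two-term cost I have in mind is something like this toy ridge-regression sketch (purely illustrative, not any particular library's API):

```python
import numpy as np

def penalized_cost(w, X, y, lam):
    """Two-term cost: data misfit plus an explicit complexity penalty (ridge example)."""
    data_fit = np.sum((X @ w - y) ** 2)   # how far the model's predictions are from the data
    complexity = lam * np.sum(w ** 2)     # penalise large weights, i.e. complex models
    return data_fit + complexity
```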
You're correct, and the term you're looking for is "regularisation".
There are two common ways of doing this:
* L1 or L2 regularisation: penalises models whose weight matrices are complex (in the sense of having lots of large elements)
* Dropout: train on random subsets of the neurons to force the model to rely on simple representations that are distributed robustly across its weights (both are sketched in code just below)
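For concreteness, a minimal PyTorch sketch of both (the layer sizes and hyperparameters are arbitrary; `weight_decay` is how most optimizers expose the L2 penalty):

```python
import torch
import torch.nn as nn

# A small MLP with dropout between layers.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zero half the activations on each training step
    nn.Linear(256, 10),
)

# weight_decay adds an L2 penalty on the weights to every update.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)
```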
Dropout is roughly equivalent to a layer-specific (adaptive) form of L2 regularization, and it's easy to see why: in expectation, dropping out random neurons penalises the fit in proportion to the (squared) magnitude of the weights, which amounts to shrinking them towards zero.
Trevor Hastie's Elements of Statistical Learning has a nice proof that (for linear models) L2 regularization is also semi-equivalent to dimensionality reduction, which you could use to motivate a "simplicity prior" idea in deep learning.
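A small numpy sketch of that ridge result (random data, purely illustrative): ridge shrinks each principal direction of the inputs by a factor d_j^2 / (d_j^2 + lambda), so low-variance directions are damped almost to zero, a soft form of dimensionality reduction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = rng.standard_normal(100)
lam = 10.0

# Ridge via the SVD of X: each principal direction j is shrunk by d_j^2 / (d_j^2 + lam),
# so low-variance directions are damped almost to zero; the "effective" dimension
# is sum_j d_j^2 / (d_j^2 + lam).
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrinkage = d**2 / (d**2 + lam)
beta_ridge = Vt.T @ (shrinkage / d * (U.T @ y))
effective_dim = shrinkage.sum()
print(effective_dim)   # strictly less than 10, shrinking further as lam grows
```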
Yet another way of thinking about it, in the context of ReLU units, is that a layer of ReLUs forms a truncated hyper-plane basis (like splines but in higher dimensions) in feature space, and regularization induces smoothness in this N-dimensional basis by shrinking it towards a flat hyper-plane.
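A 1-D caricature of that picture (toy code, nothing rigorous): each ReLU is a hinge at a knot, the layer is a piecewise-linear spline, and shrinking the output weights flattens the elbows.

```python
import numpy as np

# Each hidden unit contributes a hinge a_k * max(0, x - b_k); the output weights a_k
# control how sharply the function bends at each knot b_k. An L2 penalty on the a_k
# shrinks those bends towards zero, i.e. towards a flat (smoother) function.
def relu_spline(x, knots, a):
    hinges = np.maximum(0.0, x[:, None] - knots[None, :])  # truncated "hyper-plane" basis
    return hinges @ a

x = np.linspace(-1.0, 1.0, 5)
knots = np.array([-0.5, 0.0, 0.5])
a = np.array([1.0, -2.0, 1.5])
print(relu_spline(x, knots, a))        # sharp elbows at the knots
print(relu_spline(x, knots, 0.1 * a))  # shrunken weights: nearly flat
```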
Wow! I think I dimly intuited your first paragraph already; I directionally get why your second might be true (although I'd have thought L1 was even more so, since it encourages zeros which is kind of like choosing a subspace).
Your third paragraph took me ages to get an intuition for - is the idea that regularisation penalises having "sharp elbows" at the join points of your hyper-spline thing? That's mind-blowing and such an interesting way to think about what a ReLU layer is doing.
Thanks so much for a thought-provoking comment, that's incredibly cool.
The closed-form solution to the (scalar) L1 regularization problem is actually a shifted form of the classical ReLU nonlinearity used in deep learning. I'm not sure if similar results hold for other nonlinearities, but this gave me good intuition for what thresholding is doing mathematically!
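To spell out what I mean (a toy one-dimensional sketch; the general case needs more care): the minimiser of 0.5*(w - z)^2 + lam*|w| is the soft-thresholding operator, which is literally a difference of two ReLUs.

```python
import numpy as np

# Soft thresholding: the closed-form solution of
#   argmin_w 0.5*(w - z)**2 + lam*|w|
# written as a difference of two ReLUs.
def soft_threshold(z, lam):
    relu = lambda t: np.maximum(t, 0.0)
    return relu(z - lam) - relu(-z - lam)   # = sign(z) * max(|z| - lam, 0)

z = np.array([-2.0, -0.3, 0.0, 0.3, 2.0])
print(soft_threshold(z, 0.5))   # small values snap to exactly zero
```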
Vision transformers have a more flexible hypothesis space, but they tend to have worse sample complexity than convolutional networks, which have a strong architectural inductive bias. A "soft inductive bias" would be something like what this paper does, where they use a special scheme for initializing vision transformers. Schemes like initialization that nudge the model towards the right kind of solution without excessively constraining it amount to a soft preference for simpler solutions.
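As a purely hypothetical illustration of a "soft" bias via initialization (this is not the cited paper's scheme, just a sketch of the general idea): start attention with a learnable bias that favours nearby tokens, convolution-like locality, but let training overwrite it if the data disagrees.

```python
import torch
import torch.nn as nn

# Hypothetical sketch only: bias attention towards nearby tokens at initialization,
# but keep the bias learnable so the data can override it.
seq_len = 16
positions = torch.arange(seq_len)
distance = (positions[:, None] - positions[None, :]).abs().float()

locality_bias = nn.Parameter(-0.5 * distance)  # initialized to prefer local attention

def biased_attention_scores(q, k):
    # standard scaled dot-product scores plus the (learnable) locality bias
    return q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5 + locality_bias

q = torch.randn(seq_len, 32)
k = torch.randn(seq_len, 32)
scores = biased_attention_scores(q, k)  # (16, 16) attention logits
```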
I'm not a guru myself, but I'm sure someone will correct me if I'm wrong. :-)
The usual approach to supervised ML is to "invent" the model (layers, their parameters), or more often copy one from a known good reference, then define the cost function and feed it data. "Deep" learning just means that instead of a few layers you use a large number of them.
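Something like this sketch (arbitrary sizes, just to make the "more layers" point concrete):

```python
import torch.nn as nn

# The recipe is the same either way: build the model, define a cost, feed it data.
# "Deep" just means many more layers.
def mlp(depth, width=128, n_in=784, n_out=10):
    layers = [nn.Linear(n_in, width), nn.ReLU()]
    for _ in range(depth - 2):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, n_out))
    return nn.Sequential(*layers)

shallow = mlp(depth=3)
deep = mlp(depth=30)  # same idea, just a much bigger stack
```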
What you describe sounds like an automated way of tweaking the architecture, IIUC? I've never done that; usually the cost of a run was too high to let an algorithm do it for me. But I'm curious: is this approach being used?
Yeah, it's straightforward to reproduce the results of the paper whose conclusion they criticize, "Understanding deep learning requires rethinking generalization", without any (explicit) regularization or anything else that can be easily described as a "soft preference for simpler solutions".
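For what it's worth, a minimal sketch of the kind of experiment that paper (Zhang et al.) runs, here full-batch on synthetic data rather than their CIFAR-10 setup: an over-parameterized net with no explicit regularization still drives training loss on random labels to near zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 100)
y = torch.randint(0, 10, (512,))   # labels are pure noise

# Over-parameterized MLP, no weight decay, no dropout.
model = nn.Sequential(nn.Linear(100, 2048), nn.ReLU(), nn.Linear(2048, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print(loss.item())   # memorizes the noise: training loss close to zero
```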