> rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem.
How does deep learning do this? The last time I was deeply involved in machine learning, we used a penalized likelihood approach. To find a good model for data, you would optimize a cost function over model space, and the cost function was the sum of two terms: one quantifying the difference between model predictions and data, and the other quantifying the model's complexity. This framework encodes exactly a "soft preference for simpler solutions that are consistent with the data", but is that how deep learning works? I had the impression that the way complexity is penalized in deep learning was more complex, less straightforward.
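Concretely, the two-term cost I have in mind is something like this toy ridge-regression sketch (purely illustrative, not any particular library's API):

```python
import numpy as np

def penalized_cost(w, X, y, lam):
    """Two-term cost: data misfit plus an explicit complexity penalty (ridge example)."""
    data_fit = np.sum((X @ w - y) ** 2)   # how far the model's predictions are from the data
    complexity = lam * np.sum(w ** 2)     # penalise large weights, i.e. complex models
    return data_fit + complexity
```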
You're correct, and the term you're looking for is "regularisation".
There are two common ways of doing this:
* L1 or L2 regularisation: penalises models whose weight matrices are complex (in the sense of having lots of large elements)
* Dropout: train on random subsets of the neurons to force the model to rely on simple representations that are distributed robustly across its weights (both are sketched in code just below)
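For concreteness, a minimal PyTorch sketch of both (the layer sizes and hyperparameters are arbitrary; `weight_decay` is how most optimizers expose the L2 penalty):

```python
import torch
import torch.nn as nn

# A small MLP with dropout between layers.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zero half the activations on each training step
    nn.Linear(256, 10),
)

# weight_decay adds an L2 penalty on the weights to every update.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)
```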
Dropout is roughly equivalent to a layer-specific (adaptive) form of L2 regularization, and it's easy to see why: in expectation, dropping out random neurons penalises the fit in proportion to the (squared) magnitude of the weights, which amounts to shrinking them towards zero.
Trevor Hastie's Elements of Statistical Learning has a nice proof that (for linear models) L2 regularization is also semi-equivalent to dimensionality reduction, which you could use to motivate a "simplicity prior" idea in deep learning.
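A small numpy sketch of that ridge result (random data, purely illustrative): ridge shrinks each principal direction of the inputs by a factor d_j^2 / (d_j^2 + lambda), so low-variance directions are damped almost to zero, a soft form of dimensionality reduction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = rng.standard_normal(100)
lam = 10.0

# Ridge via the SVD of X: each principal direction j is shrunk by d_j^2 / (d_j^2 + lam),
# so low-variance directions are damped almost to zero; the "effective" dimension
# is sum_j d_j^2 / (d_j^2 + lam).
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrinkage = d**2 / (d**2 + lam)
beta_ridge = Vt.T @ (shrinkage / d * (U.T @ y))
effective_dim = shrinkage.sum()
print(effective_dim)   # strictly less than 10, shrinking further as lam grows
```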
Yet another way of thinking about it, in the context of ReLU units, is that a layer of ReLUs forms a truncated hyper-plane basis (like splines but in higher dimensions) in feature space, and regularization induces smoothness in this N-dimensional basis by shrinking it towards a flat hyper-plane.
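A 1-D caricature of that picture (toy code, nothing rigorous): each ReLU is a hinge at a knot, the layer is a piecewise-linear spline, and shrinking the output weights flattens the elbows.

```python
import numpy as np

# Each hidden unit contributes a hinge a_k * max(0, x - b_k); the output weights a_k
# control how sharply the function bends at each knot b_k. An L2 penalty on the a_k
# shrinks those bends towards zero, i.e. towards a flat (smoother) function.
def relu_spline(x, knots, a):
    hinges = np.maximum(0.0, x[:, None] - knots[None, :])  # truncated "hyper-plane" basis
    return hinges @ a

x = np.linspace(-1.0, 1.0, 5)
knots = np.array([-0.5, 0.0, 0.5])
a = np.array([1.0, -2.0, 1.5])
print(relu_spline(x, knots, a))        # sharp elbows at the knots
print(relu_spline(x, knots, 0.1 * a))  # shrunken weights: nearly flat
```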
Wow! I think I dimly intuited your first paragraph already; I directionally get why your second might be true (although I'd have thought L1 was even more so, since it encourages zeros which is kind of like choosing a subspace).
Your third paragraph took me ages to get an intuition for - is the idea that regularisation penalises having "sharp elbows" at the join points of your hyper-spline thing? That's mind-blowing and such an interesting way to think about what a ReLU layer is doing.
Thanks so much for a thought-provoking comment, that's incredibly cool.
The closed-form solution to the (scalar) L1 regularization problem is actually a shifted form of the classical ReLU nonlinearity used in deep learning. I'm not sure if similar results hold for other nonlinearities, but this gave me good intuition for what thresholding is doing mathematically!
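To spell out what I mean (a toy one-dimensional sketch; the general case needs more care): the minimiser of 0.5*(w - z)^2 + lam*|w| is the soft-thresholding operator, which is literally a difference of two ReLUs.

```python
import numpy as np

# Soft thresholding: the closed-form solution of
#   argmin_w 0.5*(w - z)**2 + lam*|w|
# written as a difference of two ReLUs.
def soft_threshold(z, lam):
    relu = lambda t: np.maximum(t, 0.0)
    return relu(z - lam) - relu(-z - lam)   # = sign(z) * max(|z| - lam, 0)

z = np.array([-2.0, -0.3, 0.0, 0.3, 2.0])
print(soft_threshold(z, 0.5))   # small values snap to exactly zero
```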
Vision transformers have a more flexible hypothesis space, but they tend to have worse sample complexity than convolutional networks, which have a strong architectural inductive bias. A "soft inductive bias" would be something like what this paper does, where they use a special scheme for initializing vision transformers. Schemes like initialization that nudge the model towards the right kind of solution without excessively constraining it amount to a soft preference for simpler solutions.
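As a purely hypothetical illustration of a "soft" bias via initialization (this is not the cited paper's scheme, just a sketch of the general idea): start attention with a learnable bias that favours nearby tokens, convolution-like locality, but let training overwrite it if the data disagrees.

```python
import torch
import torch.nn as nn

# Hypothetical sketch only: bias attention towards nearby tokens at initialization,
# but keep the bias learnable so the data can override it.
seq_len = 16
positions = torch.arange(seq_len)
distance = (positions[:, None] - positions[None, :]).abs().float()

locality_bias = nn.Parameter(-0.5 * distance)  # initialized to prefer local attention

def biased_attention_scores(q, k):
    # standard scaled dot-product scores plus the (learnable) locality bias
    return q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5 + locality_bias

q = torch.randn(seq_len, 32)
k = torch.randn(seq_len, 32)
scores = biased_attention_scores(q, k)  # (16, 16) attention logits
```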
I'm not a guru myself, but I'm sure someone will correct me if I'm wrong. :-)
The usual approach to supervised ML is to "invent" the model (layers, their parameters), or more often copy one from a known good reference, then define the cost function and feed it data. "Deep" learning just means that instead of a few layers you use a large number of them.
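Something like this sketch (arbitrary sizes, just to make the "more layers" point concrete):

```python
import torch.nn as nn

# The recipe is the same either way: build the model, define a cost, feed it data.
# "Deep" just means many more layers.
def mlp(depth, width=128, n_in=784, n_out=10):
    layers = [nn.Linear(n_in, width), nn.ReLU()]
    for _ in range(depth - 2):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, n_out))
    return nn.Sequential(*layers)

shallow = mlp(depth=3)
deep = mlp(depth=30)  # same idea, just a much bigger stack
```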
What you describe sounds like an automated way of tweaking the architecture, IIUC? I've never done that; usually the cost of a run was too high to let an algorithm do it for me. But I'm curious: is this approach being used?
Yeah, it's straightforward to reproduce the results of the paper whose conclusion they criticize, "Understanding deep learning requires rethinking generalization", without any (explicit) regularization or anything else that can be easily described as a "soft preference for simpler solutions".
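For what it's worth, a minimal sketch of the kind of experiment that paper (Zhang et al.) runs, here full-batch on synthetic data rather than their CIFAR-10 setup: an over-parameterized net with no explicit regularization still drives training loss on random labels to near zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 100)
y = torch.randint(0, 10, (512,))   # labels are pure noise

# Over-parameterized MLP, no weight decay, no dropout.
model = nn.Sequential(nn.Linear(100, 2048), nn.ReLU(), nn.Linear(2048, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print(loss.item())   # memorizes the noise: training loss close to zero
```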