The distinction that the OP is looking for is generative vs. discriminative inference.
Generative inference: We know the process that generates our observed data and can model it to sufficient precision, such that we can establish a forward synthesis: parameters in, synthesised data out. With that we can compare a dataset synthesised from a given configuration of parameters with the real observed data and, using Bayes' rule (e.g. via MAP estimation), infer the configuration of parameters that most likely generated our observed data, which is the answer we're looking for. Examples: Kalman filters, linear regression.
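A minimal sketch of that forward-synthesis idea using the linear regression example (MAP with a Gaussian prior on the weights, which works out to ridge regression; the data and prior strength here are just illustrative):

    import numpy as np

    # Forward model: parameters in, synthesised data out.
    rng = np.random.default_rng(0)
    w_true = np.array([2.0, -1.0])
    X = rng.normal(size=(100, 2))
    y = X @ w_true + rng.normal(scale=0.5, size=100)   # synthesised observations

    # MAP inference: with a Gaussian prior on w, the most likely parameter
    # configuration given the data is the ridge-regression solution.
    lam = 1.0  # noise-variance / prior-variance ratio (assumed for illustration)
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    print(w_map)  # should land close to w_true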
Discriminative inference: We do not know the process that generates the observed data, or it's too difficult to model. In that case we ask machines to take a shot at modelling it. To do that we set up a parametric data transformation pipeline (e.g. a neural net) and feed it lots of correct input/output pairs. Over time the model learns how to transform the input (observed data) into the output (the answer we're looking for) and hopefully generalises well when we feed it new input data that it hasn't seen before.
Examples: NNs for classification.
Of course there are interesting mixtures and hybrids between the two and the lines are blurred in some cases but this is the general distinction.
For me, the difference between ML and most of what we use stats for in classical research settings (Bayesian or otherwise) has less to do with the specific methodology chosen, and more to do with the purpose of the exercise.
With ML, the utility of the model is largely a function of its predictive power. You want to accomplish some task. "Make it go"
With much of statistics in a research context (such as the social sciences as called out in the link), the interest is more on explanatory power of the independent variables. Most social scientists would happily trade a bit of predictive power for a more explanatory model that neatly maps back to a set of hypotheses. "Why does it go?"
There’s a nice 2010 article by Galit Shmueli [0] called "To explain or to predict?" that explores this distinction in depth, for anyone here interested in learning more.
For me, the gradual build-up from (1) maximum likelihood estimation to (2) maximum a posteriori estimation to (3) full posterior approximation (or posterior sampling) was helpful for understanding where Bayesian methods fit in machine learning. Here’s a great video series by Erik Bekkers, who is at the University of Amsterdam. It assumes solid knowledge of calculus and linear algebra and takes you through the math and intuition of all the fundamental ML methods: https://youtube.com/playlist?list=PL8FnQMH2k7jzhtVYbKmvrMyXD...
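To make that (1) → (2) → (3) progression concrete, here's a toy Beta-Bernoulli sketch (the prior and data are made up purely for illustration):

    from scipy.stats import beta

    # Coin with unknown heads-probability theta, Beta(2, 2) prior,
    # observed data: 7 heads in 10 flips.
    a, b = 2, 2
    heads, n = 7, 10

    theta_mle = heads / n                          # (1) maximum likelihood
    theta_map = (heads + a - 1) / (n + a + b - 2)  # (2) maximum a posteriori
    posterior = beta(a + heads, b + n - heads)     # (3) full posterior: Beta(9, 5)

    print(theta_mle, theta_map, posterior.interval(0.95))  # point estimates vs. 95% credible interval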
This is very close to my differentiation between the two.
Machine learning fundamentally cares about model performance on other data, like validation or test sets. It is looking for models that perform well, but that do not necessarily model the underlying process.
Bayesian statistics, like most statistics, wants to accurately estimate parameters in the model. It cares most about models that are portraying the underlying data-generating process.
This touches a little on a philosophical difference, but I have found that one major difference is the quantification of uncertainty. Bayesian models, being naturally built around modeling a probability distribution, easily enable the researcher to make claims such as "I am 95% certain the mean lies between x and y, given the data." On the other hand, neural networks, decision trees, and other models more associated with ML than with Bayesian statistics do not have this capability built in. (Though there are variations and techniques to build confidence intervals and such with them.)
This isn't true. When you fit a neural network, you are almost always fitting a probability distribution (Bernoulli for binary outcomes, normal for numeric outcomes, etc.), all of which can provide probability estimates.
When a model for a binary outcome returns 0.9 for a given data point, that implies a 90% probability that the value is true.
Evaluating the quality of these estimates (often called measuring the calibration of the model) is even quite common.
(There are some exceptions of course. Max-margin models aren't probabilistic. And sometimes people use fixed variance parameters for their normal models, etc.)
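To make the calibration point concrete, here's roughly what measuring it looks like (a logistic regression stands in for a network; the data and bin count are arbitrary):

    import numpy as np
    from sklearn.calibration import calibration_curve
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Fit a probabilistic classifier, then compare predicted probabilities
    # with the observed frequency of the positive class, bin by bin.
    X, y = make_classification(n_samples=2000, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    prob_pred = clf.predict_proba(X_te)[:, 1]

    frac_pos, mean_pred = calibration_curve(y_te, prob_pred, n_bins=10)
    print(np.round(mean_pred, 2))  # mean predicted probability per bin
    print(np.round(frac_pos, 2))   # observed frequency per bin (close = well calibrated)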
This isn’t what your parent is saying. Many machine learning models are capable of producing calibrated probabilities. What Bayesian models give on top of this is that one doesn’t just predict a probability p, but a posterior distribution for p. This allows for estimating confidence bands around p that quantify one’s uncertainty in the estimate. This is useful for assessing data drift and detecting anomalous examples. One can also get such uncertainty bands from non-Bayesian methods by considering ensembles of models. See the review paper by Abdar et al. [0] for more info.
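A rough sketch of the ensemble flavour of this, using scikit-learn MLPs from different random initialisations as stand-ins (sizes and seeds are arbitrary):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    # Train several identical networks from different random initialisations
    # and use the spread of their predictions as an uncertainty band.
    models = [MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=s).fit(X, y)
              for s in range(5)]
    probs = np.stack([m.predict_proba(X[:3])[:, 1] for m in models])

    p_mean, p_std = probs.mean(axis=0), probs.std(axis=0)  # estimate and its uncertainty band
    print(np.round(p_mean, 3), np.round(p_std, 3))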
If one only had such an estimate for a single example, this would be true, but in aggregate over many predictions, the uncertainty bands can be useful for decision making. This is an active area of research.
Sure, but you do need to take into account that often fitting a neural network consists of finding the maximum likelihood estimate. So from a Bayesian perspective you're ignoring the prior and you risk overfitting by not considering anything but the most likely alternative. Most attempts to avoid overfitting do not really translate well to the Bayesian perspective.
You can actually recover from this a bit. I saw a paper once where they used the Hessian to approximate the posterior as a Gaussian distribution around the maximum likelihood estimate. Can't remember what that paper was called, unfortunately.
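That sounds like what's usually called the Laplace approximation. A toy 1-D sketch of the idea (flat prior, known unit noise variance, numbers made up):

    import numpy as np
    from scipy.optimize import minimize_scalar

    data = np.random.default_rng(0).normal(loc=1.5, scale=1.0, size=20)

    # Negative log-posterior for the unknown mean (flat prior, unit-variance Gaussian likelihood).
    def neg_log_post(mu):
        return 0.5 * np.sum((data - mu) ** 2)

    # Find the mode, then approximate the posterior by a Gaussian whose variance
    # is the inverse of the curvature (Hessian) of the negative log-posterior there.
    mu_hat = minimize_scalar(neg_log_post).x
    h = 1e-4
    curvature = (neg_log_post(mu_hat + h) - 2 * neg_log_post(mu_hat) + neg_log_post(mu_hat - h)) / h ** 2
    sigma_approx = 1.0 / np.sqrt(curvature)   # posterior roughly N(mu_hat, sigma_approx^2)
    print(mu_hat, sigma_approx)               # about the sample mean, and 1/sqrt(n)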
Probability estimates are not the same thing as uncertainty.
Consider tossing a coin. If I see 2 heads and 2 tails, I might report "the probability of heads is 50%". If you see 2000 heads and 2000 tails you'd also report the SAME probability estimate -- but you'd be more certain than me.
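In numbers, with a flat Beta(1, 1) prior (just to make the point concrete):

    from scipy.stats import beta

    small = beta(1 + 2, 1 + 2)        # 2 heads, 2 tails
    large = beta(1 + 2000, 1 + 2000)  # 2000 heads, 2000 tails

    # Same point estimate, very different certainty about it.
    print(small.mean(), small.interval(0.95))   # 0.5, roughly (0.15, 0.85)
    print(large.mean(), large.interval(0.95))   # 0.5, roughly (0.48, 0.52)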
Neural networks give probability estimates. Bayesian methods (and also frequentist methods) give us probability estimates AND uncertainty.
The literature on neural network calibration seems to me to have missed this distinction.
It is common for a network to output a distribution, so the output is both a mean and a variance instead of just a mean, as you pointed out. For example, check out variational autoencoders.
In my example, of predicting a coin toss, the naive output is a probability distribution: it's "Prob(heads)=0.5, Prob(tails)=0.5". This is the distribution that will be produced both by the person who sees 2 heads and 2 tails, and by the person who sees 2000 heads and 2000 tails.
Bayesians use the terms 'aleatoric' and 'epistemic' uncertainty. Aleatoric uncertainty is the part of uncertainty that says "I don't know the outcome, and I wouldn't know it even if I knew the exact model parameters", and epistemic uncertainty says "I don't even know the model".
Your example (outputting a mean and variance) is reporting a probability distribution, and it captures aleatoric uncertainty. When Bayesians talk about uncertainty or confidence, they're referring to model uncertainty -- how confident are you about the mean and the variance that you're reporting?
Right, the claim was that "Neural networks give probability estimates. Bayesian methods give us probability estimates AND uncertainty" which presents a false dichotomy. I think we agree.
Ah yes, got you. It is a false dichotomy because it neglects that there’s such a thing as Bayesian neural networks. Also, taking ensembles of ordinary neural networks with random initializations approximates Bayesian inference in a sense and this is relatively well known I think.
Indeed, there are Bayesian neural networks and there are non-Bayesian neural networks, and I shouldn't have implied that all neural networks are non-Bayesian.
I'm just trying to point out that there is a dichotomy between the Bayesian and the non-Bayesian, and that the standard neural network models are non-Bayesian, and that we need Bayesianism (or something like it) to talk about (epistemic) uncertainty.
Standard neural networks are non-Bayesian, because they do not treat the neural network parameters as random variables. This includes most of the examples that have been mentioned in this thread: classifiers (which output a probability distribution over labels), networks that estimate mean and variance, and VAEs (which use Bayes's rule for the latent variable but not for the model parameters). These networks all deal with probability distributions, but that's not enough for us to call them Bayesian.
Bayesian neural networks are easy, in principle -- if we treat the edge weights of a neural network as having a distribution, then the entire neural network is Bayesian. And as you say these can be approximated, e.g. by using dropout at inference time [0], or by careful use of ensemble methods [1].
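A minimal PyTorch sketch of the dropout-at-inference-time idea (toy architecture, all sizes made up):

    import torch
    import torch.nn as nn

    # Toy regressor with dropout; keeping dropout stochastic at prediction time
    # and averaging many forward passes approximates Bayesian predictive uncertainty.
    model = nn.Sequential(
        nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.2),
        nn.Linear(64, 1),
    )

    def mc_dropout_predict(model, x, n_samples=50):
        model.train()  # keep dropout layers active
        with torch.no_grad():
            preds = torch.stack([model(x) for _ in range(n_samples)])
        return preds.mean(dim=0), preds.std(dim=0)  # predictive mean and (epistemic) spread

    x = torch.randn(5, 10)
    mean, std = mc_dropout_predict(model, x)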
Quote: "Deep learning tools have gained tremendous attention in applied machine learning. However such tools for regression and classification do not capture model uncertainty."
Quote: "Ensembling NNs provides an easily implementable, scalable method for uncertainty quantification, however, it has been criticised for not being Bayesian."
Yeah right, in my experience I haven't needed as many networks in the ensemble as I first assumed. This paper [1] suggests 5-10, but in practice I've found only 3 has often been sufficient.
Depends on the loss function. A softmax final activation into a cross-entropy loss (or KL divergence) gives probability-like predictions. This is a very common setup, but there are many others that don’t have this property. I figure that’s what you mean by ‘almost always’. You can also use variational inference, where you predict a distribution (usually Gaussian, so a sigmoid activation with two values per prediction) and use a Wasserstein loss function, and this can be used to get confidence intervals among other things.
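For the 'predict a distribution' part, one common concrete setup is a two-output head trained with a Gaussian negative log-likelihood (a stand-in for the Wasserstein variant mentioned above; architecture and sizes are just illustrative):

    import torch
    import torch.nn as nn

    # Two outputs per prediction: a mean and a (positive) variance.
    net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))

    x = torch.randn(16, 8)
    y = torch.randn(16, 1)

    out = net(x)
    mean = out[:, :1]
    var = nn.functional.softplus(out[:, 1:]) + 1e-6  # keep the variance positive

    loss = nn.GaussianNLLLoss()(mean, y, var)  # Gaussian negative log-likelihood
    loss.backward()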
I've seen the term "machine learning" used without criticism to refer to such a wide range of things that it seems pretty much synonymous now with multivariate discrimination and classification. At one time I would have said "machine learning" meant "deep learning models" but then it seemed to mean "computational multivariate discrimination and classification" and now the "computational" doesn't even apply.
I personally believe "machine learning" is a term that could be removed from use and nothing would suffer (and clarity might even be improved). I feel somewhat similarly about "data science" although at least that captures some intersection of database engineering and computational statistics that is useful when discussing very large data problems.
> seen the term "machine learning" used without criticism
yes
> would have said "machine learning" meant "deep learning models"
no not at all.. Deep learning has gotten serious technical press since about 2015, when certain competitions yielded better-than-average-human performance. DL is also closely coupled to the business model of massive cloud providers who handle streams of digital media. Yet ML stats continue to predict reliably well (eighty percent plus) on lots of kinds of problems.
> it seemed to mean "computational multivariate discrimination and classification"
no, classification has always been one of the two main uses for ML, the other being regression-style prediction of real values.
> believe "machine learning" is a term that could be removed from use
no, stats are useful and continue to be useful. It is AI that is the really wonky term, to my ear.
Please consider that there is a lot of difference between academic statistical methods, and technical press product hype. You may be suggesting a change in media coverage jargon, but you know, good luck with that..
> At one time I would have said "machine learning" meant "deep learning models"
The early ML models used SVMs, Logistic Regression and Naive Bayes. Not deep at all. What made them ML models was the size of the feature sets and the data, and the use of automated feature selection.
In my non-expert opinion, it is a difference in what the model tries to predict, at least in some contexts.
For example, in Reinforcement Learning the model predicts P(R | A, S), i.e. the probability of reward given action and state, while a Bayesian network models P(A | B, C), the probability of event A given B and C; in such networks you can do inference.
I have no idea how correct or false this explanation is.
Machine learning is broader than Bayesian statistics. Many Bayesian statistics methods can also be called machine learning, but less so the other way around.
Scikit-learn has a bunch of machine learning routines, including kNN, gradient boosting, decision trees and so on. There is even a model zoo there. I'd say Bayesian Gaussian mixture models are in the same class of algorithms. Bayesian models have their own peculiarities and statistical foundations, but it isn't far-fetched to imagine them inside scikit-learn, considered just another machine learning routine.
Yes, there is a BayesianGaussianMixture class in sklearn.mixture [0]. There’s also sklearn.gaussian_process [1] which offers Gaussian process classification and regression, Bayesian learning algorithms which can be thought of as analogs of the support vector machine. Rasmussen’s Gaussian Processes for Machine Learning (2006) is a great introduction despite being a little old. [2].
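Both live right next to the usual estimators, e.g. (toy data, default-ish settings):

    from sklearn.datasets import make_moons
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.mixture import BayesianGaussianMixture

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # Variational Bayesian mixture model, fit like any other sklearn estimator.
    bgm = BayesianGaussianMixture(n_components=5, random_state=0).fit(X)

    # Gaussian process classifier: a Bayesian analogue of an SVM-style classifier.
    gpc = GaussianProcessClassifier(random_state=0).fit(X, y)
    print(gpc.predict_proba(X[:3]))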
> Books on modeling often jump right into math and methods. Drowned in detail, it can take years to appreciate the assumptions and limitations of the various modeling mindsets.
> Written in a clear and concise style, Modeling Mindsets introduces approaches such as Bayesian inference, supervised learning, causal inference, and more.
> After reading this book, you will have a much better understanding of the different approaches to modeling and be able to choose the right one for your problem.
Edit: Ah darn, I forgot about this part: "You should feel comfortable with at least one of the mindsets in this book". So perhaps not the best start if you don't have a base in at least one method. For frequentist statistics, consider https://www.openintro.org/book/os/