Bayesian modeling has a somewhat distinct feel from both (typical) deep learning algorithms and boosting/bagging classifiers.

Most particularly, Bayesian modeling tends to be generative rather than discriminative. This means that you construct your model by describing a process that generates your observed data from a set of latent/unknown quantities.

For instance, we might observe n[u, d] clicks from user u on day d, for various choices of u and d. We could build a variety of generative stories here: that n[u, d] is independent of u and d, just being a random draw from a Normal(mu, sigma) distribution; that n[u, d] incorporates another unknown parameter p[u], the user's propensity to click, and is a random draw from Normal(mu + b p[u], sigma); or that we also include seasonal trends sm[d] and ss[d] in both the mean and spread of n[u, d], saying it's Normal(mu + b p[u] + sm[d], sigma * ss[d]).
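
To make the "forward" direction concrete, here's a minimal Stan sketch that just simulates data from the richest of those stories, with the latents passed in as fixed inputs. The data layout (U users, D days) and treating clicks as real-valued are my own illustrative assumptions, not something from the story itself:

    // Hypothetical forward simulation; all names and shapes are illustrative.
    data {
      int<lower=1> U;           // number of users
      int<lower=1> D;           // number of days
      real mu;                  // baseline mean
      real<lower=0> sigma;      // baseline spread
      real b;                   // weight on user propensity
      vector[U] p;              // per-user propensity to click
      vector[D] sm;             // seasonal shift in the mean
      vector<lower=0>[D] ss;    // seasonal scaling of the spread
    }
    generated quantities {
      matrix[U, D] n_sim;
      for (u in 1:U)
        for (d in 1:D)
          n_sim[u, d] = normal_rng(mu + b * p[u] + sm[d], sigma * ss[d]);
    }

Running this under Stan's fixed-parameter sampler draws whole datasets from the story, which is a useful sanity check before any inference happens.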

In these examples, the unknown latents are parameters like mu, sigma, and b as well as any latent data needed to give shape to p[-], sm[-], and ss[-]. Once we've posited the structure of this generative model, we'd like to infer what values those latents might take as informed by the data.

This is the bread and butter of Stan modeling. It lets you describe these generative models as a "forward" process where we sample latents in a simple forward program. Similar to TensorFlow et al., Stan extracts a DAG from this forward program and computes derivatives, but instead of simply maximizing an objective function through backprop, Stan uses these derivatives to drive a sampling algorithm over the latents (mu, sigma, b).
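
For instance, the third story above might look roughly like this as a Stan model. The priors here are arbitrary weakly-informative placeholders I've made up for the sketch, not part of the original story:

    data {
      int<lower=1> U;
      int<lower=1> D;
      matrix[U, D] n;           // observed clicks per user per day
    }
    parameters {
      real mu;
      real<lower=0> sigma;
      real b;
      vector[U] p;              // user propensities
      vector[D] sm;             // seasonal trend in the mean
      vector<lower=0>[D] ss;    // seasonal trend in the spread
    }
    model {
      // Priors: arbitrary weakly-informative placeholder choices.
      mu ~ normal(0, 5);
      sigma ~ normal(0, 2);
      b ~ normal(0, 1);
      p ~ normal(0, 1);
      sm ~ normal(0, 1);
      ss ~ lognormal(0, 0.5);
      // The generative story itself:
      for (u in 1:U)
        for (d in 1:D)
          n[u, d] ~ normal(mu + b * p[u] + sm[d], sigma * ss[d]);
    }

Stan's sampler (NUTS, a variant of Hamiltonian Monte Carlo) then returns draws of (mu, sigma, b, p, sm, ss) from the posterior.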

Ultimately, this gives you a distribution over plausible latent configurations given the data you've observed. This posterior distribution is the central output of Bayesian modeling and can provide a lot of information beyond what an objective-maximizing point estimate would. As a simple example, it's trivial from a Bayesian output distribution to make statements like "we're 95% confident that mu > 0.1".
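
Concretely, that statement is just the fraction of posterior draws of mu that land above 0.1. One way to bake it into the model itself is an indicator in generated quantities (appended to a sketch like the one above); its posterior mean estimates the probability in question:

    generated quantities {
      // Posterior mean of this indicator estimates Pr(mu > 0.1 | data).
      int<lower=0, upper=1> mu_above = mu > 0.1;
    }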


