Stan is a state-of-the-art platform for statistical modeling

hendzen · on Dec 23, 2020

If you want to learn Stan I highly recommend the book Statistical Rethinking (2nd Ed) by Richard McElreath. It’s a pedagogical masterpiece and far and away the best resource I’ve found on learning Bayesian inference.

elsherbini · on Dec 23, 2020

Seconded. He has a full course on YouTube as well, and a free version of the textbook that is just missing the last chapter available on his website (the password is in the first or second lecture on YouTube).

https://www.youtube.com/watch?v=4WVelCswXo4&list=PLDcUM9US4X...

agravier · on Dec 26, 2020

First. blossom.

nextos · on Dec 23, 2020

Statistical Rethinking is not bad, but I think it's for people with backgrounds different than CS (or Math).

Personally, I think https://probmods.org/ is an exceptionally good introduction to probabilistic programming for someone that knows CS or just some programming and likes a SICP-like textbook that goes into the essence of the topic.

Learning Stan is great, but not as a first probabilistic programming language, because it's quite limited (it trades model expressiveness for performance). So you can't represent a large set of models, such as infinite mixtures, which may become really relevant in the future developments of deep learning. It also has poor performance in models that involve many discrete variables.

stevesimmons · on Dec 23, 2020

The Statistical Rethinking book uses R.

For people wanting Python, Jupyter notebooks with Python code examples are here:

* https://github.com/pymc-devs/resources/tree/master/Rethinkin...

glial · on Dec 23, 2020

Facebook’s (very good) Prophet forecasting library is a wrapper for Stan models.

https://facebook.github.io/prophet/

astrophysician · on Dec 24, 2020

Just to clarify: prophet implements a particular model with Stan as a backend, not Stan models generally.

wodenokoto · on Dec 23, 2020

On what kinds of problems does Stan / Bayesian inference beat the much more hyped Tensorflow / deep learning approach?

Often you hear that deep learning is best at unstructured data (images, sound and recently raw text) and boosted trees / XGBoost for tabular data.

credit_guy · on Dec 23, 2020

Both Bayesian inference and deep learning can do function fitting, i.e. given a number of observations y and explanatory variables x, you try to find a function so that y ~ f(x). The function f can have few parameters (e.g. f(x)= ax+b for linear regression) or millions of parameters (the usual case for deep learning). You can try to find the best value for each of these parameters, or admit that each parameter has some uncertainty and try to infer a distribution for it. The first approach uses optimization, and in the last decade, that's done via various flavors of gradient descent. The second uses Monte Carlo. When you have few parameters, gradient descent is smoking fast. Above a number of parameters (which is surprisingly small, let's say about 100), gradient descent fails to converge to the optimum, but in many cases gets to a place that is "good enough". Good enough to make the practical applications useful. In pretty much all cases though, Bayesian inference via MCMC is painfully slow compared to gradient descent.

But there is a case where it makes sense: when you have reasonably few parameters, and you can understand their meaning. And this is exactly the case of what's called "statistical models". That's why STAN is called a statistical modeling language.

How is that? Gradient descent for these small'ish models is just MLE (maximum likelihood estimation). People have been doing MLE for 100 years, and they understand the ins and outs of MLE. There are some models that are simply unsuited for MLE; their likelihood function is called "singular"; there are places where the likelihood becomes infinite despite the fit being quite poor. One way to fix that is to "regularize" the problem, i.e. to add some artificial penalty that does not allow the reward function to become infinite. But this regularization is often subjective. You never know when the penalty you add is small enough to not alter the final fit. Another way is to do Bayesian inference. It's very slow, but you don't get pulled towards the singular parameters.
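
To make the optimization-versus-sampling contrast concrete, here is a minimal sketch for the f(x) = ax + b case. The synthetic data, the known noise level, the flat prior and the plain random-walk Metropolis sampler are all assumptions chosen for illustration; Stan itself uses a much more efficient Hamiltonian Monte Carlo sampler.

```python
import math
import random
import statistics

random.seed(0)

# Synthetic data from y = a*x + b + noise (values are illustrative assumptions).
TRUE_A, TRUE_B, SIGMA = 2.0, -1.0, 0.5
xs = [i / 10 for i in range(50)]
ys = [TRUE_A * x + TRUE_B + random.gauss(0, SIGMA) for x in xs]


def log_lik(a, b):
    """Gaussian log-likelihood (up to a constant) with known noise SIGMA."""
    return -sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / (2 * SIGMA ** 2)


def fit_gradient_descent(steps=5000, lr=0.01):
    """Optimization: a single best value per parameter (the MLE here)."""
    a, b, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        grad_a = sum(-2 * x * (y - (a * x + b)) for x, y in zip(xs, ys)) / n
        grad_b = sum(-2 * (y - (a * x + b)) for x, y in zip(xs, ys)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


def sample_metropolis(start, steps=20000, step_size=0.05):
    """Sampling: a cloud of plausible (a, b) pairs under a flat prior."""
    a, b = start
    current = log_lik(a, b)
    draws = []
    for _ in range(steps):
        a_new = a + random.gauss(0, step_size)
        b_new = b + random.gauss(0, step_size)
        proposed = log_lik(a_new, b_new)
        # Accept with probability min(1, likelihood ratio).
        if random.random() < math.exp(min(0.0, proposed - current)):
            a, b, current = a_new, b_new, proposed
        draws.append((a, b))
    return draws


a_hat, b_hat = fit_gradient_descent()
draws = sample_metropolis((a_hat, b_hat))
post_a = [d[0] for d in draws]
post_mean_a = statistics.fmean(post_a)
post_sd_a = statistics.stdev(post_a)
print(f"point estimate a = {a_hat:.3f}")
print(f"posterior for a: mean {post_mean_a:.3f}, sd {post_sd_a:.3f}")
```

The optimizer returns one number per parameter, while the sampler returns a spread for each, at the price of many more likelihood evaluations.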

scottfr · on Dec 23, 2020

Stan: Predict the values of parameters in a model

Deep Learning: Predict an outcome variable

For example, if I want to know what effect household income has on a student's chance of getting into college, Stan would allow you to estimate that given a proposed model.

If instead I wanted to predict a given student's chance of getting into college, I might use Machine Learning.

Of course, those two problems are linked, but it's a fundamental difference of focus.

nightski · on Dec 23, 2020

While it is true that Bayesian inference is very powerful in that it allows one to introspect and view effects of the model's parameters on the outcome, it is equally as good at predicting the outcome variable as well. It just depends on what you want to get out of it. In fact you get more information about your outcome variable from Bayesian Inference as it is a distribution.

I'm not saying it is better than DL by any means, as DL can scale much better. Just that I don't think it's necessary to pigeonhole Bayesian inference to just predicting the parameters. In my opinion the "fundamental difference of focus" is just a personal decision, not something inherent to the method.

borroka · on Dec 23, 2020

The focus of statistical models (including Bayesian models) is on inference and uncertainty (both for parameter values and for predictions), the focus of ML models (including DL models) is on prediction and it is rarely possible to obtain any quantification of uncertainty.

jgalt212 · on Dec 23, 2020

> rarely possible to obtain any quantification of uncertainty.

Can't this be estimated via bootstrapping?

borroka · on Dec 24, 2020

It is very challenging for complex models to know what's the coverage (and also could be extremely computationally intensive).
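
For reference, the bootstrapping idea raised in the question can be sketched as below for a very simple statistic (a sample mean). The data, the resample count and the percentile-interval choice are assumptions for illustration; as noted above, whether such an interval has the advertised coverage for a complex model is a separate and harder question.

```python
import random
import statistics

random.seed(1)

# Illustrative data; any list of numbers would do.
data = [random.gauss(5.0, 2.0) for _ in range(200)]


def bootstrap_interval(sample, stat, n_resamples=2000, level=0.95):
    """Percentile bootstrap: recompute `stat` on resamples drawn with replacement."""
    stats = sorted(
        stat(random.choices(sample, k=len(sample))) for _ in range(n_resamples)
    )
    lo = stats[int((1 - level) / 2 * n_resamples)]
    hi = stats[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi


estimate = statistics.fmean(data)
lo, hi = bootstrap_interval(data, statistics.fmean)
print(f"mean = {estimate:.3f}, 95% bootstrap interval = ({lo:.3f}, {hi:.3f})")
```

Each resample requires refitting the statistic, which is where the computational cost comes from when the "statistic" is a large trained model.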

peteradio · on Dec 23, 2020

I guess Bayesian will tend to be underfit while DL may tend to overfit.

darthdeus · on Dec 23, 2020

Stan gives you the ability to do probabilistic reasoning. There is actually Tensorflow Probability (https://www.tensorflow.org/probability) which has a lot of overlapping algorithms, but isn't as mature and approaches some things differently.

The main difference is that with Stan you think in terms of random variables and distributions (and their transformations), while with Tensorflow/DL you think in terms of predicting directly from data. Stan lets you model a problem with probabilities and do arbitrary inference, generally asking any question you want about your model.

There are many other interesting alternatives, e.g. http://pyro.ai/ which takes yet another approach, merging DL and probabilistic programming with variational inference. (Stan and TFP can do variational inference too, but I guess it's like Python vs JavaScript vs Ruby vs Java - all of them can be used for programming, but not the same way).

usgroup · on Dec 23, 2020

The next cut of Stan will likely use TFP as a backend. I think that PyMC4 will also. The Stan team wrote everything from scratch in C++ including their own autodiff code which many regard as quite a stretch in terms of long term maintenance. Since TFP executes on top of Tensorflow things like autodiff and many of the other performance concerns that take up so much Stan-dev time are already taken care of.

abhgh · on Dec 23, 2020

PyMC4 on TFP was the plan, but they made a recent announcement [1] indicating those efforts would stop, and instead, they would develop PyMC3+JAX+Theano.

[1] https://pymc-devs.medium.com/the-future-of-pymc3-or-theano-i...

diab0lic · on Dec 23, 2020

Woah. Thanks for the link. As a PyMC3 user I was not looking forward to the transition to 4, expecting to have to relearn the API like the transition from 2 to 3. I was debating whether I should learn 4 or switch to a different library when all I really wanted to do was stick with 3.

Looks like I get the best of both worlds now.

lhomdee · on Dec 23, 2020

Please no, we don’t need Stan to be rebuilt with a Python backend. That it’s built in C++ and can be called with higher level APIs is part of the appeal.

tel · on Dec 23, 2020

Bayesian modeling has a somewhat distinct feeling to both (typical) deep learning algorithms and boosting/bagging classifiers.

Most particularly, Bayesian modeling tends to be generative modeling as opposed to discriminative. This means that you construct your model by describing a process which generates your observed data from a set of latent/unknown quantities.

For instance, we might observe that n[u, d] clicks are observed on user u on day d for various choices of u and d. We could build a variety of generative stories here: that n[u, d] is independent of u and d, just being a random draw from a Normal(mu, sigma) distribution; that n[u, d] incorporates another unknown parameter p[u], the user's propensity to click, and then is a random draw from Normal(mu + b p[u], sigma); or that we also include season trends sm[d] and ss[d] to both the mean and spread of n[u, d], saying it's Normal(mu + b p[u] + sm[d], sigma * ss[d]).

In these examples, the unknown latents are parameters like mu, sigma, and b as well as any latent data needed to give shape to p[-], sm[-], and ss[-]. Once we've posited the structure of this generative model, we'd like to infer what values those latents might take as informed by the data.

This is the bread and butter of Stan modeling. It lets you describe these generative models as a "forward" process where we sample latents in a simple forward program. Similar to Tensorflow/etc., Stan extracts from this forward program a DAG and computes derivatives, but instead of simply maximizing an objective function through backprop, Stan uses these derivatives to perform a sampling algorithm over the latents (mu, sigma, b).

Ultimately, this gives you a distribution of plausible latent configurations given the data you've observed. This distribution is a key point of Bayesian modeling and can provide a lot of information beyond what the objective-maximizing value would. As a simple example, it's trivial from a Bayesian output distribution to make statements like "we're 95% confident that mu > 0.1".
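
As a rough sketch of that last statement, take the simplest generative story above, where every count is a draw from Normal(mu, sigma). Assuming sigma is known and mu has a flat prior (both simplifications, as are the numbers used), the posterior for mu is available in closed form, so posterior draws can be produced directly; for the richer stories Stan would produce such draws by sampling.

```python
import math
import random
import statistics

random.seed(42)

# Forward (generative) story: each observation is Normal(mu, sigma).
TRUE_MU, SIGMA, N_OBS = 0.3, 1.0, 400  # illustrative values, not from the thread
observed = [random.gauss(TRUE_MU, SIGMA) for _ in range(N_OBS)]

# With sigma known and a flat prior on mu (assumptions), the posterior of mu
# is Normal(sample mean, sigma / sqrt(n)), so we can draw from it directly.
post_mean = statistics.fmean(observed)
post_sd = SIGMA / math.sqrt(N_OBS)
mu_draws = [random.gauss(post_mean, post_sd) for _ in range(10_000)]

# A probability statement is just the fraction of draws satisfying it.
prob_mu_above = sum(mu > 0.1 for mu in mu_draws) / len(mu_draws)
print(f"P(mu > 0.1 | data) is about {prob_mu_above:.3f}")
```

Any other question about mu (an interval, a comparison against another parameter) is answered the same way, by counting draws.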

usgroup · on Dec 23, 2020

Stan is exceptional if what you need is a hierarchical Bayesian model, and if what you want is a rigorous way of quantifying the uncertainty associated with the parameter selections in your model.

Stan users are more often R users than Python users and mostly come from science backgrounds. They often use Stan via a package called BRMS, which stands for "Bayesian Regression Models using Stan", which should give you some idea of its core use case.

You wouldn't use Stan if you weren't trying to model your problem as a distribution based probabilistic model.

nabla9 · on Dec 23, 2020

1) You have too little data for Deep Learning

2) You want to do statistical modelling, not a black box. You already have a statistical model in mind, you just want to fit parameters.

Stan is a probabilistic programming system. You describe the data-producing mechanism (the model of reality), and the level and form of approximations used in the estimation. The compiler generates code for the estimators.

celrod · on Dec 23, 2020

It's used a lot for things like analyzing clinical trials, e.g. making futility or early stopping calls in interims, or for meta-analysis. JAGS may still be the most popular, at least in some companies, but Stan is starting to catch on thanks to its greater flexibility in most respects.

lhomdee · on Dec 23, 2020

Other comments point to Bayesian inference being good for modelling an uncertain outcome, while deep learning is good for prediction.

However Bayesian inference is a good choice for prediction when you have few data points (deep learning is sample-size hungry). And it is especially good when you have high uncertainty in your labelled training data (ie large variance in the response variable for given input). Here a Bayesian regression (or even classification) model wouldn’t magically remove the uncertainty but rather you’d be able to account for the predictive variance (instead of being none-the-wiser using just good ole deep learning). You can then take it from there how you wish to treat the predictions, given the predictive variance as well.

borroka · on Dec 23, 2020

The choice is not between Bayesian methods and Deep Learning, but between statistical models and machine learning models (say, from random forest to GBM to xgboost and then maybe Deep Learning). There is overlap between statistical models and machine learning models—it is a matter sometimes of focus—and Bayesian methods can also be applied to what are typically considered ML approaches (see for example Bayesian hierarchical random forest).

lhomdee · on Dec 24, 2020

But are machine learning models not statistical models? There is sample data which is statistics, and the objective function is also statistics, eg mean square error or negative log-likelihood or ELBO. And if you’re using stochastic gradient descent or a form of it, then that has statistical properties.

I don’t see any clear distinction between machine learning and statistics. Machine learning is a type of statistical model which relies on iterative optimization.

Bayesian inference on the other hand is a specific type of statistical model where the aim is to model distributions, not just the output variable (which is what ML is traditionally focused on).

And yes there is overlap, you can take a Bayesian approach to machine learning and that can make total sense sometimes.

borroka · on Dec 25, 2020

"Bayesian inference on the other hand is a specific type of statistical model where the aim is to model distributions, not just the output variable (which is what ML is traditionally focused on)." - What distribution are you referring to? One of the advantages of the Bayesian approach (in the context of models and not of probability, it is not a model, it is a way of estimating the values of the parameters of a model) is that it provides a proper statistical distribution of parameters and model predictions, and not a distribution based on theoretical formulas that require certain assumptions to be true to have certain properties.

You can read more at https://www.fharrell.com/post/stat-ml/ (Frank Harrell is a top statistician who was once a frequentist and now is a bayesian. He writes also on the differences between ML and SM and how to choose between the two)

kj98uo · on Dec 23, 2020

I am still learning about Bayesian inference so this might be off-base but isn't the point to compute the full posterior distribution (or an approximation thereof) of the underlying parameters. Whether this is done in the context of a linear model or a deep neural network is a question of tractability.

The other distinction is between discriminative and generative models. In a discriminative model, the output/label is being predicted based on the input features: p(y|x, theta). For example, the probability of an image containing a dog, y based on pixels, x. Theta here refers to the parameters one needs to discover.

In a generative model, one instead models the distribution p(x|y, beta) i.e. given the label, say dog, predicting the joint distribution of all the images.

Neural networks with backpropagation can be used for both discriminative and generative models. Bayesian methods can be applied to both discriminative and generative models to compute the full posterior distribution of the parameters, theta and beta.

Edit for clarity: The claim is that the choice of the model vs the choice of inferential methodology (Bayesian vs max likelihood for example) are orthogonal choices.

A neural network doing (discriminative) binary classification based on cross-entropy is maximizing likelihood instead of maximizing the posterior. Most Bayesian examples seem to specify a generative model (a Hidden Markov Model for example) and then infer the posterior. But there's nothing preventing one from using Bayesian methods with discriminative models (generalized linear models) or max likelihood with generative models.
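
Written out, the two orthogonal choices look roughly like this. The index i over independent observations and the prior p(θ) are notation assumed here for illustration; they are not spelled out in the comment.

```latex
\begin{aligned}
\text{discriminative model:}\quad & p(y \mid x, \theta) \\
\text{generative model:}\quad & p(x \mid y, \beta) \\
\text{maximum likelihood:}\quad & \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \prod_{i} p(y_i \mid x_i, \theta) \\
\text{Bayesian posterior:}\quad & p(\theta \mid x, y) \propto p(\theta) \prod_{i} p(y_i \mid x_i, \theta)
\end{aligned}
```

The last two lines use the discriminative likelihood; swapping in p(x | y, β) gives the same two inferential options for the generative model.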

gbrown · on Dec 23, 2020

This question would be super bizarre to anyone coming from a stats background.

Others have commented on the role of inference/estimation, and prediction in small data or non-black-box contexts, so I’ll just add that there are deep theoretical reasons to do Bayesian inference. It’s a framework grounded firmly in decision theory, and provides a coherent way to reason about the world. You can prove, under sensible axioms, that beliefs can be described in terms of probability distributions, and that we should update beliefs based on Bayes’ Rule.

abeppu · on Dec 23, 2020

I like many of the answers to your question. But a refinement of your question is when do we really have to choose between Bayesian inference and deep learning? Under what conditions should one pick Stan over Edward or Pyro?

eggie5 · on Dec 23, 2020

Managing uncertainty with Distributions instead of point estimates

ogogmad · on Dec 23, 2020

I asked essentially the same question as you 5 minutes before you did. Have an upvote anyway.

[Edit] I don't understand these downvotes.

ogogmad · on Dec 23, 2020

Anonymous passive aggressive downvoting cowards go to hell.

deugtniet · on Dec 23, 2020

I've dabbled in Stan, and it's really good and state of the art for Bayesian inference. Getting started with Stan is a bit difficult though, as it has a C-like programming language that is difficult to master initially. Especially since statistics is usually done in languages like R, the learning curve is a bit steep for beginners.

I've personally liked PyMC for simple models and relative ease of inference, as it's more integrated with the Python language. That being said, if you want the latest in inference methods and statistical alchemy, Stan is the place to go.

phillc73 · on Dec 23, 2020

There are a good range of other programming language interfaces to Stan. The R one is quite popular.[1]

You do still need the C++ toolchain, but can just write your code in R.

[1] https://mc-stan.org/rstan/

standevbob · on Dec 23, 2020

Stan requires models to be coded in the Stan language, which is a simple imperative language that's like MATLAB with explicit data types. This is the same as was done in Stan's predecessors, BUGS and JAGS.

A Stan program can be run in any of our interfaces in Python, Julia, R, MATLAB, Stata, etc. But you can't mix any of those languages into a Stan program.

The C++ toolchain is required because Stan transpiles its programs to C++, then compiles those against the Stan math library, which does autodiff. But you don't need to write any C++ to use Stan, just to develop extensions for it.

melling · on Dec 23, 2020

There’s a Julia Stan too:

http://stanjulia.github.io/Stan.jl/stable/INTRO.html

https://astrostatistics.psu.edu/su14/lectures/BayesComp2014L...

dstick · on Dec 23, 2020

My name is Stan and I just finished a feature on the product I’m working on that used statistical analysis for anomaly detection. So this headline made me smile - thanks for sharing, and apologies for a rather pointless comment otherwise ;-)

harry8 · on Dec 23, 2020

Named for Stanislaw Ulam. https://en.wikipedia.org/wiki/Stanislaw_Ulam

Were you?

mhh__ · on Dec 23, 2020

The name of the main developer of Microsoft's C++ library is a never-ending source of fun

mushufasa · on Dec 23, 2020

Does anyone know of a good article of a comparison between Stan vs PyMC3 for real-world bayesian modelling tasks? E.g. to be used in a production system.

PLenz · on Dec 23, 2020

Stan is one of those technologies I keep finding is actually powering the more 'friendly' interfaces I run for one-off jobs - especially in the MCMC world. Every so often I think I'll spend some time to learn Stan proper but it's such an all-encompassing project that I get intimidated and stick to the derivatives. My loss!

Bravo to the team behind it and for making and supporting such a powerful tool!

pvitz · on Dec 23, 2020

Looking at the user's guide, it seems that Stan also has use cases other than Bayesian inference. Examples are linear regression, mixture models and even ODEs. Does anybody here have experience with Stan and R and could comment on the strengths of Stan in non-Bayesian contexts?

standevbob · on Dec 23, 2020

Stan provides both frequentist inference (penalized maximum likelihood with bootstrapped confidence intervals) and Bayesian inference (MCMC sampling or approximate variational inference).

As currymj says, the differential equations (same for all the linear algebra solvers like eigendecomposition) can be used in defining likelihoods for either Bayesian or frequentist estimation. Same for all of our linear algebra operations and special functions.

Not every model that can be programmed in Stan has a well-defined MLE or proper posterior. Standard hierarchical/multilevel models don't have MLEs, even with standard shrinkage. Bayesian models with improper priors and no data wind up with improper posteriors, etc.

Having said all that, almost all of the use of Stan is for Bayesian inference.

ChrisRackauckas · on Dec 23, 2020

DiffEqBayes.jl can transpile Julia ODE code to Stan. This is a nice interface to use Stan directly from Julia, and also makes it easy to benchmark the ODE inference in a bunch of PPLs. Some benchmarks:

https://benchmarks.sciml.ai/html/ParameterEstimation/DiffEqB...

Crye · on Dec 23, 2020

I don't know what your experience with Bayesian modeling is, and I'll admit mine is limited, but Stan can solve linear regression by defining a linear model and then setting the dimension parameters as a normal distribution to solve. This is great because it gives you a measure of certainty for each of your parameters.

currymj · on Dec 23, 2020

I believe the main reason for including ODE solvers is basically to do Bayesian parameter estimation of ODEs from data.

Likewise, as far as I know, linear regression and mixture models are both done in a Bayesian style (a hierarchical model giving priors for parameters).

elsherbini · on Dec 23, 2020

Stan uses MCMC (specifically NUTS, which is a Hamiltonian Monte Carlo sampler) to optimize parameter fitting, so it can be used for things like ODEs.

Here is an example from a class taught last January that uses stan to fit a simple ODE (using the `integrate_ode_rk45` function in stan):

https://github.com/gregbritten/BayesianEcosystems_IAP/blob/m...
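
The general pattern of putting an ODE solver inside a likelihood can be sketched as follows. This is not the linked class example: the decay equation dy/dt = -k*y, the synthetic observations, the noise level and the fixed-step RK4 integrator (swapped in for Stan's adaptive RK45 solver) are all assumptions for illustration, and a plain grid search over k stands in for the sampler.

```python
import math
import random

random.seed(3)


def rk4_solve(f, y0, t_end, n_steps):
    """Integrate dy/dt = f(t, y) from t = 0 to t_end with fixed-step RK4."""
    h = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y


# Synthetic observations of exponential decay (all values are illustrative).
Y0, TRUE_K, NOISE = 10.0, 0.7, 0.1
times = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [Y0 * math.exp(-TRUE_K * t) + random.gauss(0, NOISE) for t in times]


def log_lik(k):
    """Gaussian log-likelihood of the data given decay rate k, via the ODE solver."""
    total = 0.0
    for t, y in zip(times, y_obs):
        y_model = rk4_solve(lambda _t, yy: -k * yy, Y0, t, int(100 * t))
        total += -((y - y_model) ** 2) / (2 * NOISE ** 2)
    return total


# Evaluate the likelihood on a grid of candidate rates and keep the best one.
candidates = [0.10 + 0.01 * i for i in range(141)]
best_k = max(candidates, key=log_lik)
print(f"best-fitting decay rate on the grid: {best_k:.2f}")
```

In a Bayesian fit, the same log_lik plus a log prior would be handed to the sampler instead of being maximized on a grid.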

nlpNick · on Dec 23, 2020

You may find the `rstanarm` package interesting/useful. I've used it for linear regression and HLMs. https://mc-stan.org/rstanarm/articles/index.html

elsherbini · on Dec 23, 2020

Does anyone know what a practical upper limit is for Stan in terms of size of data set / number of parameters to fit? Can you use stan to fit a model with ~10^4 parameters and ~10^6 rows of data if you had access to ~10^3 cores? How long would it take?

standevbob · on Dec 23, 2020

Yes, we regularly use Stan's MCMC to fit relatively simple time-series regression models or item-response theory type models with 10^5 parameters and 10^6 rows of data on a desktop computer. It can take a day, though. It's much faster with variational inference, but that can be less stable and it doesn't give you the same uncertainty quantification because of the way the KL-divergence is ordered in the objective.

Stan can parallelize multiple chains and it can parallelize the density/gradient calculations in a single chain. But for the latter to be efficient, the chunks being parallelized need to be compute intensive, like you might get in a pharmacometric compartment model where you might have to solve a bunch of differential equations for each of thousands of patients in a clinical trial.

hendzen · on Dec 23, 2020

Stan used to only be able to parallelize across chains but they introduced within-chain parallelism this year. Even then some of the work is still serial so I don’t think you can expect linear speedups past a certain point.

gbrown · on Dec 23, 2020

I imagine it depends on which algorithm you’re using - maybe with their VB functionality (I almost entirely use the souped up NUTS algorithm for full Bayesian inference).

It also likely depends on how well conditioned your model is - even if you can get it to run for huge models on reasonable hardware, convergence may not be practical.

ogogmad · on Dec 23, 2020

Obligatory question: What applications does Stan have? I'm aware of Facebook's Prophet model for time series prediction, but what about others?

I find there's a lot of excitement around Bayesian inference and MCMC, but I do wonder about the substance.

usgroup · on Dec 23, 2020

Bayesian modellers are typically scientists and Stan probably has more indirect than direct users. For example, rstanarm and BRMS are both R regression packages which use Stan which are wildly popular in Bayesian circles. They enable hierarchical Bayesian modelling which can be used to perform a very flexible kind of regression which allows the integration of lots of prior information, and better quantification of uncertainties than previous alternatives.

j7ake · on Dec 23, 2020

Bayesian methods are great if you want to squeeze all the information you get from each of your data points, and also injecting specific prior information to help prevent over fitting.

These methods are ideal for small datasets with correlation structures that aren’t necessarily independent.

Also great if you want uncertainty with your estimates.

standevbob · on Dec 23, 2020

This is the main reason that people use Stan---squeezing as much info out of your data as possible. That and the ability to write custom models for these situations.

There are hundreds of different applications of Stan across the physical, biological, and social sciences, as well as in finance, education, sports analytics, actuarial sciences, transportation planning, all sorts of material and chemical and civil engineering, clinical trials and pharmacometrics, etc. etc. It's most popular in fields like ecology and epidemiology where Bayesian methods are already popular. For instance, many of the Covid models (like the one for NY state) are being built with Stan. All four baseball teams in the semifinals (LCS) use Stan for analytics, for example. Google and Facebook use Stan for ad attribution and resource allocation. It's been used for models of neutrino mass and models of galactic mass, models of supernovas, and it's even used in the LIGO gravitational wave experiments.

noelsusman · on Dec 23, 2020

It's really good for hierarchical models. I used it this year to model PPE usage for a large health system. It let me easily share information across hospitals and embed knowledge of how different PPE items interact with each other. As always, there are other ways to accomplish this, but it felt natural in Stan.

stdbrouw · on Dec 23, 2020

Any kind of statistical modeling that doesn't fit neatly into an existing "one (meta)model to rule them all" framework such as generalized linear models.

usgroup · on Dec 23, 2020

I don't think that's accurate. Sampling based approaches scale badly with data so, although there are a few exceptions, if you're tackling the problem as a hierarchical Bayesian model - which is most often what Stan is used for - you're working with a dataset with a small number of features and fewer than 10k rows.

Stan fittings can be made parallel, some models will scale linearly with the data, but in the main you won't find many big data use cases here.

You also can't use Stan for online learning.

standevbob · on Dec 23, 2020

That's right---Stan doesn't have any online learning facilities. It's very hard to approximate posteriors and chain them, so we don't try.

If by "big data", we're talking about too big to fit in memory, that's right. Stan's fully in-memory. Compute can be distributed and GPU-powered for matrix ops, but all of the data and parameters and the core autodiff expression graph need to fit in memory.

For "medium data", Stan's adaptive Hamiltonian Monte Carlo sampling is much more efficient and scalable to complex models and higher dimensions than Gibbs or Metropolis. I'm fitting a Covid prevalence model using a custom trend-following and mean-reverting second-order autoregression model over 400 distinct regions with weekly data that has 5M data points and 10K parameters and adjusts for sensitivity and specificity of various tests taken. It fits in a single thread using MCMC in 24 hours or so, but we can fit the model with variational inference in a couple minutes. Although variational inference often produces reasonable point estimates in bigger data settings, it doesn't reasonably quantify uncertainty. I'm also working on a genomics model for differential expression of splice variants that involves 120K measurements and just as many parameters to deal with overdispersion of biological replicates in a control and treatment group. We're using variational inference and it fits in a couple minutes for the comparative event probabilities we need to estimate.

harperlee · on Dec 23, 2020

> You also can't use Stan for online learning.

Can’t you loop posteriors back in as the next iteration’s priors to get a system that learns online?

ploika · on Dec 23, 2020

You can, but it's slow and computationally intensive - you still need to be fairly sure that the sampler has converged on the true posterior (I might be a bit off with the terminology there but you know what I mean) before that can become your new prior.

usgroup · on Dec 23, 2020

Stan will get you from a prior to a posterior distribution that is best supported by your data, but typically the posterior and prior distributions will not be of the same form, so there's no loop back to make.

In the case that your model is extremely simple such that your posterior has a "conjugate prior" (i.e. the posterior and prior are the same family of distribution), this sort of loop back is possible. But where this is possible you have no reason at all to use Stan or MCMC since you can just update your posterior directly.
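
The conjugate case can be sketched with a Beta prior on a success probability and binomial data; the `Beta` class and the batch counts below are hypothetical, made up for illustration. Because the posterior is again a Beta, it can be fed back in as the next prior, and doing so batch by batch gives exactly the same answer as one update on all the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beta:
    """Beta(alpha, beta) distribution over a success probability."""

    alpha: float
    beta: float

    def update(self, successes, failures):
        """Conjugate update for binomial data: just add the counts."""
        return Beta(self.alpha + successes, self.beta + failures)

    @property
    def mean(self):
        return self.alpha / (self.alpha + self.beta)


prior = Beta(1.0, 1.0)  # uniform prior (an assumption)
batches = [(3, 7), (5, 5), (9, 1)]  # (successes, failures) arriving over time

# Online: each posterior becomes the prior for the next batch.
posterior = prior
for successes, failures in batches:
    posterior = posterior.update(successes, failures)

# Offline: a single update on the pooled data.
all_at_once = prior.update(
    sum(s for s, _ in batches), sum(f for _, f in batches)
)

print(posterior, all_at_once, posterior.mean)
```

No sampler is involved at any point, which is the reason MCMC adds nothing in this setting.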

harperlee · on Dec 23, 2020

"Typically" depends on your problem, right? So if I know that I want to feedback predictions into the next iteration, I need to take care to structure a model that enables that, by having posteriors with same shape as priors. But from what I understand it seems a design consideration, not a fundamental limitation.

RA_Fisher · on Dec 23, 2020

Check out stan’s variational inference algos. They’re relatively fast (compared to MCMC) at the cost of being approximative.

stdbrouw · on Dec 23, 2020

Fair enough. I read the word "application" as "field of inquiry" where I do think the sky is the limit, but it's true that Stan is primarily geared towards scientific work with small data sets.

elsherbini · on Dec 23, 2020

This was useful. Do you know how painful it would be to use Stan with 100k rows, or even 1m? (For a sorta normal hierarchical model)

usgroup · on Dec 23, 2020

Under the hood Stan attempts to find globally optimal parameter values for your function which you've expressed as a joint probability density. To do this it relies on the same MCMC theoretical results which indicate how the recursive process of sampling and posterior updating leads to the global optimum. The big deal about Stan is that its algorithm for doing this is state-of-the-art, and that it can work with a huge variety (including custom) density functions by utilising auto-differentiation.

Sampling is a slow approach when there are other alternatives. For example, if you are after OLS regression, you can do the equivalent with Stan but it may be an order of magnitude slower than plain OLS. Further, the calculation of your likelihood function will scale linearly with the size of the data. But adding new parameters will scale exponentially, so you may find that a model with 2 free parameters which takes 10 minutes to fit takes 2 hours with 3 parameters.

A good thing about Stan however, is that it is parallelisable so you can run it on many cores (and it will scale linearly for a good while) and you can also run it on MPI across many machines. Some regression functions with very large matrices support GPUs (although Stan requires double precision to work). So to some extent you can "throw more money at it" to get a result out and it has been used for very big data problems in astronomy for example which however utilised something like 600k cores if memory serves correctly.

standevbob · on Dec 23, 2020

Stan supports optimization (L-BFGS) to find (penalized) maximum likelihood or MAP estimates where they exist. Bayesian estimates are typically posterior means, which involve MCMC rather than optimization, and the result is usually far away from the maximum likelihood estimate in high dimensions. I wrote a case study with some simple examples here: https://mc-stan.org/users/documentation/case-studies/curse-d...

Adding new parameters scales as O(N^(5/4)) in HMC, whereas it scales as O(N^2) in Metropolis or Gibbs. It's quadrature that scales exponentially in dimension. There's also a constant factor for posterior correlation, which can get nasty. I regularly fit regressions for epidemiology or genomics or education with 10s or even 100s of thousands of parameters on my notebook with one core and no GPU.

MCMC or optimization can be sub-linear or super-linear in the data, depending on the statistical properties of the posterior. Some non-parametric models like Gaussian processes can be cubic in the data size, whereas regressions are often sub-linear (doubling the data doesn't double computation time) because posteriors are better behaved (more normal in the Gaussian sense) when there's more data and hence easier to explore in fewer log density and gradient evaluations.
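
To put rough numbers on the dimension scaling quoted above, the sketch below compares how the cost grows when going from 10 to 1000 parameters. It only evaluates the stated growth rates; constant factors, posterior correlation and the choice of 10 grid points per dimension are ignored or assumed, so the figures are indicative only.

```python
def relative_cost(exponent, n_from, n_to):
    """Cost ratio for a method whose cost grows like N**exponent."""
    return (n_to / n_from) ** exponent


def grid_evaluations(points_per_dim, dims):
    """Density evaluations needed by a full quadrature grid."""
    return points_per_dim ** dims


hmc_ratio = relative_cost(1.25, 10, 1000)  # O(N^(5/4)) scaling
rw_ratio = relative_cost(2, 10, 1000)  # O(N^2) scaling
grid_digits = len(str(grid_evaluations(10, 1000)))

print(f"HMC: about {hmc_ratio:.0f}x more work")
print(f"Metropolis/Gibbs: about {rw_ratio:.0f}x more work")
print(f"Quadrature grid in 1000 dims: a number with {grid_digits} digits")
```

The polynomial rates differ by a large but finite factor, while the grid count is out of reach long before 1000 dimensions.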

mark_l_watson · on Dec 23, 2020

I will give PyStan a try over the holidays. Also, many thanks for all of the great comments here (statistical modeling is not in my toolbox yet, and the discussion here is grounding).