
This was useful. Do you know how painful it would be to use Stan with 100k rows, or even 1m? (For a sorta normal hierarchical model)


Under the hood, Stan attempts to find globally optimal parameter values for the model you've expressed as a joint probability density. To do this it relies on standard MCMC theory, which describes how the recursive process of sampling and posterior updating leads to that optimum. The big deal about Stan is that its algorithm for doing this is state-of-the-art, and that it can work with a huge variety of density functions (including custom ones) by using auto-differentiation.
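For concreteness, here is a rough sketch of what "expressing the joint probability density" looks like in practice: a toy Stan program driven from Python with cmdstanpy. The model, priors, and data are placeholders, and it assumes cmdstanpy and CmdStan are installed.

    import numpy as np
    from pathlib import Path
    from cmdstanpy import CmdStanModel

    stan_code = """
    data {
      int<lower=0> N;
      vector[N] y;
    }
    parameters {
      real mu;
      real<lower=0> sigma;
    }
    model {
      // each ~ statement adds a term to the log joint density;
      // gradients of that density come from auto-differentiation
      mu ~ normal(0, 10);
      sigma ~ normal(0, 5);
      y ~ normal(mu, sigma);
    }
    """

    Path("toy.stan").write_text(stan_code)
    model = CmdStanModel(stan_file="toy.stan")

    data = {"N": 100, "y": np.random.normal(1.0, 2.0, size=100).tolist()}
    fit = model.sample(data=data, chains=4)  # NUTS/HMC over that density
    print(fit.summary())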

Sampling is a slow approach when more direct alternatives exist. For example, if you are after OLS regression, you can do the equivalent in Stan, but it may be an order of magnitude slower than plain OLS. Further, evaluating your likelihood function scales linearly with the size of the data, but adding new parameters scales exponentially, so you may find that a model with 2 free parameters which takes 10 minutes to fit takes 2 hours with 3.
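As a rough illustration of the OLS comparison (placeholder data; assumes numpy and cmdstanpy): both approaches recover essentially the same coefficients, but the Stan run does vastly more work because it draws from the full posterior.

    import numpy as np
    from pathlib import Path
    from cmdstanpy import CmdStanModel

    rng = np.random.default_rng(0)
    N, K = 10_000, 3
    X = rng.normal(size=(N, K))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=N)

    # Plain OLS: a single linear-algebra call.
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

    # The "equivalent" in Stan: sampling the posterior of the same regression.
    stan_code = """
    data {
      int<lower=0> N;
      int<lower=0> K;
      matrix[N, K] X;
      vector[N] y;
    }
    parameters {
      vector[K] beta;
      real<lower=0> sigma;
    }
    model {
      y ~ normal(X * beta, sigma);  // flat priors, so the mode matches OLS
    }
    """
    Path("linreg.stan").write_text(stan_code)
    model = CmdStanModel(stan_file="linreg.stan")
    fit = model.sample(data={"N": N, "K": K, "X": X, "y": y})
    beta_bayes = fit.stan_variable("beta").mean(axis=0)  # posterior means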

A good thing about Stan, however, is that it is parallelisable, so you can run it on many cores (and it will scale roughly linearly for a good while), and you can also run it over MPI across many machines. Some regression functions with very large matrices support GPUs (although Stan requires double precision). So to some extent you can "throw more money at it" to get a result out; it has been used for very big data problems in astronomy, for example, though if memory serves that run used something like 600k cores.
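If it helps, this is roughly what the multi-core story looks like with Stan's reduce_sum (within-chain threading). Names, grainsize, and data here are placeholders; the MPI route uses map_rect plus a cluster setup instead.

    import numpy as np
    from pathlib import Path
    from cmdstanpy import CmdStanModel

    stan_code = """
    functions {
      // log-likelihood of one slice of the data; Stan runs slices in parallel
      real partial_normal_sum(array[] real y_slice, int start, int end,
                              real mu, real sigma) {
        return normal_lpdf(y_slice | mu, sigma);
      }
    }
    data {
      int<lower=0> N;
      array[N] real y;
    }
    parameters {
      real mu;
      real<lower=0> sigma;
    }
    model {
      mu ~ normal(0, 10);
      sigma ~ normal(0, 5);
      target += reduce_sum(partial_normal_sum, y, 1000, mu, sigma);  // grainsize 1000
    }
    """
    Path("parallel.stan").write_text(stan_code)

    # STAN_THREADS enables within-chain threading; 4 chains x 4 threads = 16 cores.
    model = CmdStanModel(stan_file="parallel.stan",
                         cpp_options={"STAN_THREADS": True})
    y = np.random.normal(1.0, 2.0, size=100_000)
    fit = model.sample(data={"N": len(y), "y": y},
                       chains=4, parallel_chains=4, threads_per_chain=4)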


Stan supports optimization (L-BFGS) to find (penalized) maximum likelihood or MAP estimates where they exist. Bayesian estimates are typically posterior means, which involve MCMC rather than optimization, and the result is usually far away from the maximum likelihood estimate in high dimensions. I wrote a case study with some simple examples here: https://mc-stan.org/users/documentation/case-studies/curse-d...
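For anyone curious what that looks like at the interface level, switching between the optimizer and the sampler is just a different method call. A sketch with the usual toy Bernoulli model, via cmdstanpy:

    from pathlib import Path
    from cmdstanpy import CmdStanModel

    stan_code = """
    data {
      int<lower=0> N;
      array[N] int<lower=0, upper=1> y;
    }
    parameters {
      real<lower=0, upper=1> theta;
    }
    model {
      theta ~ beta(1, 1);
      y ~ bernoulli(theta);
    }
    """
    Path("bernoulli.stan").write_text(stan_code)
    model = CmdStanModel(stan_file="bernoulli.stan")
    data = {"N": 10, "y": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]}

    # L-BFGS: a point estimate (penalized MLE / MAP, where it exists).
    mle = model.optimize(data=data, algorithm="lbfgs")
    print(mle.optimized_params_dict)

    # Full Bayes: MCMC draws, summarised here by the posterior mean.
    fit = model.sample(data=data)
    print(fit.stan_variable("theta").mean())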

Adding new parameters scales as O(N^(5/4)) in HMC, whereas it scales as O(N^2) in Metropolis or Gibbs. It's quadrature that scales exponentially in dimension. There's also a constant factor for posterior correlation, which can get nasty. I regularly fit regressions for epidemiology or genomics or education with 10s or even 100s of thousands of parameters on my notebook with one core and no GPU.
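To put rough numbers on those exponents (back-of-the-envelope only, ignoring constants):

    # Cost multiplier when the number of parameters doubles, under the
    # quoted scaling exponents.
    print(2 ** (5 / 4))  # ~2.4x for HMC
    print(2 ** 2)        # 4x for Metropolis / Gibbs

    # Quadrature is what blows up exponentially: a grid with 10 nodes
    # per dimension needs 10**d evaluations.
    for d in (2, 5, 10):
        print(d, 10 ** d)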

MCMC or optimization can be sub-linear or super-linear in the data, depending on the statistical properties of the posterior. Some non-parametric models like Gaussian processes can be cubic in the data size, whereas regressions are often sub-linear (doubling the data doesn't double computation time) because posteriors are better behaved (more normal in the Gaussian sense) when there's more data and hence easier to explore in fewer log density and gradient evaluations.
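The Gaussian-process point is easy to see in isolation: the bottleneck is factoring an n-by-n covariance matrix, which is O(n^3), so doubling the data costs roughly 8x. A toy numpy sketch (kernel and sizes are arbitrary):

    import time
    import numpy as np

    def gp_factor_time(n, length_scale=1.0):
        """Time the Cholesky factorization of an n x n squared-exponential kernel."""
        x = np.linspace(0, 10, n)[:, None]
        K = np.exp(-0.5 * ((x - x.T) / length_scale) ** 2) + 1e-6 * np.eye(n)
        t0 = time.perf_counter()
        np.linalg.cholesky(K)
        return time.perf_counter() - t0

    for n in (1000, 2000, 4000):
        print(n, round(gp_factor_time(n), 4))  # each doubling roughly 8x slower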



