1) It is pretty amazing that normit transformations (map the quantiles of a non-normal distribution onto a Gaussian and use that) don't seem to be on this guy's radar. We use distributions with linearly additive and affine-invariant properties (normal plus normal is normal, Bernoulli plus Bernoulli is bitwise Bernoulli) because we find linear algebra very useful. Nonparametric tests and procedures erode your power; normit transformations usually increase it. I did part of my dissertation on this; it's partly to do with asymptotics, but also partly due to the robustness of Gaussian error assumptions thanks to the CLT.
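For anyone who hasn't seen it, here's a minimal sketch of the rank-based version of that transform in Python (the function name `normit` and the lognormal toy data are my own stand-ins for illustration):

```python
import numpy as np
from scipy.stats import norm, rankdata

def normit(x):
    """Rank-based inverse normal transform: map sample quantiles onto Gaussian quantiles."""
    ranks = rankdata(x)            # average ranks, so ties are handled gracefully
    u = ranks / (len(x) + 1)       # squeeze the ranks into the open interval (0, 1)
    return norm.ppf(u)             # Gaussian quantile function (inverse CDF)

# e.g. heavily skewed data comes out roughly standard normal
x = np.random.lognormal(size=1000)
z = normit(x)
```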
I realized recently that a lot of the trouble people have with training neural networks stems from their lack of training in basic model evaluation. If you stack a bunch of shitty penalized regressions (which is what linear/logistic + ReLU hinge loss represents) you now have one gigantic shitty regression which is harder to debug. If your early steps are thrown out of whack by outliers, your later steps will be too. Dropout is an attempt to remedy this, but you tend to lose power when you shrink your dataset or model, so (per usual) there really is no such thing as a free lunch. But most of the tradeoffs make more sense when you are able to evaluate each layer as a predictor/filter. Scaling that evaluation up to deep models is hard, therefore debugging deep models is hard. Not exactly a big leap.
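To make the "evaluate each layer as a predictor/filter" idea concrete, here's a rough linear-probe-style sketch, assuming a scikit-learn MLP and a synthetic toy dataset (both are stand-ins, not a recommendation of any particular setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(32, 32), activation="relu",
                    max_iter=1000, random_state=0).fit(X, y)

# walk through the fitted hidden layers and score each one's output with a plain
# logistic regression: "is this layer a useful filter on its own?"
h = X
for i, (W, b) in enumerate(zip(mlp.coefs_[:-1], mlp.intercepts_[:-1])):
    h = np.maximum(h @ W + b, 0.0)   # affine map + ReLU, one stage of the stack
    score = cross_val_score(LogisticRegression(max_iter=1000), h, y, cv=5).mean()
    print(f"layer {i + 1}: probe accuracy {score:.3f}")
```

If a layer's probe score drops off a cliff, that's where to start looking for outliers or a busted fit.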
There is a reason people say "an expert is a master of the fundamentals". Building a castle on a swamp gives poor results. If you can't design an experiment to test your model and its assumptions, your model will suck. This is not rocket surgery. A GPU allows you to make more mistakes, faster, if that's what you want. If you have the fundamentals nailed down, and enough data to avoid overfitting, then nonlinear approaches can be incredibly powerful.
Most of the time a simple logistic regression will offer 80-90% of the power of a DNN, a kernel regression will offer 80-90% of the power of a CNN, and an HMM or Kalman filter will offer 80-90% of the power of an RNN. It's when you need that 10-20% "extra" to compete, and have the data to do it, that deeper or trickier architectures help.
If you can transform a bunch of correlated data so that it is 1) decorrelated and 2) close enough to Gaussian for government work, you suddenly get a tremendous amount of power from linear algebra and differential geometry "for free". This is one reason why Bayesian and graphical hierarchical mixed models work well -- you borrow information when you don't have enough to feed the model, and if you have some domain expertise, this allows you to keep the model from making stupid or impossible predictions.
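The decorrelation half of that is a few lines of numpy (PCA whitening; the toy covariance matrix is just for illustration, and the Gaussianizing half is the normit transform above):

```python
import numpy as np

def whiten(X, eps=1e-8):
    """PCA whitening: center the columns of X and rotate/rescale to ~identity covariance."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ vecs / np.sqrt(vals + eps)

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
print(np.cov(whiten(X), rowvar=False).round(2))   # ~ identity
```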
Anyways. I have had fun lately playing with various deep, recurrent, and adversarial architectures. I don't mean to imply they aren't tremendously powerful in the right hands. But so is a Hole Hawg. Don't use a Hole Hawg when a paper punch is all you really need.
2) What (good) statisticians excel at is catching faulty assumptions. (I'll leave it to the reader to decide whether this data-scientist-for-hire has done a good job of that in his piece.) So we plot our data, marginally or via projections, all the damned time. If you don't, sooner or later it will bite you in the ass, and then you too can join the ranks of the always-plotting. However, choosing which margins or conditional distributions to plot in a high-dimensional or sparse dataset is important to avoid wasting a lot of time. So whether via transformation or penalization (e.g. graphical lasso) or both, we usually try to prune things down and home in on "the good stuff". Prioritizing what to do first is most easily done if you have a number and can rank the putative significance by that number. Use Spearman, use energy statistics (distance correlation), use marginal score tests -- IDGAF, just use these as guidelines and plot the damned data.
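As a sketch of that triage step (deciding which margins to plot first), assuming a feature matrix X and an outcome y -- the ranking is only a guideline for where to point the plotting effort:

```python
import numpy as np
from scipy.stats import spearmanr

def triage_order(X, y):
    """Rank columns of X by |Spearman rho| against y -- a guideline for what to plot first,
    not a substitute for actually plotting it."""
    scores = []
    for j in range(X.shape[1]):
        rho, _ = spearmanr(X[:, j], y)
        scores.append(abs(rho))
    return np.argsort(scores)[::-1]   # most "interesting" columns first
```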
Corollary: if someone shows you fancy plots and never simple ones containing clouds of individual data points, they're probably full of shit. Boxplots should be beeswarms, loess plots should have scatterplots (smoothed or otherwise) behind them. And for god's sake plot your residuals, either implicitly or explicitly.
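A minimal sketch of what that looks like with seaborn (the grouped toy data is made up; the point is the raw points layered over the summary):

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(0)
group = np.repeat(["a", "b", "c"], 50)
value = np.concatenate([rng.normal(m, 1, 50) for m in (0, 1, 3)])

ax = sns.boxplot(x=group, y=value, color="lightgray")   # the summary...
sns.swarmplot(x=group, y=value, size=3, ax=ax)          # ...with the actual points on top
plt.show()
```

Same idea for a loess curve over a scatterplot, or for residuals against fitted values.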
3) see above. The author is good at fussing, and brings up some classical points. But they're not really his points. Median and MAD are more robust to outliers than mean and standard deviation, but that makes them less sensitive, too. Check your assumptions, plot everything, use the numbers as advisory quantities rather than final results.
> it is pretty amazing that normit transformations (map the quantiles of a non-normal distribution onto a Gaussian and use that) don't seem to be on this guy's radar
I don't mean to offend, but this is the PhD ur-response, "you didn't mention my pet theory!" :-)
You've given me some interesting stuff to chew on, but I very specifically wanted to write about descriptive statistics as a way to describe data, not as a way to summarize it for computers so it can be used in inference. Mapping non-normal distributions onto a Gaussian ain't gonna cut it for that purpose, and to the extent that I care about robustness in this context it's not robustness of inference but whether a descriptive statistic continues to provide a reasonable description of the data for human consumption in the face of outliers etc.
re: normit/inverse transformation: Pet theory? This isn't a theory, it's a simple inverse transformation. It's used all the time. George Box showed how to do a 2D version of it in the 1950s. Normit may be a pet name for it, though. It's just a play on words (probit, logit, expit... normit).
As far as describing data, what's wrong with median + IQR for marginal distributions, or some flavor of energy statistic for joint distributions? You will always need to trade off robustness for sensitivity and bias for variance. That's simply a mathematical feature of the universe. There are plenty of ways to take advantage of this to highlight outliers, for example, which often gets you thinking in terms of "hey, this really looks more like a mixture of two completely different distributions" and seeing if that intuition holds up.
The whole point of describing data with summary statistics is that if the assumptions are met this decouples them from the underlying data. If you use the median as an estimator for the center of your distribution and the MAD as an estimator of its scale, you may choose dimensions along which they're not very good at partitioning your observations. If you want a resistant way to describe the expected center of your data, the median and MAD are very useful. Sometimes it's even more useful to plot everything and point out "our results fall into K obvious clusters", each of which will have its own center & spread.
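For concreteness, a resistant marginal summary along those lines is only a few lines (using scipy's scaled MAD so it's comparable to an SD under normality):

```python
import numpy as np
from scipy.stats import iqr, median_abs_deviation

def robust_summary(x):
    """Resistant location/scale description of a marginal distribution."""
    return {
        "median": np.median(x),
        "iqr": iqr(x),
        "mad": median_abs_deviation(x, scale="normal"),  # comparable to SD if x were Gaussian
    }

# one wild point barely moves the median/MAD, but it drags the mean/SD around
x = np.append(np.random.normal(size=200), 50.0)
print(robust_summary(x))
print({"mean": x.mean(), "sd": x.std()})
```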
What I'm saying is that there's no silver bullet. Most of the time we take descriptions of the data, see if there are interesting inferences to be drawn, lather, rinse, repeat, with "Get a lot more samples" often thrown into the mix. Are there strong clusters in the data? (Usually a plot will show this, whether via projection or in the raw data.) Are there continua that are interesting in relation to things we care about? (Usually we'll turn around and model their relationship to said thing-we-care-about, conditioned upon a bunch of other items... multivariate regression, which, if you're doing it right, will get you plotting the residuals, themselves descriptive of the model fit.)
You simply can't do responsible statistical inference without exploring your data to see what's going on. In order to explore complex datasets, there are plenty of techniques, and almost all of them demand tradeoffs (see MAD vs. SD, or other metrics of "interestingness" for clustering). A number of descriptive statistics ("extremality", for example) rely upon the limit behavior of specific distributions and are case-by-case.
I don't think you'll find many silver bullets for either descriptive or inferential statistics. You have to choose your tradeoffs based on what you want to accomplish.
Several great points (normit is the basis for the Gaussian copula, which was used to great effect to model the CDOs (collateralised debt obligations) that blew up in the GFC (global financial crisis)); but it would have been possible to raise them while being less dismissive...
Yeah, sorry about that. By the time I realized the way the tone had come off, I had managed to "noprocrast" myself off the site. I need to write better hot takes.
I largely agree with what you are saying, but estimating, from a finite sample, the population transformation that makes the data Gaussian is far from trivial.
If you have any pointers to results that show a distribution-free guarantee of increased power, I would be super happy to read them.
Here's a question for you: why not just deal with the quantiles directly (for example with quantile regression for regression tasks) and not map them to the quantiles of a Gaussian?
Quantile regression is computationally intensive and inverse transformations usually less so. Although you could certainly make the case that, given enough data, quantile regression better captures what we actually want to find, most of the time (i.e. how does this effect diverge towards the extremes).
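For what it's worth, a tiny quantile regression sketch with statsmodels (the toy data with tail-widening noise is made up purely to show the slopes diverging across quantiles):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.uniform(0, 10, 2000)})
df["y"] = 1.0 + 0.5 * df["x"] + rng.gumbel(scale=1 + 0.3 * df["x"])

# conditional median vs. an upper quantile; diverging slopes are the "effect at the extremes"
for q in (0.5, 0.9):
    fit = smf.quantreg("y ~ x", df).fit(q=q)
    print(f"q={q}: slope = {fit.params['x']:.2f}")
```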
Normit typically (not sure if universally) has the lovely property of giving you something like a marginal t-test without the assumptions of mixture-of-Gaussians errors. You don't jerk around with U-statistics, and thus the sample size doesn't make the test statistics so damned granular.
Thanks for responding, we are definitely in agreement. I have used both, in my experience quantile regression seems better behaved at the tail than quantile transformation in regression tasks. I think if one can make the transform conditional on the covariate they would be comparable.
Logistic won't do anything useful for text, to be sure, although an HMM often will (or if you have continuous-valued sequences, a Kalman filter often will do the same). Logistic or multinomial can be tremendously handy for picking up interactions between measurements that can be followed up on and/or expanded in the limited-data case.
I think that the nonlinearity is what really sets apart problems better handled by NNs (not just nonlinearity, but nonlinearity that resists any sort of linearizing transformation), even for lowish-dimensional data. If you look at a linear fit plus a ReLU, you're just tacking a hinge loss onto a linear/logistic fit. Stack a bunch of these on top of each other and you have a universal function approximator, for which the goodness of fit is limited by the data. If you don't have a crapton of data, the fit isn't likely to be a lot better than linear or transformed linear. If you do have a crapton of data with nonlinear relationships, the implicit structure can be better captured by the flexibility of an NN. But of course you can also spend a lot of time training and debugging the fit, when it might be possible to quickly fit and diagnose a low-dimensional linear or additive model and put it into production. For a long, long time, the most popular "machine learning" method in the valley was logistic regression :-)
For image classification the way people expect it to be done, you are absolutely right (CNNs are incredibly good at this when given enough labeled data). E.g. for relating histological images to genetics or other markers, there's almost no point in not using a CNN with or without a denoising autoencoder in front of it. For low-detail or sparse-and-low-rank mixtures, often you can use compressive approaches to get a lot faster training. But I'll not argue against CNNs for the general image recognition case.
Linear or logistic is typically a great start, and as you note, it's very general. If after trying the simplest thing that can possibly work (linear or logistic), you need better performance, or the linear models can't give you useful answers, ratcheting up the complexity is a reasonable response. You do need a good deal of data to make the latter step worthwhile in most cases. I see a lot of people skipping the first step or ignoring the need for lots of data, and these are the people who get in trouble.
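In code, that order of operations is not much more than this (scikit-learn models and synthetic data purely as stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=5000, n_features=30, n_informative=10, random_state=0)

models = {
    "logistic (baseline)": LogisticRegression(max_iter=1000),
    "small MLP": MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0),
}
for name, model in models.items():
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
# only pay for the extra complexity (and data, and debugging) if the gap is worth it
```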
Yeah totally, and I usually fall back on "classic" methods like linear and forest algorithms. It's always good to remind myself of how many domains ML is applicable to, not just image and text analysis which seems to be the hot topic of the day.
Sorry, I should have used the phrase "inverse CDF". If you can map a pile of data onto 0 to 1 (i.e. by ranking it), then you can invert it onto the values that occupy those quantiles for a Gaussian. This is nice because you can use quasi-parametric assumptions (very nice when multi-dimensional relationships are to be explored) without the data itself needing to be distributed appropriately. Other alternatives are to use nonparametric smoothers and the like, but I kind of hate going to all that trouble when it's usually pointless. (Important exception: when you have interesting correlated structure in high-dimensional data.)
It's a hack, to be sure, but especially if you want to pool data (e.g. in mixed hierarchical models) for better predictions, it often pays off. The name is a play on "logit", "expit", "probit", "tobit", etc. since the actual transformation is relatively trivial for data that is already sorted. (For large unsorted data or streams, not so much)