1) It is pretty amazing that normit transformations (map the quantiles of a non-normal distribution onto a Gaussian and use that) don't seem to be on this guy's radar. We use distributions with linearly additive and affine-invariant properties (normal plus normal is normal, Bernoulli plus Bernoulli is bitwise Bernoulli) because we find linear algebra very useful. Nonparametric tests and procedures erode your power; normit transformations usually increase it. I did part of my dissertation on this; it's partly to do with asymptotics, but also partly due to the robustness of Gaussian error assumptions thanks to the CLT.
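For anyone who hasn't seen it, here's a minimal sketch of the rank-based version of that transform in Python (the function name `normit` and the lognormal toy data are my own stand-ins for illustration):

```python
import numpy as np
from scipy.stats import norm, rankdata

def normit(x):
    """Rank-based inverse normal transform: map sample quantiles onto Gaussian quantiles."""
    ranks = rankdata(x)            # average ranks, so ties are handled gracefully
    u = ranks / (len(x) + 1)       # squeeze the ranks into the open interval (0, 1)
    return norm.ppf(u)             # Gaussian quantile function (inverse CDF)

# e.g. heavily skewed data comes out roughly standard normal
x = np.random.lognormal(size=1000)
z = normit(x)
```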
I realized recently that a lot of the trouble people have with training neural networks stems from their lack of training in basic model evaluation. If you stack a bunch of shitty penalized regressions (which is what linear/logistic + ReLU hinge loss represents) you now have one gigantic shitty regression which is harder to debug. If your early steps are thrown out of whack by outliers, your later steps will be too. Dropout is an attempt to remedy this, but you tend to lose power when you shrink your dataset or model, so (per usual) there really is no such thing as a free lunch. But most of the tradeoffs make more sense when you are able to evaluate each layer as a predictor/filter. Scaling that evaluation up to deep models is hard, therefore debugging deep models is hard. Not exactly a big leap.
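To make the "evaluate each layer as a predictor/filter" idea concrete, here's a rough linear-probe-style sketch, assuming a scikit-learn MLP and a synthetic toy dataset (both are stand-ins, not a recommendation of any particular setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(32, 32), activation="relu",
                    max_iter=1000, random_state=0).fit(X, y)

# walk through the fitted hidden layers and score each one's output with a plain
# logistic regression: "is this layer a useful filter on its own?"
h = X
for i, (W, b) in enumerate(zip(mlp.coefs_[:-1], mlp.intercepts_[:-1])):
    h = np.maximum(h @ W + b, 0.0)   # affine map + ReLU, one stage of the stack
    score = cross_val_score(LogisticRegression(max_iter=1000), h, y, cv=5).mean()
    print(f"layer {i + 1}: probe accuracy {score:.3f}")
```

If a layer's probe score drops off a cliff, that's where to start looking for outliers or a busted fit.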
There is a reason people say "an expert is a master of the fundamentals". Building a castle on a swamp gives poor results. If you can't design an experiment to test your model and its assumptions, your model will suck. This is not rocket surgery. A GPU allows you to make more mistakes, faster, if that's what you want. If you have the fundamentals nailed down, and enough data to avoid overfitting, then nonlinear approaches can be incredibly powerful.
Most of the time a simple logistic regression will offer 80-90% of the power of a DNN, a kernel regression will offer 80-90% of the power of a CNN, and an HMM or Kalman filter will offer 80-90% of the power of an RNN. It's when you need that 10-20% "extra" to compete, and have the data to do it, that deeper or trickier architectures help.
If you can transform a bunch of correlated data so that it is 1) decorrelated and 2) close enough to Gaussian for government work, you suddenly get a tremendous amount of power from linear algebra and differential geometry "for free". This is one reason why Bayesian and graphical hierarchical mixed models work well -- you borrow information when you don't have enough to feed the model, and if you have some domain expertise, this allows you to keep the model from making stupid or impossible predictions.
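The decorrelation half of that is a few lines of numpy (PCA whitening; the toy covariance matrix is just for illustration, and the Gaussianizing half is the normit transform above):

```python
import numpy as np

def whiten(X, eps=1e-8):
    """PCA whitening: center the columns of X and rotate/rescale to ~identity covariance."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ vecs / np.sqrt(vals + eps)

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
print(np.cov(whiten(X), rowvar=False).round(2))   # ~ identity
```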
Anyways. I have had fun lately playing with various deep, recurrent, and adversarial architectures. I don't mean to imply they aren't tremendously powerful in the right hands. But so is a Hole Hawg. Don't use a Hole Hawg when a paper punch is all you really need.
2) What (good) statisticians excel at is catching faulty assumptions. (I'll leave it to the reader to decide whether this data-scientist-for-hire has done a good job of that in his piece.) So we plot our data, marginally or via projections, all the damned time. If you don't, sooner or later it will bite you in the ass, and then you too can join the ranks of the always-plotting. However, choosing which margins or conditional distributions to plot in a high-dimensional or sparse dataset is important to avoid wasting a lot of time. So whether via transformation or penalization (e.g. graphical lasso) or both, we usually try to prune things down and home in on "the good stuff". Prioritizing what to do first is most easily done if you have a number and can rank the putative significance by that number. Use Spearman, use energy statistics (distance correlation), use marginal score tests -- IDGAF, just use these as guidelines and plot the damned data.
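As a sketch of that triage step (deciding which margins to plot first), assuming a feature matrix X and an outcome y -- the ranking is only a guideline for where to point the plotting effort:

```python
import numpy as np
from scipy.stats import spearmanr

def triage_order(X, y):
    """Rank columns of X by |Spearman rho| against y -- a guideline for what to plot first,
    not a substitute for actually plotting it."""
    scores = []
    for j in range(X.shape[1]):
        rho, _ = spearmanr(X[:, j], y)
        scores.append(abs(rho))
    return np.argsort(scores)[::-1]   # most "interesting" columns first
```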
Corollary: if someone shows you fancy plots and never simple ones containing clouds of individual data points, they're probably full of shit. Boxplots should be beeswarms, loess plots should have scatterplots (smoothed or otherwise) behind them. And for god's sake plot your residuals, either implicitly or explicitly.
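A minimal sketch of what that looks like with seaborn (the grouped toy data is made up; the point is the raw points layered over the summary):

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(0)
group = np.repeat(["a", "b", "c"], 50)
value = np.concatenate([rng.normal(m, 1, 50) for m in (0, 1, 3)])

ax = sns.boxplot(x=group, y=value, color="lightgray")   # the summary...
sns.swarmplot(x=group, y=value, size=3, ax=ax)          # ...with the actual points on top
plt.show()
```

Same idea for a loess curve over a scatterplot, or for residuals against fitted values.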
3) see above. The author is good at fussing, and brings up some classical points. But they're not really his points. Median and MAD are more robust to outliers than mean and standard deviation, but that makes them less sensitive, too. Check your assumptions, plot everything, use the numbers as advisory quantities rather than final results.
> it is pretty amazing that normit transformations (map the quantiles of a non-normal distribution onto a Gaussian and use that) don't seem to be on this guy's radar
I don't mean to offend, but this is the PhD ur-response, "you didn't mention my pet theory!" :-)
You've given me some interesting stuff to chew on, but I very specifically wanted to write about descriptive statistics as a way to describe data, not as a way to summarize it for computers so it can be used in inference. Mapping non-normal distributions onto a Gaussian ain't gonna cut it for that purpose, and to the extent that I care about robustness in this context it's not robustness of inference but whether a descriptive statistic continues to provide a reasonable description of the data for human consumption in the face of outliers etc.
re: normit/inverse transformation: Pet theory? This isn't a theory, it's a simple inverse transformation. It's used all the time. George Box showed how to do a 2D version of it in the 1950s. Normit may be a pet name for it, though. It's just a play on words (probit, logit, expit... normit).
As far as describing data, what's wrong with median + IQR for marginal distributions, or some flavor of energy statistic for joint distributions? You will always need to trade off robustness for sensitivity and bias for variance. That's simply a mathematical feature of the universe. There are plenty of ways to take advantage of this to highlight outliers, for example, which often gets you thinking in terms of "hey, this really looks more like a mixture of two completely different distributions" and seeing if that intuition holds up.
The whole point of describing data with summary statistics is that if the assumptions are met this decouples them from the underlying data. If you use the median as an estimator for the center of your distribution and the MAD as an estimator of its scale, you may choose dimensions along which they're not very good at partitioning your observations. If you want a resistant way to describe the expected center of your data, the median and MAD are very useful. Sometimes it's even more useful to plot everything and point out "our results fall into K obvious clusters", each of which will have its own center & spread.
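For concreteness, a resistant marginal summary along those lines is only a few lines (using scipy's scaled MAD so it's comparable to an SD under normality):

```python
import numpy as np
from scipy.stats import iqr, median_abs_deviation

def robust_summary(x):
    """Resistant location/scale description of a marginal distribution."""
    return {
        "median": np.median(x),
        "iqr": iqr(x),
        "mad": median_abs_deviation(x, scale="normal"),  # comparable to SD if x were Gaussian
    }

# one wild point barely moves the median/MAD, but it drags the mean/SD around
x = np.append(np.random.normal(size=200), 50.0)
print(robust_summary(x))
print({"mean": x.mean(), "sd": x.std()})
```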
What I'm saying is that there's no silver bullet. Most of the time we take descriptions of the data, see if there are interesting inferences to be drawn, lather, rinse, repeat, with "Get a lot more samples" often thrown into the mix. Are there strong clusters in the data? (Usually a plot will show this, whether via projection or in the raw data.) Are there continua that are interesting in relation to things we care about? (Usually we'll turn around and model their relationship to said thing-we-care-about, conditioned upon a bunch of other items... multivariate regression, which, if you're doing it right, will get you plotting the residuals, themselves descriptive of the model fit.)
You simply can't do responsible statistical inference without exploring your data to see what's going on. In order to explore complex datasets, there are plenty of techniques, and almost all of them demand tradeoffs (see MAD vs. SD, or other metrics of "interestingness" for clustering). A number of descriptive statistics ("extremality", for example) rely upon the limit behavior of specific distributions and are case-by-case.
I don't think you'll find many silver bullets for either descriptive or inferential statistics. You have to choose your tradeoffs based on what you want to accomplish.
Several great points (normit is the basis for the Gaussian copula, which was used to great effect to model the CDOs (collateralised debt obligations) that blew up in the GFC (global financial crisis)); but it would have been possible to raise them while being less dismissive...
Yeah, sorry about that. By the time I realized the way the tone had come off, I had managed to "noprocrast" myself off the site. I need to write better hot takes.
I largely agree with what you are saying, but estimating, from a finite sample, the population transformation that makes the data Gaussian is far from trivial.
If you have any pointers to results that show a distribution-free guarantee of increased power, I would be super happy to read them.
Here's a question for you: why not just deal with the quantiles directly (for example with quantile regression for regression tasks) and not map them to the quantiles of a Gaussian?
Quantile regression is computationally intensive and inverse transformations usually less so. Although you could certainly make the case that, given enough data, quantile regression better captures what we actually want to find, most of the time (i.e. how does this effect diverge towards the extremes).
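For what it's worth, a tiny quantile regression sketch with statsmodels (the toy data with tail-widening noise is made up purely to show the slopes diverging across quantiles):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.uniform(0, 10, 2000)})
df["y"] = 1.0 + 0.5 * df["x"] + rng.gumbel(scale=1 + 0.3 * df["x"])

# conditional median vs. an upper quantile; diverging slopes are the "effect at the extremes"
for q in (0.5, 0.9):
    fit = smf.quantreg("y ~ x", df).fit(q=q)
    print(f"q={q}: slope = {fit.params['x']:.2f}")
```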
Normit typically (not sure if universally) has the lovely property of giving you something like a marginal t-test without the assumptions of mixture-of-Gaussians errors. You don't jerk around with U-statistics, and thus the sample size doesn't make the test statistics so damned granular.
Thanks for responding, we are definitely in agreement. I have used both, in my experience quantile regression seems better behaved at the tail than quantile transformation in regression tasks. I think if one can make the transform conditional on the covariate they would be comparable.
Logistic won't do anything useful for text, to be sure, although an HMM often will (or if you have continuous-valued sequences, a Kalman filter often will do the same). Logistic or multinomial can be tremendously handy for picking up interactions between measurements that can be followed up on and/or expanded in the limited-data case.
I think that the nonlinearity is what really sets apart problems better handled by NNs (not just nonlinearity, but nonlinearity that resists any sort of linearizing transformation), even for lowish-dimensional data. If you look at a linear fit plus a ReLU, you're just tacking a hinge loss onto a linear/logistic fit. Stack a bunch of these on top of each other and you have a universal function approximator, for which the goodness of fit is limited by the data. If you don't have a crapton of data, the fit isn't likely to be a lot better than linear or transformed linear. If you do have a crapton of data with nonlinear relationships, the implicit structure can be better captured by the flexibility of an NN. But of course you can also spend a lot of time training and debugging the fit, when it might be possible to quickly fit and diagnose a low-dimensional linear or additive model and put it into production. For a long, long time, the most popular "machine learning" method in the valley was logistic regression :-)
For image classification the way people expect it to be done, you are absolutely right (CNNs are incredibly good at this when given enough labeled data). E.g. for relating histological images to genetics or other markers, there's almost no point in not using a CNN with or without a denoising autoencoder in front of it. For low-detail or sparse-and-low-rank mixtures, often you can use compressive approaches to get a lot faster training. But I'll not argue against CNNs for the general image recognition case.
Linear or logistic is typically a great start, and as you note, it's very general. If after trying the simplest thing that can possibly work (linear or logistic), you need better performance, or the linear models can't give you useful answers, ratcheting up the complexity is a reasonable response. You do need a good deal of data to make the latter step worthwhile in most cases. I see a lot of people skipping the first step or ignoring the need for lots of data, and these are the people who get in trouble.
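In code, that order of operations is not much more than this (scikit-learn models and synthetic data purely as stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=5000, n_features=30, n_informative=10, random_state=0)

models = {
    "logistic (baseline)": LogisticRegression(max_iter=1000),
    "small MLP": MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0),
}
for name, model in models.items():
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
# only pay for the extra complexity (and data, and debugging) if the gap is worth it
```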
Yeah totally, and I usually fall back on "classic" methods like linear and forest algorithms. It's always good to remind myself of how many domains ML is applicable to, not just image and text analysis which seems to be the hot topic of the day.
Sorry, I should have used the phrase "inverse CDF". If you can map a pile of data onto 0 to 1 (i.e. by ranking it), then you can invert it onto the values that occupy those quantiles for a Gaussian. This is nice because you can use quasi-parametric assumptions (very nice when multi-dimensional relationships are to be explored) without the data itself needing to be distributed appropriately. Other alternatives are to use nonparametric smoothers and the like, but I kind of hate going to all that trouble when it's usually pointless. (Important exception: when you have interesting correlated structure in high-dimensional data.)
It's a hack, to be sure, but especially if you want to pool data (e.g. in mixed hierarchical models) for better predictions, it often pays off. The name is a play on "logit", "expit", "probit", "tobit", etc. since the actual transformation is relatively trivial for data that is already sorted. (For large unsorted data or streams, not so much)