I think the message of the article is great: move beyond the "standard" descriptions and pay more attention to what you're trying to show and who your audience is.
That said, it's a slight pet peeve of mine when people recommend the median over the mean to describe center. The median, on its own, does not describe what is "typical" any more than the mean does; it just has a small advantage in that it will always map to a real observation, so for discrete data you don't end up with things like "1.9 legs." (That said, a mean of 1.9 legs actually seems much more informative to me than a median of 2 legs, so even in that case I prefer the mean.) It's easy to envision many situations where the median's representation is wildly inaccurate, just as you can imagine ways in which the mean can be misleading.
The median is also insensitive to skew, which is often cited as a good thing, but really that's something you should determine on a case-by-case basis. In many (most?) applications, there's no tangible benefit in having a measure of center that ignores skew. The median's insensitivity also creates strange situations where subgroups in the population end up unrepresented (e.g. if the poorest 20% become even poorer because of changes to the tax code, the median doesn't budge). In general, the median conveys remarkably little information. It's great at showing the center, but it gives zero indication of anything else. The mean certainly has its issues, but it's far from inferior. Use the median when the situation calls for it, certainly, but realize that it's just as limited as any other measure, if not more so.
> when people recommend the median over the mean to describe center. The median, on its own, does not describe what is "typical" any more than the mean does; it just has a small advantage
I think this swings to the other extreme in selling the 'median' short. As long as we agree that it is only strictly meaningful to talk about the 'center' for symmetric distributions, the median does a fine job.
In fact, in many realistic scenarios it does a far better job than the mean. The main trouble is that the normal (Gaussian) distribution is nowhere near as ubiquitous as it seems, nor is the CLT as universal as it is made out to be. Gauss sort of got away with it: he discovered neither the distribution nor the associated CLT.
Much real data of day-to-day consequence has heavy tails, and the mean is a pathetic measure of 'central tendency' for these. The mean is particularly sensitive to outliers. The median does significantly better than the mean in this non-academic situation. One could do better than the median for symmetric heavy-tailed data (for example, with trimmed means), but it's the mean that I find guilty of entirely disproportionate fame. If it's necessary to exaggerate the median's merits a bit to get people to grow up beyond the pervasive Normal / Gaussian fetish, I am all for it.
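For anyone who hasn't met a trimmed mean, here is a minimal sketch. The `trimmed_mean` helper and the sample values are made up for illustration, and dropping 10% from each tail is an arbitrary choice, not a recommendation:

```python
from statistics import mean, median

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the points from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of points dropped per tail
    return mean(ordered[k:len(ordered) - k])

# Hypothetical sample: nine ordinary values and one extreme one.
sample = [9, 10, 10, 11, 11, 12, 12, 13, 14, 500]

print("mean:        ", mean(sample))
print("median:      ", median(sample))
print("trimmed mean:", trimmed_mean(sample, 0.1))
```

The single extreme value drags the mean far from the bulk of the data, while the median and the trimmed mean both stay near it.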
No single number is going to characterize what is 'typical'. One really needs the CDF here, and yes, avoid estimating densities as much as possible.
BTW, the claim about always mapping to a real observation is not true; you only get a 50% chance of that.
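By "the CDF" I mean something like the empirical CDF, which needs no binning or bandwidth choice. A rough sketch, where the `ecdf` helper and the sample are hypothetical:

```python
from bisect import bisect_right

def ecdf(values):
    """Return F, where F(x) is the fraction of observations <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted observations are <= x
        return bisect_right(ordered, x) / n

    return F

# Hypothetical sample with one large value.
F = ecdf([1, 2, 2, 3, 10])
print(F(2))   # fraction of observations at or below 2
print(F(10))  # fraction of observations at or below the maximum
```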
The median is worse than the mean in a skewed distribution if you want to take the skew into account. In fact one of the strengths of the mean is that it is sensitive to changes in the entire sample, rather than only part of it. To repeat my previous example, if the lowest 20% of household incomes drop because of changes to the tax code, that's something I'd generally want reflected in my "household income" statistic. We can't get too hung up on the semantics of central tendency here--there are many good ways to measure center, and it's clear that all of these statistics actually measure very different things even though they're all grouped into the same category of statistic.
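To make the tax-code example concrete, here is a small sketch with made-up numbers (the ten incomes and the 50% cut are assumptions for illustration, not real data):

```python
from statistics import mean, median

# Hypothetical household incomes (in thousands), already sorted.
before = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Assumed change: the lowest 20% of households (two of ten) lose half their income.
cutoff = len(before) // 5
after = [x / 2 if i < cutoff else x for i, x in enumerate(before)]

print("median before/after:", median(before), median(after))
print("mean before/after:  ", mean(before), mean(after))
```

The median is identical in both cases because the middle of the ordering is untouched; only the mean registers the change.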
When people say the median is "better" than the mean, they often can't explain exactly why (other than to cite skew, which I've addressed). It's really just a gut feeling based on how they felt when they first learned that the median was "more realistic" for some specific set of data that was skewed. They never thought through exactly why or when you'd want to disregard the skew in the first place. I would be insane to argue that the median is never the right statistic to use; my point is merely that it shouldn't be given preference over the mean, and that people need to think about what they're actually trying to show.
The mean is not only useful for the normal distribution, and since the article specifically avoids jumping into inference territory, talking about specific distributions at all is kind of putting the cart before the horse. The "disproportionate fame" of the mean is inherited from its extreme importance throughout all of statistical inference. It's true that if you ignore inference, the mean isn't particularly important as a descriptive statistic, but my argument is that it's certainly no worse than the median.
> one of the strengths of the mean is that it is sensitive to changes in the entire sample, rather than only part of it.
Yes, but this is exactly what is bad about it, at least for the majority of practical applications that I can think of. For instance, suppose you want to measure the "average" income of a population. If there are a handful of uber-rich people, they will really raise the mean, even if the vast majority of people make much less than the mean. The median will do a much better job of telling you what a "typical" person makes.
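A quick sketch of what I mean, using invented numbers (99 people on the same modest income plus one very rich person):

```python
from statistics import mean, median

# Hypothetical incomes: 99 people earning 40,000 and one earning 50,000,000.
incomes = [40_000] * 99 + [50_000_000]

avg = mean(incomes)
mid = median(incomes)
# Fraction of people who earn less than the mean.
share_below_mean = sum(x < avg for x in incomes) / len(incomes)

print("mean:  ", avg)
print("median:", mid)
print("share earning less than the mean:", share_below_mean)
```

In this toy population almost everyone earns far less than the "average" income.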
I really believe that when most people read or talk about "average", they mentally interpret it as the median. And that the median is generally less misleading.
Sure, there are applications where sensitivity to changes at the tails is important. But even then, mean is misleading and difficult to interpret. In your earlier example, if you raised the income of the bottom 10% by 100%, then just use that as a statistic. "The poorest 10% of people now make twice as much money" is much more informative and impressive than "the average income increased by 0.1%."
> When people say the median is "better" than the mean, they often can't explain exactly why (other than to cite skew, which I've addressed).
Well, 40~50 years of literature on robust statistics happens to disagree with the claim that 'people' don't know what they mean (pun unintended) when they say the median is better. Furthermore, skew has less to do with that argument than heavy tails do. Heavy tails are extremely (ok sorry, now it's an insider pun) common.
I agree with the article in that if you want to take the skew into account, you look at the histogram. As a single number, I don't think the mean tells you anything about the skew/tail of the data any better than the median does.
It's a better than 50% chance - if the sample size is odd, you always get a real observation; if the sample size is even, you may still get a real observation (if the two median observations are equal).
But I think the author was careful to sprinkle caveats so as to avoid universal recommendations. Rather, it takes on a few common abuses.
About the mean vs. median, it might be true that "mean" is appropriate in just as many contexts as "median", but at least in my opinion, people cite a "mean" when a "median" would have been appropriate more frequently than the reverse. There are times when neither is appropriate, but if you're using the median, you're more likely to understand the merits of different contexts.
I agree with the article as a whole. I just personally believe that the median is generally worse to use than the mean, so I believe that if you're going to recommend the median over the mean, you should provide guidelines on when to do so. I know not everyone agrees with me on this (as evidenced by some responses in these comments) but that's the perspective I'm coming from.
> it might be true that "mean" is appropriate in just as many contexts as "median", but at least in my opinion, people cite a "mean" when a "median" would have been appropriate more frequently than the reverse.
Sure, but this is only true because the mean is used so much more. If the whole world were instead taught in grade school to use nothing but the median, there would be just as many (I'd argue more) misuses as there are now, except this time the median would be the offender. People using the median are less likely to be using it inappropriately because it's the nonstandard option, but that's only because they're not using it blindly. If people started using the median just because, it would suffer the same issue.
> The median [...] just has a small advantage in that it will always map a real observation, so for discrete data you don't end up with things like "1.9 legs."
If you listen to someone like Taleb, the main advantage they mention for the median is that it's a more robust statistic. For fat-tailed distributions the average can jump all over as new data come in.
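You can see the jumping around for yourself with a small simulation. This is only a sketch: the Pareto shape of 1.1, the seed, and the `running_summaries` helper are my own choices for illustration, not anything Taleb prescribes.

```python
import random
from statistics import fmean, median

def running_summaries(data, step=1):
    """Mean and median of the first k points, for k = step, 2*step, ..."""
    checkpoints = range(step, len(data) + 1, step)
    return [(k, fmean(data[:k]), median(data[:k])) for k in checkpoints]

# Pareto with shape 1.1: the mean exists but the variance is infinite,
# so occasional huge draws keep moving the running mean.
rng = random.Random(0)
draws = [rng.paretovariate(1.1) for _ in range(10_000)]

for k, m, med in running_summaries(draws, step=2_000):
    print(f"n={k:>6}  mean={m:10.2f}  median={med:6.2f}")
```

Try a few different seeds and compare how much each column moves between checkpoints.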
> In general, the median conveys remarkably little information. It's great at showing the center, but it gives zero indication of anything else.
The median also has zero ability to make me coffee in the mornings, but I don't think I can hold that against it.
I can imagine situations where getting people to track or listen to even a single number is tough and using the mean as a measure of both central tendency and the stability of the distribution over time might be the least worst option. But is that really a common problem? Have you often encountered situations where it was impossible to communicate something like "the typical customer buys five widgets but more than 1 in 4 of our customers only buy one" because it contains two statistics and management insists on being briefed with just one?
Clearly the best approach is to directly answer the question at hand with the most relevant numbers/statistics available. I'm all for that. I'd argue you'd still want to use the mean more than the median, but it's not really important at that point because you're painting the most complete picture you can.
There are times that, for whatever reason, someone is only presenting one statistic. News headlines are a big one. In this case, you clearly want to pick the "best" statistic available. There are people who think that the median is categorically better than the mean and dispense advice as such. (You can find them saying things like "the mean is worthless, the median would be much better" in the comments of discussion boards.) That's not what this article did, but the author did imply (in its title and language, if nothing else) that the median is better than the mean without offering any sort of weighing mechanism for which to choose. That's what I'm responding to, because it perpetuates the median > mean myth that's prevalent among certain groups of people.
Is it possible to sensibly generalize the concept of median to "higher orders"?
E.g., I can imagine the difference between the 25th and 75th percentile to be descriptive of spread (like standard deviation), but those numbers seem arbitrary.
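For reference, that difference is usually called the interquartile range, and another median-based spread measure is the median absolute deviation. A minimal sketch of both using Python's `statistics` module (the `iqr` and `mad` helper names are mine, and `quantiles` uses its default interpolation method, which is one convention among several):

```python
from statistics import median, quantiles

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, Q2, Q3
    return q3 - q1

def mad(values):
    """Median absolute deviation from the median."""
    values = list(values)
    center = median(values)
    return median(abs(x - center) for x in values)

# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 100]
print("IQR:", iqr(data))
print("MAD:", mad(data))
```

Neither answers whether the 25/75 cut points are arbitrary, but both show that spread can be described with order statistics alone.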
> The median, on its own, does not describe what is "typical" any more than the mean does; it just has a small advantage in that it will always map a real observation
Only if you have an odd number of data points or the two middling data points have the same value. Example: I have four people with these sizes in cm: 120, 160, 180, 200. The median would be (160+180)/2 = 170, which is the size of none of the people.
Yes that's true. I was being somewhat charitable in affording it that benefit, but I did so because the article is the one that brought up the 1.9 legs example so that was kind of an implied benefit. And it's also true that you can use a "median" that picks one of the two, instead of splitting them, without any real downside if you want to preserve the feasibility of the statistic.
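As a sketch of that last point, Python's `statistics` module happens to ship both variants, so the four-person example from the parent comment looks like this:

```python
from statistics import median, median_low, median_high

# Heights in cm from the example above.
heights_cm = [120, 160, 180, 200]

print(median(heights_cm))       # averages the two middle values
print(median_low(heights_cm))   # picks the lower of the two middle values
print(median_high(heights_cm))  # picks the higher of the two middle values
```

Only the low and high variants are guaranteed to return a value that actually occurs in the data.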
With a small number of categories a full summary of the data easily fits into text. It might still be desirable to report a summary statistic, but omitting the totals for 2 or 3 categories and reporting a summary statistic is pretty sloppy.