Rule of five: Take five samples. You can be 93% confident the median is between ...

chrisseaton · on May 23, 2020

> Works for any distribution (not just normal).

How can this possibly be true?

If I take five samples of the speed of my car, and I always take them while the car is just setting off, it's never going to be anywhere near the median speed over a twelve hour drive.

I feel like there must be a huge list of extra constraints and caveats you aren't mentioning.

(I'm really bad at stats - genuinely asking.)

shoo · on May 23, 2020

There would at least be an assumption that the samples are drawn independently from a single distribution, and the estimate of the median is for that same distribution.

In the example you gave, the samples you draw are from the distribution of (the speed of your car just as it is setting off). There would be no guarantee that the estimated median has any relationship for any other distribution, such as (speed of your car during a trip). If you want to estimate the latter you'd need to figure out a way to draw random samples from that distribution.

chrisseaton · on May 23, 2020

Can you do this with a source of infinite samples? If every time I take a sample it's slightly higher, does this still hold?

JJzD · on May 24, 2020

I'm not good with stats, but five increasing measurements has two options: A. You've hit an unlikely coincidence, and you're fine B. You're not really randomly drawing from the same distribution.

Which scenario seems more likely ;)

shoo · on May 24, 2020

> If every time I take a sample it's slightly higher, does this still hold?

I don't know enough stats to give a firm answer, but I'd reckon there is a key assumption that the samples need to be drawn i.i.d. from a single underlying probability distribution, or perhaps need to satisfy the related assumption of being exchangeable.

https://en.wikipedia.org/wiki/Independent_and_identically_di...

https://en.wikipedia.org/wiki/Exchangeable_random_variables

In your example of a sequence of samples that increase, they're certainly not exchangeable. I think they're not independent either.

E.g. thought experiment to give a concrete version of your example, where we define it so there's no randomness at all to make it easier to think about : let's suppose an idealised situation where we launch a space probe that travels away from the earth at 15 km / second. Suppose we have some way of measuring the distance d(t) that probe is from earth at some time t after launch. Regard each distance measurement d(t) as a sample. Let's assume we take 5 samples by measuring the distance every 10 seconds after launch. So t_1=10s, ..., t_5=50s, and d(t_1)=150km, ..., d(t_5)=750km.

The sequence of distance samples d(t_1), d(t_2), d(t_3), d(t_4), d(t_5) is not exchangeable as if we exchange two samples like d(t_2) <-> d(t_4), the permuted sequence d(t_1), d(t_4), d(t_3), d(t_2), d(t_5) corresponds to the situation: "at 10 seconds the probe was 150 away, at 20 seconds the probe was 600 km away, at 30 seconds the probe was 450 km away, at 40 seconds the probe was 300 km away, at 50 seconds the probe was 750 km away" -- the probability of observing that outcome is an awful lot lower -- based on our understanding of how physics of the situation work in this idealised example -- than the probability of observing the outcome from the original sequence (this is pretty sloppy as I am not clearly distinguishing between observed values and random variables, but hopefully it gives some vague intuition).

So if you want to estimate the median distance of the probe from the earth from 5 samples, you roughly need to take 5 measurements at 5 times chosen to be uniformly at random from the entire period you are interested in. E.g. if you want to estimate the median distance of the probe from the earth during the first 10 years of travel, you need to draw 5 samples from 5 different times sampled from the uniform distribution over the period [0 seconds, 10 years]. Then the resulting estimated median distance would only apply for the distance of the probe during that time period, it would not be an estimate that could be applied for any different time period.

chrisseaton · on May 24, 2020

> a key assumption that the samples need to be drawn i.i.d. from a single underlying probability distribution, or perhaps need to satisfy the related assumption of being exchangeable

Unfortunately this kind of key assumption is rarely made explicit when teaching people stats. I see research papers all the time making this assumption where it clearly isn't warranted - such as in benchmarking a computer.

qznc · on May 23, 2020

From Douglas W. Hubbard, How to Measure Anything (3rd ed.) via https://lobste.rs/s/kk89vp/back_envelope_estimation_hacks#c_...

> There is a 93.75% chance that the median of a population is between the smallest and largest values in any random sample of five from that population.

> It might seem impossible to be 93.75% certain about anything based on a random sample of just five, but it works. To understand why this method works, it is important to note that the Rule of Five estimates only the median of a population. Remember, the median is the point where half the population is above it and half is below it. If we randomly picked five values that were all above the median or all below it, then the median would be outside our range. But what is the chance of that, really?

> The chance of randomly picking a value above the median is, by definition, 50%—the same as a coin flip resulting in “heads.” The chance of randomly selecting five values that happen to be all above the median is like flipping a coin and getting heads five times in a row. The chance of getting heads five times in a row in a random coin flip is 1 in 32, or 3.125%; the same is true with getting five tails in a row. The chance of not getting all heads or all tails is then 100% − 3.125% × 2, or 93.75%. Therefore, the chance of at least one out of a sample of five being above the median and at least one being below is 93.75% (round it down to 93% or even 90% if you want to be conservative).

chrisseaton · on May 23, 2020

Right so random samples uniformly taken across the whole set of a finite set of independent samples is the first of the extra constraints you didn't mention... which makes it not useful for many real-world computer-science applications like benchmarking, where you often have samples that are inter-dependent and from an infinite set so you can't sample them uniformly.

2019-nCoV · on May 24, 2020

You don't understand convexity. You don't walk across a river that is 4ft deep on average.