> It also includes moving the goal posts: that is, mining the data for results first, and then writing the paper as if the experiment had been an attempt to find just those effects. “You have exploratory findings, and you’re pitching them as ‘I knew this all along,’ as confirmatory,” Dr. Nosek said.
Why is this a problem? If the experiment's design is not in conflict with the new findings, why complain?
Let's say I flip a million different fair coins 20 times each. Then I analyze my findings and see that coin #54 came up heads all 20 times. I present my results to my peers saying, "Coin #54 is defective! The chances of this happening by chance are one in a million!" But flipping that many coins, I'd expect about one of them to come up heads every time. It just happened to be coin #54.
The problem is that our statistical analysis no longer matches what we actually did: we went looking for findings, then tested hypotheses based on those very findings as if we'd picked them in advance.
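To make that concrete, here's a quick simulation (my sketch in Python, using the same made-up numbers as above):

```python
import numpy as np

rng = np.random.default_rng(0)

n_coins, n_flips = 1_000_000, 20

# Flip a million fair coins 20 times each; count heads per coin.
heads = rng.binomial(n_flips, 0.5, size=n_coins)

all_heads = np.sum(heads == n_flips)
print(f"coins that came up heads all 20 times: {all_heads}")

# (1/2)^20 is about 1 in 1,048,576, so with a million coins we expect
# roughly one "miraculous" coin even though every coin is fair.
print(f"expected count: {n_coins * 0.5**n_flips:.2f}")
```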
Two-up ("Come in, spinner!") is a coin-flipping gambling game where people bet on whether a punter will come up double-heads or double-tails. The house makes it's money when neither come up five times in a row (with some variations).
Would using Bayesian statistics fix that? If the prior for a coin being defective so that it produces only heads is more than one in a million, then coin #54 is probably defective. If it's less than that, then your analysis doesn't conclude it's defective anyway.
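To put rough numbers on that (a sketch, assuming the only alternative to "fair" is a coin that always lands heads):

```python
# Posterior odds = prior odds * Bayes factor.
# Data: 20 heads in 20 flips of one coin.
# Bayes factor for "always-heads defect" vs. "fair" is 1 / (1/2)^20.
bayes_factor = 2 ** 20  # ~1.05 million

for prior in (1e-5, 1e-6, 1e-7):  # hypothetical priors for "defective"
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * bayes_factor
    post_prob = post_odds / (1 + post_odds)
    print(f"prior {prior:.0e} -> posterior P(defective) = {post_prob:.3f}")

# The conclusion flips right around a prior of one in a million,
# which is exactly the threshold described above.
```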
Bayes can be hacked by omitting info, but not by things like this, I believe.
It's not a Bayesian-versus-NHST thing; it's just about doing the right test. You can handle this kind of thing easily enough in an NHST way. But yeah, I think doing it in a Bayesian manner is much easier to interpret and explain, and so requires less training, which is a HUGE boon for the crisis at hand.
You can do the test with NHST only if you know what the researcher had in mind when experimenting. That can lead to absurd results at times, such as your example. With Bayes, as long as no lies are told, your inference doesn't depend on the researcher's private thoughts.
You only need to know how the experiment was performed, same as you do for Bayes. If they present the whole dataset, nothing is changed. If they cherry-pick and only show the data for the coin that happened to be all heads, saying that was their whole experiment, no amount of stats will help you, Bayesian or otherwise.
No, if the coins are independent, the Bayesian is not fooled even when seeing only that coin. The argument was in my post above.
Bayesian inference depends only on the data, and the flips of the other coins don't make a difference if they're independent. Frequentist testing can depend on things like stopping rules and which hypothesis was tested, which aren't correlated with the actual truth and therefore should have zero effect on inference.
They can cherry-pick and only show some flips of that coin, but then they really need to be outright lying, or you'll ask why only some flips were reported.
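The stopping-rule point deserves a concrete example. The classic one (my sketch, not from anyone's post here): the same 9 heads and 3 tails yields different p-values depending on whether the plan was "flip 12 times" or "flip until the third tail", while the likelihood, and hence the Bayesian answer, is identical either way.

```python
from scipy.stats import binom, nbinom

heads, tails = 9, 3  # the observed data

# Design A: n = 12 flips was fixed in advance.
# One-sided p-value: P(>= 9 heads out of 12 | fair coin)
p_binomial = binom.sf(heads - 1, heads + tails, 0.5)

# Design B: the plan was to flip until the 3rd tail appeared.
# One-sided p-value: P(>= 9 heads before the 3rd tail | fair coin)
p_negbinom = nbinom.sf(heads - 1, tails, 0.5)

print(f"fixed-n design:   p = {p_binomial:.4f}")  # ~0.073
print(f"stop-at-3-tails:  p = {p_negbinom:.4f}")  # ~0.033

# The likelihood of any value of p is proportional to p^9 * (1-p)^3
# under BOTH designs, so a Bayesian posterior doesn't care which
# stopping rule the researcher had in mind.
```

One design clears the conventional 0.05 bar and the other doesn't, on exactly the same data.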
imh's explanation is good, but it's also worth considering a very common example of this. Consider two groups of people where one group engages in some activity and another does not. Measure a dozen things about those people. Then report only on the one that positively (or negatively) correlates with group membership as if it was the only thing you measured.
This generates an unrealistically high statistical significance (an unrealistically low p-value), because the question should not be "What are the odds these two things are correlated by chance?" but "What are the odds that any one of these things is correlated by chance with this other thing?"
The real significance of the result is far weaker than the reported p-value suggests, because that analysis assumes only one thing was investigated. You frequently see incorrect p-values of this kind even when the researchers acknowledge they tested for multiple things.
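A quick simulation (Python sketch, group sizes made up) shows how bad this gets with a dozen measurements and no real effect anywhere:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_per_group, n_measures, n_studies = 30, 12, 5000

false_positives = 0
for _ in range(n_studies):
    # Two groups, 12 measured variables, no true difference anywhere.
    a = rng.normal(size=(n_per_group, n_measures))
    b = rng.normal(size=(n_per_group, n_measures))
    pvals = ttest_ind(a, b, axis=0).pvalue
    # "Report only the one that correlates with group membership."
    if pvals.min() < 0.05:
        false_positives += 1

print(f"studies with at least one p < 0.05: {false_positives / n_studies:.2f}")
# Roughly 1 - 0.95**12 ~ 0.46, not the 0.05 the reported p-value implies.
```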
Because it conflates fit and prediction. If I take a graph of the DJIA over the 20th century and then fit a polynomial to it, it's intellectually dishonest to turn around and claim that I have a predictive model of the stock market. I have a fitted function, nothing more.
This is similar to the Texas sharpshooter fallacy. If I shoot at the side of a barn and then draw a bullseye around wherever I hit, I can pretend to be a good shot.
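Here's the fit-versus-prediction gap in miniature (a sketch; a simulated random walk stands in for the DJIA):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)

# A simulated random walk stands in for an index price series.
t = np.arange(120.0)
prices = 100 + np.cumsum(rng.normal(size=t.size))

train, test = slice(0, 100), slice(100, 120)

# "Fit a polynomial to it": degree-15 fit on the first 100 points.
model = Polynomial.fit(t[train], prices[train], deg=15)

in_sample_err = np.abs(model(t[train]) - prices[train]).mean()
out_sample_err = np.abs(model(t[test]) - prices[test]).mean()

print(f"mean abs error, in sample:      {in_sample_err:.2f}")
print(f"mean abs error, next 20 steps:  {out_sample_err:.2f}")
# The curve hugs the history it was tuned to and usually goes wild
# the moment it has to extrapolate: a fitted function, not a forecast.
```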
Because if one takes a large dataset with many variables and starts mining it for correlations, it's roughly equivalent to trying to get a random number generator to produce numbers in some specific pattern: the more numbers you draw, the likelier you are to find it.
This is the reason experiment repeatability should be considered of critical importance, especially for 'gooey' disciplines like medical and psychological research, where there is little more than statistics to go on. Was the experiment just mining random noise for correlations? 'Simple': repeat the experiment. If the results can't be repeated, it's more likely the original authors just botched the experiment, massaged their data to look the way they wanted, or simply got lucky with a random number sequence.
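Here's a sketch of why a replication catches this kind of noise-mining (pure noise data, made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_vars = 50, 100

def strongest_correlation(X, y):
    """Return the index and value of the variable most correlated with y."""
    corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    best = np.argmax(np.abs(corrs))
    return best, corrs[best]

# "Original study": mine 100 pure-noise variables for the best correlation.
X1, y1 = rng.normal(size=(n, n_vars)), rng.normal(size=n)
best, r1 = strongest_correlation(X1, y1)
print(f"mined variable {best}: r = {r1:+.2f}")   # often looks impressive

# "Replication": fresh data, but we test only the variable we already picked.
X2, y2 = rng.normal(size=(n, n_vars)), rng.normal(size=n)
r2 = np.corrcoef(X2[:, best], y2)[0, 1]
print(f"same variable in the replication: r = {r2:+.2f}")  # back to noise
```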