> It also includes moving the goal posts: that is, mining the data for results first, and then writing the paper as if the experiment had been an attempt to find just those effects. “You have exploratory findings, and you’re pitching them as ‘I knew this all along,’ as confirmatory,” Dr. Nosek said.
Why is this a problem? If the experiment's design is not in conflict with the new findings, why complain?
Let's say I flip a million different fair coins 20 times each. Then I analyze my findings and see that coin #54 came up heads all 20 times. I present my results to my peers saying, "Coin #54 is defective! The chances of this happening by chance are one in a million!" But flipping that many coins, I'd expect about one of them to come up heads every time. It just happened to be coin #54.
The problem is that our statistical analysis no longer matches what we actually did: we went looking for findings, then tested hypotheses based on those very findings as if we'd picked them in advance.
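To make that concrete, here's a quick simulation (my sketch in Python, using the same made-up numbers as above):

```python
import numpy as np

rng = np.random.default_rng(0)

n_coins, n_flips = 1_000_000, 20

# Flip a million fair coins 20 times each; count heads per coin.
heads = rng.binomial(n_flips, 0.5, size=n_coins)

all_heads = np.sum(heads == n_flips)
print(f"coins that came up heads all 20 times: {all_heads}")

# (1/2)^20 is about 1 in 1,048,576, so with a million coins we expect
# roughly one "miraculous" coin even though every coin is fair.
print(f"expected count: {n_coins * 0.5**n_flips:.2f}")
```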
Two-up ("Come in, spinner!") is a coin-flipping gambling game where people bet on whether a punter will come up double-heads or double-tails. The house makes it's money when neither come up five times in a row (with some variations).
Would using Bayesian statistics fix that? If the prior for a coin being defective so that it produces only heads is more than one in a million, then coin #54 is probably defective. If it's less than that, then your analysis doesn't conclude it's defective anyway.
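To put rough numbers on that (a sketch, assuming the only alternative to "fair" is a coin that always lands heads):

```python
# Posterior odds = prior odds * Bayes factor.
# Data: 20 heads in 20 flips of one coin.
# Bayes factor for "always-heads defect" vs. "fair" is 1 / (1/2)^20.
bayes_factor = 2 ** 20  # ~1.05 million

for prior in (1e-5, 1e-6, 1e-7):  # hypothetical priors for "defective"
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * bayes_factor
    post_prob = post_odds / (1 + post_odds)
    print(f"prior {prior:.0e} -> posterior P(defective) = {post_prob:.3f}")

# The conclusion flips right around a prior of one in a million,
# which is exactly the threshold described above.
```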
Bayes can be hacked by omitting info, but not by things like this, I believe.
It's not a Bayesian-versus-NHST thing; it's just about doing the right test. You can handle this kind of thing easily enough in an NHST way. But yeah, I think doing it in a Bayesian manner is much easier to interpret and explain, and so requires less training, which is a HUGE boon for the crisis at hand.
You can do the test with NHST only if you know what the researcher had in mind when experimenting. That can lead to absurd results at times, such as your example. With Bayes, as long as no lies are told, your inference doesn't depend on the researcher's private thoughts.
You only need to know how the experiment was performed, same as you do for Bayes. If they present the whole dataset, nothing is changed. If they cherry-pick and only show the data for the coin that happened to be all heads, saying that was their whole experiment, no amount of stats will help you, Bayesian or otherwise.
No, if the coins are independent, the Bayesian is not fooled even when seeing only that coin. The argument was in my post above.
Bayesian inference depends only on the data, and the flips of the other coins don't make a difference if they're independent. Frequentist testing can depend on things like stopping rules and which hypothesis was tested, which aren't correlated with the actual truth and therefore should have zero effect on inference.
They can cherry-pick and only show some flips of that coin, but then they really need to be outright lying, or you'll ask why only some flips were reported.
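The stopping-rule point deserves a concrete example. The classic one (my sketch, not from anyone's post here): the same 9 heads and 3 tails yields different p-values depending on whether the plan was "flip 12 times" or "flip until the third tail", while the likelihood, and hence the Bayesian answer, is identical either way.

```python
from scipy.stats import binom, nbinom

heads, tails = 9, 3  # the observed data

# Design A: n = 12 flips was fixed in advance.
# One-sided p-value: P(>= 9 heads out of 12 | fair coin)
p_binomial = binom.sf(heads - 1, heads + tails, 0.5)

# Design B: the plan was to flip until the 3rd tail appeared.
# One-sided p-value: P(>= 9 heads before the 3rd tail | fair coin)
p_negbinom = nbinom.sf(heads - 1, tails, 0.5)

print(f"fixed-n design:   p = {p_binomial:.4f}")  # ~0.073
print(f"stop-at-3-tails:  p = {p_negbinom:.4f}")  # ~0.033

# The likelihood of any value of p is proportional to p^9 * (1-p)^3
# under BOTH designs, so a Bayesian posterior doesn't care which
# stopping rule the researcher had in mind.
```

One design clears the conventional 0.05 bar and the other doesn't, on exactly the same data.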
imh's explanation is good, but it's also worth considering a very common example of this. Consider two groups of people where one group engages in some activity and another does not. Measure a dozen things about those people. Then report only on the one that positively (or negatively) correlates with group membership as if it was the only thing you measured.
This generates an unrealistically high statistical significance (an unrealistically low p-value), because the question should not be "What are the odds these two things are correlated by chance?" but "What are the odds that any one of these things is correlated by chance with this other thing?"
The real significance of the result is far weaker than the reported p-value suggests, because that analysis assumes only one thing was investigated. You frequently see incorrect p-values of this kind even when the researchers acknowledge they tested for multiple things.
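A quick simulation (Python sketch, group sizes made up) shows how bad this gets with a dozen measurements and no real effect anywhere:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_per_group, n_measures, n_studies = 30, 12, 5000

false_positives = 0
for _ in range(n_studies):
    # Two groups, 12 measured variables, no true difference anywhere.
    a = rng.normal(size=(n_per_group, n_measures))
    b = rng.normal(size=(n_per_group, n_measures))
    pvals = ttest_ind(a, b, axis=0).pvalue
    # "Report only the one that correlates with group membership."
    if pvals.min() < 0.05:
        false_positives += 1

print(f"studies with at least one p < 0.05: {false_positives / n_studies:.2f}")
# Roughly 1 - 0.95**12 ~ 0.46, not the 0.05 the reported p-value implies.
```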
Because it conflates fit and prediction. If I take a graph of the DJIA over the 20th century and then fit a polynomial to it, it's intellectually dishonest to turn around and claim that I have a predictive model of the stock market. I have a fitted function, nothing more.
This is similar to the Texas sharpshooter fallacy. If I shoot at the side of a barn and then draw a bullseye around wherever I hit, I can pretend to be a good shot.
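Here's the fit-versus-prediction gap in miniature (a sketch; a simulated random walk stands in for the DJIA):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)

# A simulated random walk stands in for an index price series.
t = np.arange(120.0)
prices = 100 + np.cumsum(rng.normal(size=t.size))

train, test = slice(0, 100), slice(100, 120)

# "Fit a polynomial to it": degree-15 fit on the first 100 points.
model = Polynomial.fit(t[train], prices[train], deg=15)

in_sample_err = np.abs(model(t[train]) - prices[train]).mean()
out_sample_err = np.abs(model(t[test]) - prices[test]).mean()

print(f"mean abs error, in sample:      {in_sample_err:.2f}")
print(f"mean abs error, next 20 steps:  {out_sample_err:.2f}")
# The curve hugs the history it was tuned to and usually goes wild
# the moment it has to extrapolate: a fitted function, not a forecast.
```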
Because if one takes a large dataset with many variables and starts mining it for correlations, it's roughly equivalent to trying to get a random number generator to produce numbers in some specific pattern: the more numbers you draw, the likelier you are to find it.
This is the reason experiment repeatability should be considered of critical importance, especially for 'gooey' disciplines like medical and psychological research, where there is little more than statistics to go on. Was the experiment just mining random noise for correlations? 'Simple': repeat the experiment. If the results can't be repeated, it's more likely the original authors just botched the experiment, massaged their data to look the way they wanted, or simply got lucky with a random number sequence.
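Here's a sketch of why a replication catches this kind of noise-mining (pure noise data, made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_vars = 50, 100

def strongest_correlation(X, y):
    """Return the index and value of the variable most correlated with y."""
    corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    best = np.argmax(np.abs(corrs))
    return best, corrs[best]

# "Original study": mine 100 pure-noise variables for the best correlation.
X1, y1 = rng.normal(size=(n, n_vars)), rng.normal(size=n)
best, r1 = strongest_correlation(X1, y1)
print(f"mined variable {best}: r = {r1:+.2f}")   # often looks impressive

# "Replication": fresh data, but we test only the variable we already picked.
X2, y2 = rng.normal(size=(n, n_vars)), rng.normal(size=n)
r2 = np.corrcoef(X2[:, best], y2)[0, 1]
print(f"same variable in the replication: r = {r2:+.2f}")  # back to noise
```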