What's often missing from these introductions is when statistics will not work; and what it even means when it "works". The amount of data needed to tell between two normal distributions is about 30 data points -- between two power-law distributions, >trillion. (And this basically scuppers the central limit theorem, on which a lot of cargo-cult stats is justified.)
Stats, imv, should be taught simulation-first: code up your hypotheses and see if they're even testable. Many, many projects would immediately fail at the research stage.
Next, know that predictions are almost never a good goal. Almost everything is practically unpredictable -- with a near-infinite number of relevant causes, most of them uncontrollable.
At best, in ideal cases, you can use stats to model a distribution of predictions and then determine a risk/value across that range. I.e., the goal isn't to predict anything but to prescribe some action (or inference) according to a risk tolerance (risk of error, or financial risk, etc.).
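As a rough sketch of what "prescribe an action under a risk tolerance" can look like in code (all the numbers and the loss function below are invented for illustration; Python with numpy assumed):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical predictive distribution for next month's demand:
    # the model gives a spread of outcomes, not a single number.
    demand = rng.lognormal(mean=5.0, sigma=0.6, size=100_000)

    # Invented asymmetric costs: overstocking costs 1 per unit,
    # understocking costs 4 per unit (the "risk tolerance").
    def expected_loss(stock, demand, over_cost=1.0, under_cost=4.0):
        over = np.maximum(stock - demand, 0.0)
        under = np.maximum(demand - stock, 0.0)
        return (over_cost * over + under_cost * under).mean()

    # Prescribe an action: pick the stock level with the lowest expected loss.
    candidates = np.linspace(demand.min(), demand.max(), 500)
    losses = [expected_loss(s, demand) for s in candidates]
    best = candidates[int(np.argmin(losses))]

    print(f"point prediction (mean demand): {demand.mean():.0f}")
    print(f"prescribed stock under 4:1 costs: {best:.0f}")

The point prediction and the chosen action usually differ, because asymmetric costs push you toward a quantile of the distribution rather than its mean.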
It seems a generation of people have half-learned bits of stats, glued them together, and created widespread 'statistical cargo-cultism'.
The lesson of stats isn't hypothesis testing, but how almost no hypotheses are testable -- and then what do you do?
"Simulation first" is how I did things when I worked in data science and bioinformatics. Define the simulation that represents "random", then see how far off the actual data is using either information theory or just a visual examination of the data and summary statistic checks. That's a fast and easy way to gut check any observation to see if there is an underlying effect, which you can then "prove" using a more sophisticated analysis.
Raw hypothesis testing is just too easy to juke by overwhelming it with trials. Lots of research papers have "statistically significant" results, but give no mention of how many experiments it took to get them, or any indication of negative results. Given enough effort, there will always be an analysis where you incorrectly reject the null hypothesis.
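That failure mode is easy to demonstrate by simulation: run many "experiments" on pure noise and count how many come out "significant". A sketch, assuming numpy and scipy:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    # 100 "experiments", every one run on pure noise: the null is true each time.
    false_positives = 0
    for _ in range(100):
        a = rng.normal(0.0, 1.0, size=30)
        b = rng.normal(0.0, 1.0, size=30)   # drawn from the same distribution as a
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_positives += 1            # a "statistically significant" result

    print(f"{false_positives} of 100 null experiments came out 'significant'")

With a 0.05 threshold you expect roughly 5 false positives per 100 null experiments -- and only the "significant" ones tend to get written up.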
>The amount of data needed to tell between two normal distributions is about 30 data points
What are you trying to say here? If there are two normal distributions, both with variance one, one having mean 0 and the other having mean 100, and I get a single sample from one of the distributions, I can guess which distribution it came from with very high confidence. Where did the number 30 come from?
This really resonates with me. I've attempted self-study of statistics many times, each time wanting to understand the fundamental assumptions that underlie popular statistical methods. When I read the result of a poll or a scientific study, how rigorous are the claimed results, and what false assumptions could undermine them?
I want to build intuitions for how these statistical methods even work, at a high level, before getting drowned in math about all the details. And like you say, I want to understand the boundaries: "when statistics will not work; and what it even means when it 'works'".
I imagine that different methodologies exist on a spectrum, where some give more reliable results, and others are more likely to be noise. I want to understand how to roughly tell the good from the bad, and how to spot common problems.
It's ironic that this ... rant? ... is basically unreadable without knowledge of basic statistical methods.
How do you teach any of this to someone who hasn't already taken introductory statistics? How do you learn anything if you first have to learn the myriad ways something you don't even have a basic working knowledge of can fail before you learn it?
The comment is addressed to the informed reader who is the only one with a hope of being persuaded on this point.
To teach this, from scratch, I think is fairly easy -- but there are few with any incentive to do it. Many in academia wouldn't know how, and if they did, would discover that much of their research can be shown a priori to not be worthwhile (rather than after a decade of 'debate').
All you really need is to start with establishing an intuitive understanding of randomness, how apparently highly patterned it is, and so on. Then ask: how easy is it to reproduce an observed pattern with (simulated) randomness?
That question alone, properly supported via basic programming simulations, will take you extremely far. Indeed, the answer to it is often obvious -- a trivial program.
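For instance, suppose someone is struck by a run of 8 heads somewhere in 200 coin flips (numbers invented for the example). The trivial program, assuming Python/numpy:

    import numpy as np

    rng = np.random.default_rng(3)

    def longest_run_of_heads(flips):
        best = run = 0
        for f in flips:
            run = run + 1 if f == 1 else 0
            best = max(best, run)
        return best

    # How often does plain randomness reproduce the "impressive" pattern?
    trials = 10_000
    hits = sum(
        longest_run_of_heads(rng.integers(0, 2, size=200)) >= 8
        for _ in range(trials)
    )
    print(f"{hits / trials:.1%} of purely random sequences contain a run of 8+ heads")

Plain randomness produces a run that long surprisingly often, which is usually the whole answer.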
That few ever write such programs shows how the whole edifice of stats education is geared towards confirmation bias.
Before computers, stats was either an extremely mathematical discipline seeking (empirically useless) formulas for toy models, or one using heuristic empirical formulas that rarely applied.
Computers basically obviate all of that. Stats is mostly about counting things and making comparisons -- perfect tasks for machines. With only a few high-school mathematical formulas, most people could derive the most useful statistical techniques as simple computer programs.
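For example, a bootstrap confidence interval -- a genuinely useful technique -- falls out of nothing more than resampling and counting (made-up data, numpy assumed):

    import numpy as np

    rng = np.random.default_rng(4)

    # Made-up sample: 25 skewed measurements.
    sample = rng.exponential(scale=3.0, size=25)

    # Bootstrap: resample with replacement, recompute the mean, repeat.
    boot_means = np.array([
        rng.choice(sample, size=sample.size, replace=True).mean()
        for _ in range(10_000)
    ])

    # The 95% interval is just two percentiles of the resampled means.
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    print(f"sample mean {sample.mean():.2f}, bootstrap 95% CI ({lo:.2f}, {hi:.2f})")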
The modern approach, of which this textbook is an example, does start with simulation. In fact there is very little classical statistics (distributions, analytic tests) in the book. The Berkeley Data 8 book, which I link to in another comment, takes the same approach. I imagine there is still too much classical material for your tastes, but there is definitely change happening.
“that much of their research can be shown a priori to not be worthwhile”
Bingo. Cargo cult stats all the way down. It’s not just personal interest, it’s the entire field, it’s their colleagues, mentors, and students. Good luck getting somebody to see the light when it’s not just their own income that depends on not seeing it: their whole world depends on the “stat recipes” handed down from granny.
I think the egotistical aspect is the most powerful: many researchers have built an identity based on the fact that they “know” something, so to propose better alternatives to their pet theories is tantamount to proposing their life is a lie. To change their mind they need to admit they didn’t “know”.
The better the alternatives, the more fierce the passion with which they will be rejected by the mainstream.
I now think it’s best explained by simple economics. Academia and academics are the product of economic forces by and large. It’s not quirky personalities or uniquely talented minds that make up academia today. It’s droves of conscientious (in the Big Five sense) conformists, with either high IQ or mere socio-economic privilege, who have been trained by our society to feel that financial security means college, and even more financial security means even more college. Credentials are like alpha .05: they solve a scale problem in a way that alters the quality/quantity ratio. If you want more researchers/research/science output, credentials and alpha .05 cargo-cult stats are your levers to get more quantity at lower quality.
It seems like a reasonable critique. The suggestion is to include such ideas while people are taking introductory statistics, which isn’t inappropriate. I wouldn’t suggest forcing students to code up their own simulations from scratch, but creating a framework where students can plug in various formulas for each population, attach a statistical test, and then run various simulations could do quite a bit. However, what kinds of formulas students are told to plug in are important.
If every formula produces bell curves, then that’s a failure to educate people. 50d6 vs 50d6 + 1 is easy enough; you can include 1d2 * 50 + 50d6 for a bimodal distribution, but also significantly different distributions which then fail various tests, etc.
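Something like that framework is only a handful of lines. A sketch (numpy and scipy assumed) comparing 50d6 against 50d6 + 1, plus a bimodal 1d2 * 50 + 50d6 population that a t-test alone can't separate from a matched-mean 50d6 + 75:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)

    def d6_sum(n_rolls, n_dice=50):
        # sum of n_dice six-sided dice, repeated n_rolls times
        return rng.integers(1, 7, size=(n_rolls, n_dice)).sum(axis=1)

    a = d6_sum(1_000)                                        # 50d6
    b = d6_sum(1_000) + 1                                    # 50d6 + 1: tiny shift
    c = rng.integers(1, 3, size=1_000) * 50 + d6_sum(1_000)  # 1d2*50 + 50d6: two humps
    d = d6_sum(1_000) + 75                                   # same mean as c, one hump

    print("50d6 vs 50d6+1,         t-test p:", stats.ttest_ind(a, b).pvalue)
    print("50d6+75 vs 1d2*50+50d6, t-test p:", stats.ttest_ind(d, c).pvalue)
    print("50d6+75 vs 1d2*50+50d6, KS p:    ", stats.ks_2samp(d, c).pvalue)

The t-test only sees means, so it happily passes two wildly different shapes; the KS test doesn't -- exactly the kind of thing students should discover by running it.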
I’ve seen people correctly remember the formula for statistical tests from memory and then wildly misapply them. That seems like focusing on the wrong things in an age when such information is at everyone’s fingertips, but understanding of what that information means isn’t.
I work in applied ML and stats. Whenever a client gets pushy about getting a prediction and would not care about quantifying the uncertainty around it, I take it as a signal to disengage and look for better pastures. It is really not worth the time, more so if you value integrity.
Competent stakeholders and decision makers use the uncertainty around predictions -- the chances of an outcome that differs from the point-predicted outcome -- to come to a decision, and the plan includes what the course of action should be if the outcome differs from the prediction.
Model building, at large, is the thing I regret being bad at. Model your problem and then throw inputs at it and see what you can see.
Sucks, as we seem to have taught everyone that statistical models are somehow unique models that can only be made to get a prediction. To the point that we seem to have hard delineations between "predictive" models and other "models".
I suspect there are some decent ontologies there. But, at large, I regret that so many won't try to build a model.
> The amount of data needed to tell between ... two power-law distributions, >trillion.
I don't agree with this as a statement of fact (except in the obvious case of two power-law distributions with extremely close parameters). Supposing it were true, that would mean that you would almost never have to actually worry about the parameter, because unless your dataset is that large, one power law is about as good as any other for describing your data.
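One way to probe the scaling yourself: estimate by simulation the per-sample evidence (KL divergence) between two nearby power-law exponents; the number of samples needed to tell them apart reliably is on the order of its inverse. A sketch, assuming numpy (the exponents here are arbitrary, not an endorsement of the trillion figure):

    import numpy as np

    rng = np.random.default_rng(6)

    a1, a2 = 2.0, 2.1          # two nearby power-law (Pareto) exponents, x_min = 1
    n = 100_000

    # Draw from Pareto(a1) by inverse transform: x = (1 - u) ** (-1 / a1).
    x = (1.0 - rng.random(n)) ** (-1.0 / a1)

    # Per-sample log-likelihood ratio log f1(x) - log f2(x),
    # with f(x; a) = a * x ** -(a + 1) for x >= 1.
    llr = np.log(a1 / a2) + (a2 - a1) * np.log(x)

    # The mean per-sample evidence is the KL divergence; you need very roughly
    # 1/KL samples before the evidence reliably favours the true exponent.
    kl = llr.mean()
    print(f"KL per sample ~ {kl:.2e}; samples needed ~ {1.0 / kl:,.0f}")

With exponents that close you already need on the order of a thousand samples, and because the KL divergence shrinks roughly quadratically as the exponents approach each other, each tenfold narrowing of the gap costs roughly a hundredfold more data. Whether that ever reaches "a trillion" depends entirely on how close the two exponents are.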
Do you have anywhere I can read more about this? I would have assumed that a trillion data points would be sufficient to compare any two real-world distributions
The sample means approach the normal distribution as the sample size grows, even if the underlying distributions are not normal. That's the central limit theorem.
(Requires some very lax assumptions like finite variance on the underlying distribution)
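That caveat is worth seeing rather than taking on faith. A quick sketch (numpy and scipy assumed): sample means of an exponential look normal, sample means of a Pareto with exponent 1.5 (infinite variance) do not:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)

    def sample_means(draw, n_means=10_000, n_per_mean=1_000):
        return draw((n_means, n_per_mean)).mean(axis=1)

    # Finite variance (exponential): the CLT applies, the means look normal.
    exp_means = sample_means(lambda shape: rng.exponential(1.0, size=shape))

    # Infinite variance (Pareto, exponent 1.5): the CLT assumption is violated.
    par_means = sample_means(lambda shape: (1.0 - rng.random(shape)) ** (-1.0 / 1.5))

    for name, m in [("exponential", exp_means), ("pareto a=1.5", par_means)]:
        print(f"{name:>13}: skewness of the sample means = {stats.skew(m):.2f}")

The exponential case lands near zero skew, as the CLT promises; the heavy-tailed case stays heavily skewed no matter how many points go into each mean.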