As someone who works with causal inference most days, I expected to find much fault with this article. I was pleasantly surprised to find how rigorous the article is, despite some other comments here. For more information on the role of randomization in causal inference (experimental and observational), I recommend the books by Paul Rosenbaum, especially, "Observation and Experiment".
> When we randomise people, these influences will still be operating on the outcome, which will vary across the people randomised to our conditions. Does randomisation mean that all these different effects are balanced somehow? No – not least because confounders do not exist in experimental studies! This is for the simple reason that a confounder is something that affects both the exposure and the outcome, and in an experimental (i.e., randomised) study we test for a difference in our outcome between the two randomised groups.
I do not think this sort of word-play is useful. If your random samples are small (and even if statistically adequate) the chance of confounders in one or both groups of an A/B can be relatively high, even though the selection procedure for treatment is random. So, "No – not least because confounders do not exist in experimental studies!" is misleading if it is expected on the basis that randomisation of treatment allocation somehow makes confounding impossible.
That the possibility of confounding is equally likely in both branches remains true for all sample sizes of an A/B where #A=#B and allocation is random. So, in my opinion, not a myth.
The author is using the technical definitions of confounders and covariates without sufficient explanation, and the technical definitions do not match the normal English definitions.
In English, a confounder is any factor that distorts an observation. (My dictionary defines it as throwing into confusion or disarray.)
In causal inference, a confounder is a factor that is correlated with both treatment and outcome. If the treatment is randomly assigned, by construction it is independent of all other factors. Thus, there can be no confounders.
Your example is about observed occurrences of imbalance, but the technical definition is about probabilities. Observed imbalances can still skew inference, but that causes high variance (or low precision). It doesn't cause bias (or affect accuracy).
Adjusting for observed imbalances can reduce variance, but in some circumstances can actually cause bias.
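For intuition, here is a minimal simulation sketch of the variance-reduction part (all names and effect sizes here are made up; this is not from the article): compare the spread of the plain difference-in-means against an OLS estimate that adjusts for a prognostic covariate, across many re-randomisations.

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps, true_effect = 200, 2000, 0.5

    unadj, adj = [], []
    for _ in range(reps):
        x = rng.normal(size=n)                          # prognostic covariate
        t = rng.permutation(np.repeat([0, 1], n // 2))  # random 50/50 allocation
        y = true_effect * t + 2.0 * x + rng.normal(size=n)

        # unadjusted estimate: plain difference in means
        unadj.append(y[t == 1].mean() - y[t == 0].mean())

        # adjusted estimate: OLS of y on treatment and covariate
        X = np.column_stack([np.ones(n), t, x])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        adj.append(beta[1])

    print("unadjusted: mean %.3f  sd %.3f" % (np.mean(unadj), np.std(unadj)))
    print("adjusted:   mean %.3f  sd %.3f" % (np.mean(adj), np.std(adj)))
    # both means should sit near 0.5; the adjusted sd should be noticeably smaller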
Picking a specific definition of a word, expounding its consequences, and then referring to the colloquial usage of the word as a “myth” is just word-play as I said.
Many do think of confounders in an experimental context as just those effects which correlate with both outcome and treatment. The non-sequitur — barring a specific definition — is concluding that since nothing can correlate with random allocation, confounders are impossible by construction.
Why impossible? Because we are talking about the probability of allocation, not the actual allocation, and confounding does not refer to the result. We’d instead say there are imbalanced covariates, but that’s ok because randomisation converts “imbalance into error”. Yet, the covariates may be unknown, and without taking measurements prior to the treatment, how are we supposed to know whether the treatment itself or just membership of the treatment group explains the group differences?
Had we not tested the samples prior to treatment, the result would be what many would call “confounded” by the differences in the samples prior to treatment.
The best available defense against the possibility of spurious results due to confounding is often to dispense with efforts at stratification and instead conduct a randomized study of a sufficiently large sample taken as a whole, such that all potential confounding variables (known and unknown) will be distributed by chance across all study groups and hence will be uncorrelated with the binary variable for inclusion/exclusion in any group.
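As a rough illustration of that "uncorrelated with inclusion/exclusion" claim (a sketch, not a proof; the skewed covariate here is just a stand-in), the chance correlation between any fixed covariate and a random 50/50 allocation shrinks at roughly 1/sqrt(n):

    import numpy as np

    rng = np.random.default_rng(1)
    for n in (20, 200, 2000, 20000):
        cors = []
        for _ in range(500):
            x = rng.lognormal(sigma=1.5, size=n)            # skewed background covariate
            t = rng.permutation(np.repeat([0, 1], n // 2))  # random allocation
            cors.append(abs(np.corrcoef(x, t)[0, 1]))
        print(n, "median |corr| ~ %.3f" % np.median(cors))
    # the median chance correlation falls off at roughly 1/sqrt(n)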
Sure, the chance of (unbalanced) “confounders” can be high in a small sample. But the statistical machinery you’re using is designed to handle that. If you try to avoid it then you’re violating the underlying assumptions of that statistical machinery, no?
It doesn't work that way, and it is common to test for known imbalances, or to implement stratified sampling. As another poster said, confounders do not have to be conveniently distributed, even though that is typically assumed. It could be that a confounder carries a big effect but appears only once every N samples because it is sparse. In which case you could have large, randomly allocated, but confounded samples.
All these things can and do happen in randomised experiments, but it is still orders of magnitude more interpretable than what can happen in observational studies.
Sure, if you correctly implement stratified sampling that’s fine. But then you’ve replaced the statistical machinery with one that takes this balancing into account.
If on the other hand you just redo the sampling until it seems balanced then you’ve violated the assumptions behind the standard statistical tooling.
If the confounders are fat-tail distributed etc. then arbitrarily large samples can still be inadequate.
The idea that even thousands of data points in subgroups are going to be 'well mixed' relies on extremely strong assumptions about the distribution of those traits.
I'd love to read similar pointers on sampling in a streaming setting: unknown population size, semi-continuous sampling. The limits of your inference and also its strengths.
> This approach has given us the oft-repeated (but slightly fallacious) line that 'correlation does not imply causation' (I would say that it can imply it, just often not much more).
I always thought that this expression used the word "imply" in its mathematical sense, i.e. "correlation is not a sufficient condition for causation". Did I get that wrong, or did the author?
> If our randomisation mechanism influences our exposure (which by definition it should) and nothing else (ditto), and we see a difference in our outcome, then this difference must have been caused by the exposure we manipulated.
Just false. It's kinda sad and funny, I guess, that this article exists on a psychology site.
The outcome can be caused by uncontrolled factors, and randomisation is a very limited technique for causal control.
E.g., suppose traits T1...Tn are distributed throughout a population in a power-law fashion (e.g., Zipf), so that Ti is had by (1/2^i)%. Suppose each has some non-trivial effect on the relevant outcome.
Then even for tens of thousands of people, you can still get "random" divisions of a population where, say, one half has all the T9 traits -- and thereby the "effect" is just due to this random segmentation.
In the hard sciences causes can actually be controlled, by eg., literally placing your hand on some part of the experiment to stop it moving (or equivalent). This is the "third option" which is missing from the intro: actual science.
When causes cannot be controlled, all inferences are highly provisional (defeasible, suspect, ...). And I'd prefer we used a whole different set of methods, terminology, etc. in this case -- "speculative science" or some such.
> Then even for tens of thousands of people, you can still get "random" divisions of a population where, say, one half has all the T9 traits -- and thereby the "effect" is just due to this random segmentation.
You have chosen a very high-variance distribution for the trait, so in this experiment the sampling error would also have very high variance - big enough to capture this effect you are talking about. The article mentions this. Random sampling does not guarantee we don't get a false positive, but it lets us quantify the probability of a false positive, and pick an acceptable risk of false positive.
This probability depends on the variance of the underlying trait: a high-variance trait takes an impossibly large sample to get the same discriminatory strength as a sample of a low-variance trait.
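To put rough numbers on that (a back-of-the-envelope sketch using the standard two-sample sample-size approximation; the effect size and standard deviations are made up), the required n per arm scales with the square of the outcome's standard deviation:

    from scipy.stats import norm

    def n_per_arm(sigma, delta, alpha=0.05, power=0.80):
        """Approximate n per arm to detect a mean difference delta when the outcome sd is sigma."""
        z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
        return 2 * (z * sigma / delta) ** 2

    for sigma in (1, 5, 20):
        print("sd = %2d  ->  ~%d per arm" % (sigma, n_per_arm(sigma, delta=0.5)))
    # sd = 1 needs ~63 per arm; sd = 20 needs ~25,000 per arm for the same effect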
> In the hard sciences causes can actually be controlled, by eg., literally placing your hand on some part of the experiment to stop it moving (or equivalent).
Things can be controlled in softer sciences also, it's just that it's sometimes hard to agree on which the meaningful causal antecedents are to get results that replicate out of sample. We can run into that case in hard sciences also, and we do from time to time – the typical example is how insufficient control of unknown factors caused Newton to believe relativistic effects didn't exist.
I fail to see how your "third option" is fundamentally different from the two options already specified in the article.
In my example, the point is we do not know the trait distribution -- so how are you accounting for it?
You don't see a difference between actually necessarily controlling a cause by causal intervention, vs., "somehow, hopefully, on average" possible relevant causes are controlled for?
One way is with modeling assumptions. Recall that the t-test was invented to be used in controlled beer fermentation experiments, which I think should fall into your "hard" science category. But the t-test has strict distributional assumptions, without which the test is not valid and its results are meaningless.
Another way is that we can improve our modeling assumptions, or at least our interpretation of modeled results, by doing basic descriptive data analysis before experimenting. Unless you are extremely starved for data, you will very quickly notice if your data follows something like a power law distribution. Then you can adjust your experimentation and modeling accordingly, even if it's just tempering your ability to draw conclusions for a randomized controlled trial.
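For example, a couple of crude descriptive checks along those lines (a sketch; the thresholds are arbitrary and the Pareto draw is just a stand-in for a heavy-tailed trait):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.pareto(1.5, size=5000) + 1   # stand-in for a heavy-tailed trait

    print("mean / median:", np.mean(x) / np.median(x))                  # >> 1 hints at heavy tails
    print("max / 99th percentile:", np.max(x) / np.quantile(x, 0.99))   # huge ratios are a red flag
    print("top 1% share of the total:", np.sort(x)[-len(x) // 100:].sum() / x.sum())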
Yet another way is with replication of experiments. We are not talking about observational data here, so these experiments can be replicated. A power-law distribution, again, would be noticeable here, and we'd see a high variance in results across individual experiments.
> You don't see a difference between actually necessarily controlling a cause by causal intervention, vs., "somehow, hopefully, on average" possible relevant causes are controlled for?
I don't, because "actually necessarily controlling a cause by causal intervention" does not exist in real life. It's just a matter of how much the data varies around the average case, which tends to be greater in the social sciences than in the natural sciences... on average, with plenty of variation around that average.
In a sample of 10,000, the number who have trait T9 is about 20. The chance of all 20 being in a randomly chosen half of the sample is about 1 in 500,000. It's quite unlikely that trait T9 is randomized this way, and it's also quite unlikely that the effects of just 20 T9 individuals on the outcome of interest are so strong that a typical analysis would find a significant difference between the treatment and control arms. The chance of both these things being simultaneously true is negligible. Do I misunderstand you?
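(For what it's worth, that back-of-the-envelope figure checks out against the hypergeometric distribution; a quick sketch, assuming ~20 carriers among 10,000 people and a 5,000-person arm:)

    from scipy.stats import hypergeom

    # 10,000 people, 20 carriers of T9, one arm of 5,000
    p_named_arm = hypergeom.pmf(20, 10000, 20, 5000)  # all 20 land in a particular arm
    p_either_arm = 2 * p_named_arm                    # all 20 land in the same arm, either one
    print(p_named_arm, 1 / p_either_arm)              # ~9.4e-7, i.e. roughly 1 in 530,000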
The issue is that we have T1..Tn in an individual, so there's a very large number of ways you can get one group to have confounders.
The role of the power law is to imply that the generative process which distributes these traits isn't "nice", so that one group can easily get a T9 that the other group doesn't have, and so on, for all T1...Tn.
So you have this, let's say adversarial, background generative process which is giving you these confounding traits but never enough of each that you get nice mixtures.
You could see it as a problem of uniform sampling across many power-law factors to deliver uniform distributions of those factors. I haven't written a simulation, but I don't see why this wouldn't be a serious problem for randomisation.
With a large sample size, it's astronomically unlikely for there to be a confounding trait that is important to the outcome and is also widespread in only one of the experimental arms. If you were to write a simulation showing the issue, I might be able to explain more specifically why I think the simulation doesn't reflect reality.
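If anyone wants to try it, here is a minimal sketch of what such a simulation might look like (assuming prevalences of 1/2^i as in the example, additive made-up effect sizes, and no real treatment effect). If randomisation is doing its job, the false-positive rate should sit near the nominal 5% even though rare traits are frequently imbalanced between arms:

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(3)
    n, reps, n_traits = 10_000, 500, 15
    false_pos = 0

    for _ in range(reps):
        # trait Ti has prevalence 1/2^i and some non-trivial additive effect on the outcome
        traits = np.column_stack(
            [rng.random(n) < 0.5 ** i for i in range(1, n_traits + 1)]
        ).astype(float)
        effects = rng.uniform(0.5, 2.0, size=n_traits)
        y = traits @ effects + rng.normal(size=n)           # note: no treatment effect at all

        alloc = rng.permutation(np.repeat([0, 1], n // 2))  # pure random allocation
        p = ttest_ind(y[alloc == 1], y[alloc == 0]).pvalue
        false_pos += p < 0.05

    print("false-positive rate:", false_pos / reps)  # should hover around 0.05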
> Then even for tens of thousands of people, you can still get "random" divisions of a population where, say, one half has all the T9 traits -- and thereby the "effect" is just due to this random segmentation.
This complaint reveals a lack of understanding about statistics and probability. The point is that in a well-controlled experiment, the random variation is uncorrelated with the treatment. Nobody ever claimed, or ever would claim, that control removes random variation entirely, or that it's impossible to randomly obtain a data sample that looks like a causal effect.
Trying to deal with this problem is literally why the entire field of probability and statistics exists. It is not at all a valid criticism of this article or the causal reasoning behind randomized controlled trials. The article even talks about this (indirectly and clumsily) in the section about "balancing covariates".
Recall the definition of a p-value: the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one we observed simply due to random variation in sampling, measurement noise, etc.
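To make that concrete, here is a tiny sketch of computing a p-value directly by permutation (the numbers are invented; the logic is just "how often does reshuffling the allocation produce a gap at least as big as the one we saw"):

    import numpy as np

    rng = np.random.default_rng(4)
    treated = np.array([5.1, 6.3, 4.8, 7.0, 5.9])   # made-up outcomes
    control = np.array([4.2, 5.0, 4.9, 5.5, 4.4])
    observed = treated.mean() - control.mean()

    pooled = np.concatenate([treated, control])
    count = 0
    for _ in range(100_000):
        perm = rng.permutation(pooled)               # pretend the allocation had come out differently
        diff = perm[:5].mean() - perm[5:].mean()
        count += abs(diff) >= abs(observed)

    print("permutation p-value ~", count / 100_000)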
> In the hard sciences causes can actually be controlled, by eg., literally placing your hand on some part of the experiment to stop it moving (or equivalent). This is the "third option" which is missing from the intro: actual science.
No, you still need to do statistical analysis of "hard science" lab experiments too, for exactly the same reason. Experimental control is never perfect, and literally every observed data point ever is a random draw from a noisy data-generating process.
> speculative science
There is a lot to criticize in how science, especially (but not exclusively) social science, handles statistical rigor and the communication of results that depend on statistical analysis. But this is not that.
In the hard sciences, it's often possible to isolate the phenomenon of interest away from any other influencing factors, e.g. in a laboratory. But many phenomena, like social interactions, or even agriculture, are difficult to isolate in this way. Randomization provides another way of "zooming in" on the treatment of interest.
In the example you gave, a test is going to have very low power because of the important factor with huge variance. If that factor is observed, you can create pairs of units with that factor identical within the pair, then randomly assign treatment to one unit in each pair.
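In code, that pairing step is straightforward (a sketch with invented names, assuming the high-variance factor is measured for every unit before treatment):

    import numpy as np

    rng = np.random.default_rng(5)
    factor = rng.lognormal(sigma=2.0, size=100)  # observed high-variance factor

    # sort units by the factor, pair neighbours, then flip a coin within each pair
    order = np.argsort(factor)
    treatment = np.empty(100, dtype=int)
    for pair in order.reshape(-1, 2):
        coin = rng.integers(2)
        treatment[pair[0]] = coin
        treatment[pair[1]] = 1 - coin

    print(treatment.sum())  # exactly 50 treated, one per pair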
The traits I was talking about are unobserved/unknown. As mechanisms become more complex from the genetic to cellular to social, power laws appear more often and the number of unobservables grows at least super-linearly. On these grounds I think that there really isn't any kind of social science possible.
This is a complex topic, but it's a bit simpler when outcomes are bounded, such as a binary outcome that either occurs or does not occur. In that case, the impact of any one factor is bounded.
In the scenario you're describing, this other factor drowns out any influence the treatment has on the outcome. You'll struggle to get a statistically significant result (low power) and the confidence interval on the treatment effect will include 0. This too can be a valuable finding: sometimes the answer is that the treatment is not particularly effective.
Given the practical predictability of at least some (broadly) social phenomena and interventions (or in other areas with large "factor surfaces"), not sure why any kind of social science is impossible as such. Maybe some things are out of reach, but that would hold for other sciences, too.
It's extraordinary that an article of this kind begins, "When we are interested in cause and effect relationships ... we have two options" and omits the option most characteristic of science.
Given there's a tremendous amount of reputational damage that has been done to science by those who have omitted science itself from their practice of science, I don't have much patience for this omission.
If one wants to educate an informed reader on the scientific method, you ought to begin with a setup of the "problem of science" (that of causes, effects, and their controls) that makes it clear that these far less reliable methods are indeed far less reliable.
What this article does, instead, is claim the opposite. It omits the ideal case where science is possible, then proceeds to claim a status for randomisation (as a method) far above what it's capable of --
"Science" is not a thing in itself, or it's at the very least heavily polysemous and vague. It's a collection of institutions, processes, methodologies, and cultures that can sometimes, under the right conditions produce greater collective certainty about the universe. It's not a magic wand that works great when measuring physical processes but goes flaccid when empirical measurement gets difficult. In fact, that's why it's such a valuable concept and is clearly distinct from "magic".
Just because you prefer "hard science" because it's easier to control variables doesn't give you license to push your pet definition (notably, not provided) and value-judgements about the word onto other people. (Or at least—taking this license destroys your own credibility.) Doing so does just as much reputational damage to the aforementioned institutions, processes, methodologies, and cultures as people who try to draw too much certainty from poorly controlled variables.
What ever happened to nuance and understanding? C'mon! I believe you're capable of better. This kind of rancid tone has no place in serious discussion.
A hard science is one where you can control causes, or have an equivalent operationalizable theory of causal mechanisms, or there is a feasible strategy for the development of either.
Science becomes a half-empirical speculative activity in all other cases. So yes, a lot of genetics (but not all), and so on, is a speculative science. E.g., very rarely some finite number of known genes can be given a known causal mechanism (etc.) -- in these cases, you have a Science.
This seems to be more exposing that “hard science” is an arbitrary label which is not especially useful. Genomics doesn’t magically become non-empirical because you don’t control all of the variables - empiricism is about observation, not control – and all science has some level of speculation because the entire concept is based on collecting evidence to test theories. Just because a biologist or psychologist works with more complex problems doesn’t change that underlying mechanism.
When geneticists are studying e.g. single-gene-to-single-disease relationships, then this is science. When they're studying the possible trajectories of 1000s of genes on downstream phenomena with an amazing number of uncontrollable causes... then I'd be inclined to call this pseudoscience.
The line isn't arbitrary at all, but based on hard facts about the nature of reality and our ability to measure it.
Modelling the weather is a science if it is done based on extremely recent history, and the models are explanatory and accurate to within a relevant time horizon. If you use the same models to predict the weather next year, that's pseudoscience.
Pseudoscience often looks deeply scientific -- reusing all the same statistical formulae, etc. -- but this is a charade. It is reality which decides that these techniques are broken, not the techniques themselves.
Sure, and you’re welcome to have personal preferences. My point is just that “hard science” sounds like an objective term but it’s neither well-defined nor objective. Calling it “specialties which mjburgess likes” would be just as meaningful.
Correct. But the problem for these areas is that they cannot operationalise "works", because there's no causal theory. There's literally just "measure this, measure that, assume some unknown mechanisms with extremely strong properties that we have no evidence for ..."
So we're in a very, very bad situation in these cases. It isn't that mere speculation is involved as an input; it's that mere speculation is the output.
Not really my experience when helping on, for example, gene array data analyses: there's often a lot of causal theory (or at least hardened hypotheses) developed going into the experiments. People don't just randomly measure things.
Sorry, but what is an "operationalizable theory of causal mechanisms" other than the ability to control causes?
I guess there is rather little hard science done in the world then / a lot of science is speculative in your view (maybe only most of mathematics would survive, if it were a science).
There's a lot of science: chemistry, physics, some biology -- etc.
Mathematics isn't a science at all, since it's a study of number, not of cause: there is no inferential gap between measures and their causes in mathematics, because there are no measures.
> operationalizable theory of causal mechanisms
A theory whose terms refer, at some point, to measurable variables that can in some contexts be controlled. So, eg., you might have a theory of light which tells you necessarily how to adjust for some lighting effect in some experiment, even if you cannot control it in that experiment.
The key relation that Science has is Necessity, since it is a study of cause. If your "science" has no means of obtaining necessary relata between physical objects, then it's not a science.
> there is no inferential gap between measures and their causes in mathematics, because there are no measures
That's not true. For example in researching prime numbers, mathematicians carry out simulations and produce observational results from those simulations, without having a fully-proven formal theory to support the results.
So not being able to control a cause in an experiment is fine then - how is that not speculative science in that experiment if the theory isn't a theory of everything (and it still could be wrong!)?
I'm not trying to equate science with "perfect knowledge". I'm simply saying that if you open a physics textbook you'll find causal theories. These cannot, often, be tested in a wide variety of experimental setups. Instead they are used to design the experiments, and sometimes, post-facto, adjust the measures based on these theories.
This is perfectly fine. And it does make science in some minimal literal sense "speculative".
My issue is with an extreme departure from this method. Where you open a psychology textbook and there are no causal theories, no quantification of the number and nature of causes, and hence no way of designing experiments with controls, and so on.
These kinds of activity may, at some great distance, resemble each other -- but one is capable of reliably producing knowledge (even if in small quantity), and the other is not. The latter mechanism produces, imv, at least as much wholesale nonsense.
Getting to causal theories from hypotheses needs a lot of experiments that were somewhat speculative at the time. And, yes, there are well and poorly designed experiments (and analyses), but I don't think the speculative nature prior to having a theory is what necessarily separates them.