> In ML, you have a clear mechanism to check estimation/prediction through holdout approaches.
To be clear, you can overfit even while your validation loss looks fine. If your train and test data are too similar, then no holdout will help you measure generalization. You have to remember that datasets are proxies for the thing you're actually trying to model; they are not the thing itself. You can usually see this when testing on in-class but out-of-distribution data (e.g. data collected by someone else).
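A toy sketch of that failure mode (all names and numbers here are made up for illustration): a pure-memorization model, evaluated on a "holdout" that is near-duplicates of the training rows, looks great, while fresh data from a slightly shifted distribution tells a different story.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: label = sign of a noisy linear score.
def make_data(n, shift=0.0):
    X = rng.normal(loc=shift, size=(n, 5))
    y = (X.sum(axis=1) + rng.normal(scale=2.0, size=n) > 0).astype(int)
    return X, y

def knn_predict(X_train, y_train, X):
    # 1-nearest-neighbor: pure memorization of the training set
    d = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return y_train[d.argmin(axis=1)]

X_tr, y_tr = make_data(200)

# "Leaky" holdout: test rows are near-duplicates of train rows
X_leak = X_tr + rng.normal(scale=0.01, size=X_tr.shape)
y_leak = y_tr

# Honest check: fresh draws from a shifted distribution (someone else's data)
X_new, y_new = make_data(200, shift=0.5)

acc_leak = (knn_predict(X_tr, y_tr, X_leak) == y_leak).mean()
acc_new = (knn_predict(X_tr, y_tr, X_new) == y_new).mean()
print(f"leaky holdout: {acc_leak:.2f}, fresh data: {acc_new:.2f}")
```

The leaky holdout reports near-perfect accuracy because it's measuring memorization of the dataset (the proxy), not generalization to the thing the dataset stands in for.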
You have to be careful, because a lot of small and non-obvious things can fuck up statistics. There are aggregation "paradoxes" (Simpson's, Berkson's) and all kinds of effects that can creep in, and this gets more perilous the bigger your model is. The Monty Hall problem is a great example of how easy it is to get the wrong answer while it seems like you're doing all the right steps.
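Simpson's paradox in four lines of arithmetic, using the classic kidney-stone numbers (Charig et al. 1986): treatment A wins within every subgroup, yet loses on the pooled data.

```python
# (successes, total) per treatment within each subgroup
groups = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    return successes / total

for name, g in groups.items():
    print(name, f"A={rate(*g['A']):.0%}", f"B={rate(*g['B']):.0%}")
    # A beats B in both subgroups

# Pool across subgroups and the ranking flips:
tot = {t: tuple(map(sum, zip(*(g[t] for g in groups.values())))) for t in "AB"}
print("pooled", f"A={rate(*tot['A']):.0%}", f"B={rate(*tot['B']):.0%}")
# A: 93% and 73% within groups, but only 78% pooled vs B's 83%
```

The flip happens because the subgroup sizes are lopsided (A was given the hard cases); aggregate first and you'd draw exactly the wrong conclusion.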
As for the article, the author is far too handwavy about causal inference. The reason we tend not to do it is that it is fucking hard and it scales poorly. Models like autoregressive models (careful here) and normalizing flows can do causal inference (and causal discovery), fwiw (essentially you need explicit density models with tractable densities, in the sense of Goodfellow's taxonomy of generative models). But things get funky as you add variables, because there are indistinguishable causal graphs (see Hyvärinen and Pajunen). Then there are the different levels of causality (see Judea Pearl's Ladder of Causation), and counterfactual inference is FUCKING HARD, but the author just acts like it's no big deal. Then he starts conflating it with weaker forms of causal evidence. Correlation is the weakest rung of that ladder, despite our oft-chanted "correlation does not imply causation" (which is still true; the saying is really warning about confounding variables). This very much does not scale. Similarly, discovery won't scale, because the number of candidate graphs over your variables explodes. The curse of dimensionality hits causal analysis HARD.