Jiahai Feng

Simpson's paradox

or Why Statistics is Scary

Simpson’s Paradox is something I’ve come across here and there, but I never fully understood its implications until Peng Ding’s Causal Inference class. Once I did, I became a lot less trusting of statistical correlations.

The Paradox in Examples

There are two real-life examples of Simpson's Paradox that Peng Ding went through. Both appear on the Wikipedia page. Here is the gist:

Case 1: Kidney stone treatments

A study was conducted to compare the effectiveness of two treatments for kidney stones. The study looked at historical clinical data of people who underwent either treatment and computed their success rates. Treatment A was more aggressive than treatment B, and the doctors expected treatment A to be more effective. However, the study found that overall, people who underwent treatment A actually had lower success rates than people who underwent treatment B. What gives?

It turned out that the severity of the kidney stones affected whether doctors prescribed treatment A or treatment B. Patients who underwent treatment A tended to have more challenging conditions to begin with, and thus ended up having lower success rates. The “correct” way (based on contextual medical knowledge) to interpret the data was to first condition on disease severity, and then evaluate treatment effectiveness.
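To make the reversal concrete, here is a small sketch in Python. The counts below are made up purely for illustration (they are not the figures from the actual study), but they show the same pattern: treatment A has the higher success rate within each severity group, yet the lower success rate overall, because A was mostly given to the severe cases.

```python
# Made-up counts chosen to illustrate the reversal (not the real study's data).
data = {
    # (treatment, severity): (successes, patients)
    ("A", "mild"):   (90,  100),
    ("A", "severe"): (600, 900),
    ("B", "mild"):   (800, 900),
    ("B", "severe"): (60,  100),
}

# Within each severity group, A beats B...
for severity in ("mild", "severe"):
    a_s, a_n = data[("A", severity)]
    b_s, b_n = data[("B", severity)]
    print(f"{severity:>6}: A = {a_s / a_n:.1%}, B = {b_s / b_n:.1%}")

# ...but aggregated over severity, B beats A.
for treatment in ("A", "B"):
    successes = sum(data[(treatment, s)][0] for s in ("mild", "severe"))
    patients = sum(data[(treatment, s)][1] for s in ("mild", "severe"))
    print(f"overall {treatment}: {successes / patients:.1%}")
```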

Case 2: UC Berkeley Graduate Admissions

The 1973 admissions data for the graduate schools of UC Berkeley was studied, and researchers found that women had a far lower admission rate than men. The difference was absurd (35% for women vs 44% for men). Here, admission rate simply refers to the proportion of applicants of that gender who were admitted. Reading the results of this study today, we are probably not surprised that there could be significant systematic bias against women, especially back in 1973.

Interestingly, when researchers tried to identify the offending departments at UC Berkeley most responsible for the bias, they could not find any department that exhibited significant bias against women - in fact, they found some statistical evidence that, at the department level, some departments exhibited a small bias against men. All they did differently from the initial analysis was to look at admission rates on a per-department level. What gives?

It turned out that women, more than men, tended to apply to more selective departments. Even though individual departments were relatively unbiased, having more women apply to selective departments lowered the overall admission rates of women when aggregated over the entire school.
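Again, a minimal sketch with made-up numbers (not the actual 1973 data): in this toy version every department admits women at a slightly higher rate than men, yet the aggregate admission rate for women comes out far lower, simply because women apply in much larger numbers to the selective department.

```python
# Made-up application counts and per-department admission rates.
applicants = {   # (department, gender): number of applicants
    ("easy", "men"): 800, ("easy", "women"): 200,
    ("hard", "men"): 200, ("hard", "women"): 800,
}
admit_rate = {   # (department, gender): admission rate within that department
    ("easy", "men"): 0.80, ("easy", "women"): 0.82,
    ("hard", "men"): 0.20, ("hard", "women"): 0.22,
}

# The aggregate rate is a weighted average of per-department rates,
# weighted by where each gender applies.
for gender in ("men", "women"):
    admitted = sum(applicants[(d, gender)] * admit_rate[(d, gender)]
                   for d in ("easy", "hard"))
    total = sum(applicants[(d, gender)] for d in ("easy", "hard"))
    print(f"{gender}: {admitted / total:.0%} admitted overall")
# men: 68% admitted overall, women: 34% admitted overall
```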

Simpson’s Paradox in the Abstract

In general, suppose you have a real variable $Y$ that you care about (e.g. success rates of treatment or admission rates), and some covariate $X$ (e.g. treatment type or gender). The paradox occurs when you get one trend if you analyze the dependence of $Y$ on $X$, but a completely different one if you analyze the dependence of $Y$ on $X$, conditioned on another covariate $Z$.

Mathematically, in the abstract, it seemed obvious to me that the marginal distribution $P_{X, Y}(x, y)$ could look completely different from the conditional distributions $P_{X, Y | Z}(x, y | z)$ for any choice of $z$. That is to say, I was not surprised by the mathematical statement of Simpson's Paradox.

However, what I didn't fully appreciate were the practical implications. What I realize now is that any time I have joint data for two variables $X$ and $Y$, and I want to do some kind of regression to find the dependence of $Y$ on $X$, I should be extremely afraid of drawing any conclusion about the real relationship between $Y$ and $X$ in the underlying data-generating process.

The moral of the two stories presented above is that you should pay extremely close attention to the control variables, because any variable you don't control could end up confounding the results and possibly negate your conclusion entirely. In these two cases, we are fortunate to have domain knowledge telling us which variables are the "natural" ones to control, and hence condition on, when doing statistical analyses. Statistical conclusions are only as strong as the often implicit assumptions that go into choosing which variables to control. In the admissions case, it seemed fitting to control for department, but what about other variables such as the level of preparation, aptitude, tenacity, or other qualities that a candidate might possess that make them a deserving admit? What if conditioning on these variables reversed the gender bias conclusion yet again?

Simpson’s Paradox in OLS

Here’s a formulation of Simpson’s paradox in Ordinary Least Squares (OLS). Suppose you have a bunch of data $(X, Y, Z)$. You find that

$$Y \sim \beta_0 X + \alpha_0 Z + \epsilon$$

is the OLS fit. The $\beta_0$ you get here can be interpreted as the dependence of $Y$ on $X$ if we condition on $Z$. Contrast this with the $\beta_1$ you obtain from fitting

$$Y \sim \beta_1 X + \epsilon,$$

which you interpret as the aggregate dependence of $Y$ on $X$, without conditioning on $Z$.

It's not difficult to imagine distributions of $(X, Y, Z)$ for which Simpson's paradox arises, i.e. with $\beta_0$ and $\beta_1$ having opposite signs. These two graphs from Wikipedia give good hints on how they might be constructed: Graph 1, Graph 2
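Here is one such construction as a numerical sketch with numpy. The data-generating process is entirely made up, chosen only so that the reversal appears: within each value of $Z$, $Y$ decreases with $X$, but $Z$ drags both $X$ and $Y$ upward. Both fits are done without an intercept, to match the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Made-up data-generating process: Z is a confounder that pushes both X and Y up,
# while the "within-Z" effect of X on Y is negative.
Z = rng.integers(0, 5, size=n).astype(float)
X = Z + 0.5 * rng.standard_normal(n)
Y = -X + 3 * Z + 0.5 * rng.standard_normal(n)

# Fit Y ~ beta_0 X + alpha_0 Z (conditioning on Z).
(beta_0, alpha_0), *_ = np.linalg.lstsq(np.column_stack([X, Z]), Y, rcond=None)

# Fit Y ~ beta_1 X (no conditioning).
(beta_1,), *_ = np.linalg.lstsq(X[:, None], Y, rcond=None)

print(f"beta_0 (conditioning on Z): {beta_0:+.2f}")  # negative, roughly -1
print(f"beta_1 (no conditioning):   {beta_1:+.2f}")  # positive: the trend reverses
```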

Simpson’s Paradox in OLS, everywhere

I initially assumed that you need specially constructed distributions to exhibit Simpson’s paradox, but after thinking about it for a while, I realized that any distribution of $(X, Y, Z)$ can exhibit Simpson’s paradox if we get to choose which variable to control.

The setup is as follows:

Suppose, as above, I have data $(X, Y, Z)$ and I’m interested in how $Y$ depends on $X$. My expert friend tells me that $Z$ is the important variable to control for, and so I perform the OLS fit: $$Y \sim \beta_0 X + \alpha_0 Z + \epsilon$$

I show this to my other expert friend, who then says that $Z$ is the wrong variable to condition on. What I need to condition on is the variable $$Z' = Z + 2\frac{\beta_0}{\alpha_0}X.$$

Now, if I do the OLS fit $$ Y \sim \beta_2 X + \alpha_2 Z' + \epsilon,$$ I will find that $\alpha_2 = \alpha_0$, but $\beta_2 = -\beta_0$. (Proof in footnote${}^1$.) The dependence of $Y$ on $X$, under a different set of control variables, has completely reversed! Simpson’s paradox${}^2$ thus lurks in every distribution (at least whenever $\alpha_0 \neq 0$), and we have to be constantly vigilant!
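Here is a quick numerical check of this construction, reusing the same made-up data-generating process as in the previous snippet (again with no intercepts, to match the formulas):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
Z = rng.integers(0, 5, size=n).astype(float)        # same made-up data as before
X = Z + 0.5 * rng.standard_normal(n)
Y = -X + 3 * Z + 0.5 * rng.standard_normal(n)

def ols(design, y):
    """OLS coefficients without an intercept."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

beta_0, alpha_0 = ols(np.column_stack([X, Z]), Y)    # first expert: control for Z
Z_prime = Z + 2 * (beta_0 / alpha_0) * X             # second expert's variable
beta_2, alpha_2 = ols(np.column_stack([X, Z_prime]), Y)

print(f"beta_0 = {beta_0:+.3f},  beta_2 = {beta_2:+.3f}")    # beta_2 == -beta_0 (up to floating point)
print(f"alpha_0 = {alpha_0:+.3f}, alpha_2 = {alpha_2:+.3f}")  # alpha_2 == alpha_0
```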

In practice, if we suspect that any variable that we are unable to control for (perhaps because of experimental limitations) is strongly correlated with one of the covariates we’re studying, then we should be extremely cautious of any result that arises from statistical analyses - you might even say that the conclusions are useless.

Philosophical musings

Simpson’s paradox was presented in Peng Ding’s class essentially to motivate the entire field of Causal Inference. His presentation certainly shook me - I was aware that correlation doesn’t mean causation, but I didn’t know it was this bad. In addition, the complete dependence of the OLS coefficient on the choice of control variables invalidates any study that does not systematically investigate every possible confounder and justify its choice of control variables. I am now more mistrustful of observational studies, and I have become more cautious when applying statistics in my own work.

Particularly in the study of mechanistic interpretability, I often encounter complex deterministic systems. In many of these situations, where strictly speaking everything is perfectly correlated${}^3$, it is extremely unclear what the correct variables to control, manipulate, or measure are. Sometimes I just make a choice and hope that it is good enough, but this exploration into Simpson’s paradox has made me more cautious about these decisions and about how much to trust my results.

As a final word, I want to reiterate how scary doing statistics really is. In the two examples, we are fortunate to have domain expertise that gives some sense of what the “correct” control variables are or what the “correct” results should be. What if we don’t have domain expertise, or if our domain expertise contains false beliefs? To echo my takeaway from earlier:

Statistical conclusions are only as strong as the often implicit assumptions that go into choosing which variables to control.

Footnotes

  1. Here’s a quick proof. Ditch the above notation. Let $Y\in \mathbb{R}^{n\times 1}$, $X\in \mathbb{R}^{n\times d}$ be the data that we have, with $n$ data points and $d$-dimensional covariates. Let $\beta\in \mathbb{R}^{d\times 1}$ be the minimizer of $||Y-X\beta||^2$. Then, for any invertible $M\in \mathbb{R}^{d\times d}$, $||Y-X\beta||^2 = ||Y-XMM^{-1}\beta||^2$. This means that the OLS solution for $(Y, XM)$ is $M^{-1} \beta$. Back in the notation of the main text, the covariates $(X, Z)$ are transformed to $(X, Z + cX)$ with $c = 2\beta_0/\alpha_0$, i.e. $M = \begin{pmatrix}1 & c\\ 0 & 1\end{pmatrix}$, so the new coefficients are $M^{-1}(\beta_0, \alpha_0)^T = (\beta_0 - c\alpha_0,\ \alpha_0)^T = (-\beta_0,\ \alpha_0)^T$.
  2. Technically, Simpson’s paradox usually refers to reversing trends when taking aggregates vs conditioning on a specific variable, whereas here we reverse trends by conditioning on two different variables. However, you can easily recover the aggregate setting by changing the story so that the first expert friend recommends not conditioning on anything at all, and the second expert friend recommends conditioning on $$Z'' = Z + \frac{\beta_0 + \beta_1}{\alpha_0}X.$$
  3. To muse even further, the world itself is a deterministic system (if you consider the wavefunction of the universe). What are the implicit assumptions about the world we’re making every time we do causal inference?