When Data Lies: Simpson’s Paradox–A Step-by-Step Simulation Code Notebook

Justin Belair

Biostatistician and RWE Expert in Pharma, Biotech, & Science | Consultant | Author of Causal Inference in Statistics

Find the RMarkdown Notebook on Github and Run the Code Yourself!

Introduction: A Medical Mystery

Imagine you’re analyzing data from a clinical trial testing a new treatment. You run a simple chi-square test and find that the treatment has absolutely no effect—the test statistic is exactly zero. Case closed, right?

Not so fast. What if I told you that when you analyze the same data separately for men and women, the treatment shows a significant positive effect for both groups? This isn’t a statistical error—it’s Simpson’s Paradox, one of the most counterintuitive phenomena in data analysis.

The Classic Example

Let’s examine the famous example¹ that E.H. Simpson used to illustrate this paradox in 1951, scaled up for statistical power. Consider a hypothetical medical treatment given to male and female patients, where we track survival outcomes.

# Recreate Simpson's original example data
simpsons_original_example_male <- data.frame(
  Status = c("Alive", "Dead"),
  Untreated = c(800, 600),
  Treated = c(1600, 1000)
)

simpsons_original_example_female <- data.frame(
  Status = c("Alive", "Dead"),  
  Untreated = c(400, 600),
  Treated = c(2400, 3000)
)

# Display the data
knitr::kable(simpsons_original_example_male, caption = "Male Patients")

Male Patients
Status	Untreated	Treated
Alive	800	1600
Dead	600	1000

knitr::kable(simpsons_original_example_female, caption = "Female Patients")

Female Patients
Status	Untreated	Treated
Alive	400	2400
Dead	600	3000

Let’s test whether treatment and survival are independent for each group:

# Chi-square tests for males and females separately
male_test <- chisq.test(as.matrix(simpsons_original_example_male[,2:3]), correct = FALSE)
female_test <- chisq.test(as.matrix(simpsons_original_example_female[,2:3]), correct = FALSE)

# Display results
results <- data.frame(
  Sex = c("Male", "Female"),
  Chi_Square = round(c(male_test$statistic, female_test$statistic), 2),
  P_Value = c("< 0.01", "< 0.01")
)

knitr::kable(results, caption = "Chi-Square Tests by Sex")

Chi-Square Tests by Sex
Sex	Chi_Square	P_Value
Male	7.33	< 0.01
Female	6.77	< 0.01

Both tests are highly significant! The treatment appears to have a positive effect on survival for both men and women.

The Plot Twist

Now let’s look at the aggregated data:

# Aggregate the data across sexes
simpsons_aggregated <- data.frame(
  Status = c("Alive", "Dead"),
  Untreated = c(1200, 1200),  # 800+400, 600+600
  Treated = c(4000, 4000)     # 1600+2400, 1000+3000
)

knitr::kable(simpsons_aggregated, caption = "Aggregated Data (All Patients)")

Aggregated Data (All Patients)
Status	Untreated	Treated
Alive	1200	4000
Dead	1200	4000

# Chi-square test for aggregated data
aggregated_test <- chisq.test(as.matrix(simpsons_aggregated[,2:3]), correct = FALSE)
print(paste("Chi-square statistic:", round(aggregated_test$statistic, 2)))

## [1] "Chi-square statistic: 0"

print(paste("P-value:", round(aggregated_test$p.value, 2)))

## [1] "P-value: 1"

The chi-square statistic is exactly zero! The data fits perfectly with what we’d expect if treatment and survival were completely independent.

The Paradoxical Question

As Pearl and colleagues eloquently state:

“The data seem to say that if we know the patient’s gender—male or female—we can prescribe the drug, but if the gender is unknown we should not! Obviously, that conclusion is ridiculous. If the drug helps men and women, it must help anyone; our lack of knowledge of the patient’s gender cannot make the drug harmful.”

So what’s going on here? Should we adjust for sex or not?

A Modern Simulation

Let’s create our own Simpson’s Paradox example with simulated data to see how this phenomenon can arise:

n <- 400 
sd_x <- 2
  
group <- sample(1:5,n, replace = T)
mu_x <- 5.5*(1:5)
group <- factor(group)

n_1 <- sum(group==1)
x_1 <- rnorm(n_1, mu_x[1], sd_x)

n_2 <- sum(group==2)
x_2 <- rnorm(n_2, mu_x[2], sd_x)

n_3 <- sum(group==3)
x_3 <- rnorm(n_3, mu_x[3], sd_x)

n_4 <- sum(group==4)
x_4 <- rnorm(n_4, mu_x[4], sd_x)

n_5 <- sum(group==5)
x_5 <- rnorm(n_5, mu_x[5], sd_x)

y_1 <- 30 + -1*x_1 + rnorm(n_1, 0, 2.5)
y_2 <- 40 + -1*x_2 + rnorm(n_2, 0, 3.5)
y_3 <- 50 + -1*x_3 + rnorm(n_3, 0, 4.5)
y_4 <- 60 + -1*x_4 + rnorm(n_4, 0, 3.5)
y_5 <- 70 + -1*x_5 + rnorm(n_5, 0, 2.5)

x <- c(x_1, x_2, x_3, x_4, x_5)
y <- c(y_1, y_2, y_3, y_4, y_5)

group <- factor(c(rep(1, n_1), rep(2, n_2), rep(3, n_3), rep(4, n_4), rep(5, n_5)))
data <- data.frame(x = x, y = y, group = group)

simpsons.paradox.hidden.plot <- data %>%
  ggplot(aes(x=x, y=y)) +
  geom_point() +
  geom_smooth(method="lm", se = F) +
  ggtitle("Simulated data (hiding Simpson's Paradox)") +
  theme(axis.ticks = element_blank(),
        axis.text = element_blank())

print(simpsons.paradox.hidden.plot)

simpsons.paradox.exhibited.plot <- data %>%
  ggplot(aes(x=x, y=y, colour = group)) +
  geom_point() +
  geom_smooth(method="lm", se = F) +
  ggtitle("Simulated data (exhibiting Simpson's Paradox)") +
  theme(axis.ticks = element_blank(),
        axis.text = element_blank())

print(simpsons.paradox.exhibited.plot)

Now let’s look at the regression coefficients:

lm <- lm(y~x, data=data)
knitr::kable(coef(summary(lm)), 
             caption = "Overall Regression: Y ~ X", 
             digits = 3)

Overall Regression: Y ~ X
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	21.683	0.561	38.676	0
x	0.730	0.031	23.844	0

lm.simpsons.paradox <- lm(y~x + group, data=data)
knitr::kable(coef(summary(lm.simpsons.paradox)), 
             caption = "Grouped Regression: Y ~ X + Group", 
             digits = 3)

Grouped Regression: Y ~ X + Group
	Estimate	Std. Error	t value
(Intercept)	29.519	0.586	50.400
x	-0.932	0.082	-11.300
group2	9.723	0.697	13.954
group3	19.674	1.070	18.389
group4	28.749	1.459	19.700
group5	38.724	1.853	20.897

The overall relationship shows a positive slope, but when we control for the grouping variable, the relationship between x and y becomes negative!

The Causal Inference Solution

The answer to the Simpson’s Paradox puzzle cannot be solved by statistical arguments alone. It requires understanding the causal relationships between variables.

The key question is: Is sex a confounder or not? If sex causally affects both treatment assignment and outcomes, then we should adjust for it. The basic confounding scenario can be represented with a simple causal diagram where Sex (\(S\)) affects both Treatment (\(X\)) and Outcome (\(Y\)), as follows.

theme_set(theme_dag())

confounder_dag <- dagify(Y ~ X + S,
                   X ~ S,
                   exposure = "X",
                   outcome = "Y",
                   coords = list(x = c(Y = 1, X = -1, S = 0)/2,
                                 y = c(Y = 0, X = 0, S = 1)/2)
                   )

confounder_dag <- ggdag(confounder_dag, text = TRUE, node_size = 16*1.5, text_size = 3.88*1.5) +
  geom_dag_edges_link(arrow = grid::arrow(length = grid::unit(15, "pt"), type = "closed"))

print(confounder_dag)

Other DAG structures imply other adjustment strategies: make sure you understand the causal process that generated your data before performing controls such as regression adjustment, stratification, propensity score techniques, etc! If you want to learn more about DAGs, Chapter 3 of my book is already available if you pre-order and contains a full introduction to causal diagrams, including exercises, code, and a case-study.

Why This Paradox Matters

Simpson’s Paradox isn’t just a statistical curiosity—it has real-world implications:

Medical studies: Treatment effects can appear different when stratified by patient characteristics
Educational research: Intervention effects may vary across schools or demographics
Business analytics: Marketing campaigns might show different effectiveness across customer segments
Policy evaluation: Program impacts can depend on population subgroups

Key Takeaways

Aggregation can hide important patterns: Always explore your data at different levels of granularity
Context matters: Statistical significance doesn’t tell you whether to adjust for a variable
Causal thinking is essential: Understanding the data-generating process helps resolve apparent paradoxes
Visualization helps: Plotting data by subgroups can reveal hidden relationships

The Bigger Picture

Simpson’s Paradox is just one example of how statistical analysis without causal reasoning can lead us astray. In my upcoming book, Causal Inference in Statistics, with Exercises, Practice Projects, and R Code Notebooks, I dive deep into these concepts, dissolving statistical one-by-one (Simpson’s pardox, Lord’s Paradox, Berkson’s Paradox, etc.) and providing you with the full-suite of tools to think causally about your data.

The book includes:

Detailed explanations of causal concepts with real-world examples
Step-by-step R/Python code for all analyses
(Many!) Practice exercises to solidify your understanding
Complete case studies with real-world datasets you can analyze yourself

Want to master causal thinking in statistics? Download the first chapter for free and see how causal inference can transform your approach to data analysis.

Causal Inference Communities

If you enjoyed this post and want to receive monthly insights about causal inference in statistics, subscribe to my newsletter. Join the stats nerds! 🤓

You can also join the Causal Inference Nerds Discord Community, a lively community of causal inference enthusiasts, from beginner to expert, eager to share and learn together. We host a journal club, a book club, we share memes, and more. It’s a lot of fun!

Simpson, E. H. (1951). The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society: Series B (Methodological), 13(2), 238–241. Oxford University Press. https://www.jstor.org/stable/i349694 ↩︎

When Data Lies: Simpson’s Paradox–A Step-by-Step Simulation Code Notebook

Justin Belair

Table of Contents

Introduction: A Medical Mystery

The Classic Example

The Plot Twist

The Paradoxical Question

A Modern Simulation

The Causal Inference Solution

Why This Paradox Matters

Key Takeaways

The Bigger Picture

Causal Inference Communities

Recent Posts

Chemistry, Manufacturing, and Control (CMC) in Pharmaceuticals and Biologics: A Comprehensive Regulatory Guide

Chemistry, Manufacturing, and Controls (CMC) in Drug Development, Pharmaceuticals, and Biologics

Functional Genomics and Bioinformatics: Applications, Challenges, and Case-Studies

Functional Genomics and Bioinformatics: A Full Guide to Techniques, Applications, and Approaches

Tags