Find the RMarkdown Notebook on Github and Run the Code Yourself!
Introduction - What is Collider Bias?
Collider bias occurs when we condition on (or select based on) a variable that is influenced by both the exposure and outcome of interest. This seemingly innocent action can create spurious associations between variables that are actually independent. Let’s explore this through some concrete examples.
Example: College Admissions
Consider college admissions where students can be admitted based on
either high intellectual ability or high athletic ability. Let’s
simulate some data where these abilities are actually independent in the
population. Next, we create an indicator for admission
based on whether a student has high intellectual ability or high
athletic ability. We then plot the data to see how the selection process
affects the relationship between intellectual and athletic ability.
library(dplyr)

selection.bias <- data.frame(
  "intellectual.ability" = rnorm(500, 0, 1),
  "athletic.ability"     = rnorm(500, 0, 1)
) %>%
  mutate(admission = (intellectual.ability > 1) | (athletic.ability > 1.5))
If you want the code that creates the plot below, the RMarkdown notebook used for this blog post is freely available on Github.
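As a rough sketch (the notebook has the exact code), a plot along these lines can be produced with ggplot2; the aesthetic choices here are illustrative assumptions, not the notebook's actual settings:

```r
library(ggplot2)

# Scatter plot of the two abilities, coloured by admission status.
# Styling (alpha, labels) is illustrative, not the notebook's exact code.
ggplot(selection.bias,
       aes(x = intellectual.ability, y = athletic.ability, colour = admission)) +
  geom_point(alpha = 0.7) +
  labs(x = "Intellectual ability", y = "Athletic ability", colour = "Admitted")
```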
From this plot, we see that there is essentially no relationship between intellectual and athletic ability in the population. Indeed, fitting a linear regression model yields only a slightly negative slope.
lm(athletic.ability ~ intellectual.ability, data = selection.bias) %>%
  summary()
##
## Call:
## lm(formula = athletic.ability ~ intellectual.ability, data = selection.bias)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.94113 -0.74507 0.01663 0.72882 3.11131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04496 0.04730 -0.951 0.342
## intellectual.ability -0.04308 0.04678 -0.921 0.358
##
## Residual standard error: 1.057 on 498 degrees of freedom
## Multiple R-squared: 0.0017, Adjusted R-squared: -0.0003048
## F-statistic: 0.848 on 1 and 498 DF, p-value: 0.3576
Still, the coefficient is not significantly different from 0, and we know (because we generated the data) that the true value is 0.
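One quick way to see this is to check the confidence interval for the slope with base R's confint(); this snippet re-fits the model on the selection.bias data frame defined above (the exact interval will vary from run to run, since no random seed was fixed):

```r
# 95% confidence interval for the slope in the full population:
# it should cover 0, consistent with the true independence of the abilities.
fit <- lm(athletic.ability ~ intellectual.ability, data = selection.bias)
confint(fit, "intellectual.ability", level = 0.95)
```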
However, when we condition on admission, we see a strong negative relationship between intellectual and athletic ability. This is confirmed by estimating the coefficient of the linear regression model while conditioning on admission.
lm(athletic.ability ~ intellectual.ability,
   data = selection.bias %>% filter(admission)) %>%
  summary()
##
## Call:
## lm(formula = athletic.ability ~ intellectual.ability, data = selection.bias %>%
## filter(admission))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1910 -0.6481 0.1565 0.8005 2.1926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2034 0.1540 7.816 2.96e-12 ***
## intellectual.ability -0.7471 0.1071 -6.974 2.14e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.143 on 114 degrees of freedom
## Multiple R-squared: 0.299, Adjusted R-squared: 0.2929
## F-statistic: 48.63 on 1 and 114 DF, p-value: 2.142e-10
We obtain a highly significant negative coefficient. While the effect is visually intuitive in this specific example, it is not always so clear in real-world data. This is why it is important to learn to draw meaningful DAGs that represent background knowledge. In this specific example, the DAG would look like this.
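As a sketch of how such a DAG can be drawn in R (the original post may use a different tool; the ggdag package and the layout here are assumptions on my part), the collider structure is simply both abilities pointing into admission:

```r
library(ggdag)

# Collider DAG: admission is caused by both abilities,
# which are themselves independent in the population.
dag <- dagify(admission ~ intellectual.ability + athletic.ability)

ggdag(dag) + theme_dag()
```

Conditioning on the collider (admission) opens a non-causal path between the two parent variables, which is exactly the spurious negative association the regression above picks up.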