Selection Bias: A Causal Inference Perspective (With Downloadable Code Notebook)

Justin Belair

Biostatistician in Science & Tech | Consultant | Causal Inference Specialist


Find the RMarkdown Notebook on GitHub and Run the Code Yourself!

Introduction - What is Collider Bias?

Collider bias occurs when we condition on (or select based on) a variable that is influenced by both the exposure and outcome of interest. This seemingly innocent action can create spurious associations between variables that are actually independent. Let’s explore this through some concrete examples.

Example: College Admissions

Consider college admissions where students can be admitted based on either high intellectual ability or high athletic ability. Let's simulate data in which these two abilities are independent in the population. Next, we create an admission indicator based on whether a student has high intellectual ability or high athletic ability. We then plot the data to see how this selection process affects the apparent relationship between intellectual and athletic ability.

library(dplyr)

# Simulate 500 students whose intellectual and athletic abilities are
# independent standard-normal draws, then admit anyone with high
# intellectual ability (> 1) or high athletic ability (> 1.5).
selection.bias <- data.frame("intellectual.ability" = rnorm(500, 0, 1),
                             "athletic.ability" = rnorm(500, 0, 1)
                             ) %>%
  mutate(admission = (intellectual.ability > 1) | (athletic.ability > 1.5))
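
As a quick sanity check of how selective this rule is: with these thresholds, roughly 21% of students should qualify in expectation (P(Z > 1) is about 0.16 and P(Z > 1.5) is about 0.07, minus the small overlap). A one-line check, using the selection.bias data frame created above:

mean(selection.bias$admission)  # proportion admitted, ~0.21 in expectation

In the particular draw used for this post, the filtered regression further below reports 114 residual degrees of freedom, i.e. 116 admitted students (about 23%), consistent with this calculation.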

If you want the code used to create the plot below, the RMarkdown notebook for this blog post is available for free on GitHub.
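
The notebook has the exact plotting code; the sketch below (assuming ggplot2 is loaded alongside dplyr, and not necessarily identical to the notebook's figure) produces the same kind of plot, with the full-sample regression line in black and the admitted-only line in red.

library(ggplot2)

ggplot(selection.bias, aes(x = intellectual.ability, y = athletic.ability)) +
  geom_point(aes(colour = admission), alpha = 0.6) +
  # Fit on the full sample: essentially flat
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  # Fit among admitted students only: clearly negative
  geom_smooth(data = filter(selection.bias, admission),
              method = "lm", se = FALSE, colour = "red") +
  labs(x = "Intellectual ability", y = "Athletic ability", colour = "Admitted")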

From this plot, we see that there is basically no relationship between intellectual and athletic ability in the population. Indeed, when fitting a linear regression model, we get a slightly negative slope.

# Regress athletic ability on intellectual ability in the full sample
lm(athletic.ability ~ intellectual.ability, data = selection.bias) %>%
  summary()
## 
## Call:
## lm(formula = athletic.ability ~ intellectual.ability, data = selection.bias)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.94113 -0.74507  0.01663  0.72882  3.11131 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)          -0.04496    0.04730  -0.951    0.342
## intellectual.ability -0.04308    0.04678  -0.921    0.358
## 
## Residual standard error: 1.057 on 498 degrees of freedom
## Multiple R-squared:  0.0017, Adjusted R-squared:  -0.0003048 
## F-statistic: 0.848 on 1 and 498 DF,  p-value: 0.3576

Yet, the coefficient is not significantly different from 0. We know (because we generated the data) that the true value is 0.

However, when we condition on admission (i.e., restrict the analysis to admitted students), we see a strong negative relationship between intellectual and athletic ability.

This is confirmed by estimating the coefficient of the linear regression model when conditioning on admission.

# Refit the regression among admitted students only
lm(athletic.ability ~ intellectual.ability, data = selection.bias %>% filter(admission)) %>%
  summary()
## 
## Call:
## lm(formula = athletic.ability ~ intellectual.ability, data = selection.bias %>% 
##     filter(admission))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1910 -0.6481  0.1565  0.8005  2.1926 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.2034     0.1540   7.816 2.96e-12 ***
## intellectual.ability  -0.7471     0.1071  -6.974 2.14e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.143 on 114 degrees of freedom
## Multiple R-squared:  0.299,  Adjusted R-squared:  0.2929 
## F-statistic: 48.63 on 1 and 114 DF,  p-value: 2.142e-10

We obtain a highly significant negative coefficient. While the bias is visually intuitive in this specific example, it is not always so clear in real-world data. This is why it is important to learn to draw meaningful DAGs that encode our background knowledge. For our admissions example, the DAG would look like this.
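
One way to draw this DAG in R is with the ggdag package (an assumption here; the original notebook may use a different tool):

library(ggdag)

# Admission is a collider: both abilities point into it,
# yet neither ability causes the other.
admissions_dag <- dagify(admission ~ intellectual.ability + athletic.ability)

ggdag(admissions_dag) +
  theme_dag()

Conditioning on admission (by filtering to admitted students) opens the path intellectual.ability -> admission <- athletic.ability and induces exactly the spurious negative association we estimated above.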