A look at collider bias

Causal Inference

Author

Alex Zajichek

Published

July 15, 2021

There has been a lot of buzz around “causal inference” and given how fundamental the name seems to statistics, I picked up The Book of Why: The New Science of Cause and Effect as a starting point. It’s been a great resource for introducing causal concepts and thinking about the importance of bringing subjectivity into the modeling process. This article looks into one of the concepts introduced: collider bias. All code snippets are written in R.

What is collider bias?

In causal inference, the relationships between a network of events are represented in causal diagrams. In the causal diagram below, event A causes event B which causes event C. Here the term “event” refers to a random variable.

flowchart LR
  A --> B
  B --> C

The direction in which arrows point into the events (forming a junction) have major implications as to how the model behaves. One junction in particular, which is the focus of this article, is called a collider. This is shown in the causal diagram below.

flowchart LR
  A --> C
  B --> C

A collider occurs when an event is independently caused by two other events. In the example, event C is caused by events A and B.

Now, there is no causal relationship between A and B (since there are no arrows between the two). However, if the model were conditioned on C, a correlation between A and B would be induced that does not exist in the unconditional distribution. This is known as collider bias.

A basic example that the book gives is to repeatedly flip two fair coins simultaneously but only keep samples where at least one of the coins shows heads. Let’s do 10,000 flips and tabulate the results:

Code

# Load some packages
require(tidyverse)

# Set the seed
set.seed(123)

# Number of coin flips 
n <- 10000

# Generate random binomials
coin1 <- rbinom(n = n, size = 1, p = 0.5)
coin2 <- rbinom(n = n, size = 1, p = 0.5)

# Make a tibble
tibble(
  `Coin 1` = coin1,
  `Coin 2` = coin2
) %>%
  
  # Convert to factors
  mutate_all(
    ~
      .x %>%
      factor() %>%
      fct_recode(
        Heads = "1",
        Tails = "0"
      ) %>%
      fct_relevel(
        "Heads"
      )
  ) %>%
  
  ### Filter to at least one heads
  filter(
    `Coin 1` == "Heads" |
      `Coin 2` == "Heads"
  ) %>%
  
  # Count the rows
  group_by_all() %>%
  summarise(
    N = n(),
    .groups = "drop"
  ) %>%
  
  # Send over columns
  pivot_wider(
    names_from = `Coin 2`,
    values_from = N,
    values_fill = 0
  ) %>%
  
  # Make a kable
  knitr::kable(
    format = "html",
    caption = str_c("Tabulation of ", n, " simulataneously coin flips excluding cases when both were tails.")
  ) %>%
  kableExtra::kable_styling(
    full_width = FALSE
  ) %>%
  kableExtra::add_header_above(
    c("", "Coin 2" = 2)
  )

Tabulation of 10000 simulataneously coin flips excluding cases when both were tails.
	Coin 2
Coin 1	Heads	Tails
Heads	2403	2540
Tails	2499	0

Now the question: Is the probability of a heads for one coin the same regardless of what the other coin shows? From our data,

\[P(\text{Coin 1 = Heads | Coin 2 = Heads}) = \frac{2403}{2403 + 2499} \approx 0.49\]

\[P(\text{Coin 1 = Heads | Coin 2 = Tails}) = \frac{2540}{2540 + 0} = 1\]

These results suggest that the coins are not independent. When coin 2 is heads, coin 1 is fair. When coin 2 is tails, coin 1 is certain to be heads. However, we flipped the coins independently so we know there is not a relationship between them. This is an example of collider bias. Here is the causal diagram for this model:

flowchart LR
  A[Coin 1] --> C[Keep trial result]
  B[Coin 2] --> C

The decision to keep the result depends on the values of both coins. Once we condition on the trials that we kept, a correlation is induced.

Example: selecting patients for a trial

Let’s go through a more practical example. Note this is completely made up.

Key question

Do patients who live further from a trial center have more severe disease?

The setup

There is a clinical trial that we would like to recruit patients for which will be held at a single location. The study population consists of patients who have a disease of interest and live within 90 miles of the trial center. Suppose the following are true:

65% of patients live near (within 20 miles of) the trial center (denoted N)
25% of patients have severe disease regardless how far they live from the trial center (denoted S)

So we have…

\[P(N) = 0.65 \hskip{.25in} P(S|N) = 0.25 \hskip{.25in} P(S|\bar{N}) = 0.25\] Our sampling strategy will be to select patients at random from the entire population until 1000 are enrolled.

Code

# Set a seed
set.seed(321)

# Base sample size
base_size <- 100000

# Enrollment size
n <- 1000

# Make a data frame
patients <-
  tibble(
    S = rbinom(base_size, 1, .25), # Disease severity
    N = rbinom(base_size, 1, .65), # Distance from center
  ) %>%
  
  # Add enrollment flag based on
  mutate(
    E = 
      # Determine enrollment probability
      case_when(
        N == 1 ~ 0.85,
        S == 1 ~ 0.50,
        TRUE ~ 0.10
      ) %>%
      
      # Sample based on probability
      rbinom(
        n = base_size,
        size = 1
      )
  ) %>%
  
  # Filter to enrolled patients
  filter(
    E == 1
  ) %>%
  
  # Sample the desired enrollment
  slice_sample(
    n = n
  ) %>%
  
  # Convert to factors
  mutate(
    S = 
      S %>%
      factor() %>%
      fct_recode(
        Severe = "1",
        `Non-severe` = "0"
      ) %>%
      fct_relevel("Severe"),
    N = 
      N %>%
      factor() %>%
      fct_recode(
        Near = "1",
        Far = "0"
      ) %>%
      fct_relevel("Near")
  ) %>%
  
  # Remove the enrollment indicator
  select(
    -E
  )

The problem

Suppose we reach the desired enrollment size and we’re given a data set (called patients) to do some preliminary analysis on the sample. Here are the first 5 rows:

Code

print(patients, n = 5)

# A tibble: 1,000 × 2
  S          N    
  <fct>      <fct>
1 Non-severe Near 
2 Severe     Far  
3 Severe     Near 
4 Non-severe Near 
5 Non-severe Near 
# ℹ 995 more rows

We have all 1000 patients represented in the data with two variables collected so far: disease severity (S) and distance from the trial center (N). Let’s tabulate these columns:

Code

patients %>%
  
  # Count the rows
  group_by(N, S) %>%
  summarise(
    Count = n(),
    .groups = "drop"
  ) %>%
  
  # Send over the columns
  pivot_wider(
    names_from = N,
    values_from = Count
  ) %>%
  
  # Rename
  rename(
    `Disease severity` = S
  ) %>%
  
  # Make a kable
  knitr::kable(
    format = "html",
    caption = "Tabulation of disease severity and distance from trial center for 1000 sampled patients"
  ) %>%
  kableExtra::kable_styling(
    full_width = FALSE
  ) %>%
  kableExtra::add_header_above(c("", "Distance from trial center" = 2))

Tabulation of disease severity and distance from trial center for 1000 sampled patients
	Distance from trial center
Disease severity	Near	Far
Severe	221	82
Non-severe	650	47

There is something strange: Of the patients who live near the trial center, 25.4% have severe disease (which is expected), but of those living far from the trial center, 63.6% have severe disease. We know that only 25% of patients have severe disease regardless of where they live. Is this possible? Did this just happen by chance?

Code

prop.test(sum(patients$S == "Severe" & patients$N == "Far"), sum(patients$N == "Far"), p = 0.25)$p.value

[1] 1.321956e-23

Well, the p-value isn’t 0 (it will never be), so it is possible. However, there is probably something else going on. What else could be causing this discrepancy?

An explanation

If we think about the patients randomly selected and offered enrollment versus those ultimately deciding to enroll, a reasonable assumption might be that those with less-severe disease, especially those who also live far away, may be more reluctant to join. This is exactly what happened.

It turns out that the following are also true about this population with respect to their likelihood to join the trial (denoted J):

85% of patients living near the trial center will join if asked regardless of disease severity
Of those living far from the trial center, 50% will join if they have severe disease and only 10% will join if they don’t.

In probability notation:

\[P(J|N) = 0.85 \hskip{.25in} P(J|\bar{N} \cap S) = 0.50 \hskip{.25in} P(J|\bar{N} \cap \bar{S}) = 0.10\] Even though every patient had the same opportunity to enroll in the trial, our realized sample became much more heavily weighted toward those living near the trial center, and much less weighted toward those living far from the trial center without severe disease. Thus, it conditioned on the patients that enrolled. As a result, a correlation between disease severity and the distance from the trial center was induced that isn’t there in the general population. The figure below displays the causal diagram for this relationship.

flowchart LR
  A[Distance from center] --> C[Decision to join trial]
  B[Disease severity] --> C

Does the math show it?

We know that 25% of patients in the general population have severe disease regardless of where they live. How does this probability change for patients who live far from the trial center when we condition on only those enrolled?

\[ \begin{equation} \begin{split} P(S|\bar{N} \cap J) & = \frac{P(S \cap \bar{N} \cap J)}{P(\bar{N} \cap J)} \\ & = \frac{P(S \cap \bar{N} \cap J)}{P(S \cap \bar{N} \cap J) + P(\bar{S} \cap \bar{N} \cap J)} \\ & = \frac{P(J|\bar{N} \cap S)P(\bar{N} \cap S)}{P(J|\bar{N} \cap S)P(\bar{N} \cap S) + P(J|\bar{N} \cap \bar{S})P(\bar{N} \cap \bar{S})} \\ & = \frac{P(J|\bar{N} \cap S)P(S|\bar{N})P(\bar{N})}{P(J|\bar{N} \cap S)P(S|\bar{N})P(\bar{N}) + P(J|\bar{N} \cap \bar{S})P(\bar{S}|\bar{N})P(\bar{N})} \\ & = \frac{P(J|\bar{N} \cap S)P(S|\bar{N})}{P(J|\bar{N} \cap S)P(S|\bar{N})+ P(J|\bar{N} \cap \bar{S})P(\bar{S}|\bar{N})} \\ & = \frac{0.50 \times 0.25}{0.50 \times 0.25 + 0.10 \times (1 - 0.25)} \\ & = 0.625 \end{split} \end{equation} \] Ah, much better. Recall that in our sample 63.6% of patients who lived far from the trial center had severe disease. This is much closer to the true probability that we’d expect after conditioning on the correct factors.

Intuitively, this result makes sense. If patients who live far away and have less-severe disease are the most unlikely to join the trial, then a patient who is enrolled and lives far away is more likely to have severe disease.

Conclusion

Collider bias is a very interesting and important concept to consider when conducting a statistical analysis. In the examples throughout this article, we had the luxury of knowing the true probabilities of various events so we could easily identify and reconcile problems in the estimates. In practice, these are generally unknown quantities which makes this problem much more subtle and unapparent when relying solely on the data. Thus illustrating the importance of incorporating prior knowledge and subject-matter expertise into the modeling process.

--- title: "A look at collider bias" author: "Alex Zajichek" date: "7/15/2021" image: "feature.png" categories: - Causal Inference format: html: code-fold: true code-tools: true --- There has been a lot of buzz around ["causal inference"](https://en.wikipedia.org/wiki/Causal_inference) and given how fundamental the name seems to statistics, I picked up [_The Book of Why: The New Science of Cause and Effect_](http://bayes.cs.ucla.edu/WHY/) as a starting point. It's been a great resource for introducing causal concepts and thinking about the importance of bringing subjectivity into the modeling process. This article looks into one of the concepts introduced: [_collider bias_](https://catalogofbias.org/biases/collider-bias/). All code snippets are written in [`R`](https://www.r-project.org/). # What is collider bias? In causal inference, the relationships between a network of events are represented in [_causal diagrams_](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7064555/). In the causal diagram below, event __A__ causes event __B__ which causes event __C__. Here the term "event" refers to a random variable. ```{mermaid} flowchart LR A --> B B --> C ``` The direction in which arrows point into the events (forming a _junction_) have major implications as to how the model behaves. One junction in particular, which is the focus of this article, is called a [_collider_](https://catalogofbias.org/biases/collider-bias/). This is shown in the causal diagram below. ```{mermaid} flowchart LR A --> C B --> C ``` A collider occurs when an event is independently caused by two other events. In the example, event __C__ is caused by events __A__ and __B__. Now, there is no causal relationship between __A__ and __B__ (since there are no arrows between the two). However, if the model were conditioned on __C__, a correlation between __A__ and __B__ would be induced that does not exist in the unconditional distribution. This is known as [_collider bias_](https://catalogofbias.org/biases/collider-bias/). A basic example that [the book](http://bayes.cs.ucla.edu/WHY/) gives is to repeatedly flip two fair coins simultaneously but only keep samples where at least one of the coins shows heads. Let's do 10,000 flips and tabulate the results: ```{r, message = FALSE} # Load some packages require(tidyverse) # Set the seed set.seed(123) # Number of coin flips n <- 10000 # Generate random binomials coin1 <- rbinom(n = n, size = 1, p = 0.5) coin2 <- rbinom(n = n, size = 1, p = 0.5) # Make a tibble tibble( `Coin 1` = coin1, `Coin 2` = coin2 ) %>% # Convert to factors mutate_all( ~ .x %>% factor() %>% fct_recode( Heads = "1", Tails = "0" ) %>% fct_relevel( "Heads" ) ) %>% ### Filter to at least one heads filter( `Coin 1` == "Heads" | `Coin 2` == "Heads" ) %>% # Count the rows group_by_all() %>% summarise( N = n(), .groups = "drop" ) %>% # Send over columns pivot_wider( names_from = `Coin 2`, values_from = N, values_fill = 0 ) %>% # Make a kable knitr::kable( format = "html", caption = str_c("Tabulation of ", n, " simulataneously coin flips excluding cases when both were tails.") ) %>% kableExtra::kable_styling( full_width = FALSE ) %>% kableExtra::add_header_above( c("", "Coin 2" = 2) ) ``` Now the question: Is the probability of a heads for one coin the same regardless of what the other coin shows? From our data, $$P(\text{Coin 1 = Heads | Coin 2 = Heads}) = \frac{2403}{2403 + 2499} \approx 0.49$$ $$P(\text{Coin 1 = Heads | Coin 2 = Tails}) = \frac{2540}{2540 + 0} = 1$$ These results suggest that the coins are not independent. When coin 2 is heads, coin 1 is fair. When coin 2 is tails, coin 1 is certain to be heads. However, we flipped the coins independently so we know there is not a relationship between them. This is an example of collider bias. Here is the causal diagram for this model: ```{mermaid} flowchart LR A[Coin 1] --> C[Keep trial result] B[Coin 2] --> C ``` The decision to keep the result depends on the values of both coins. Once we condition on the trials that we kept, a correlation is induced. # Example: selecting patients for a trial Let's go through a more practical example. _Note this is completely made up_. ## Key question **Do patients who live further from a trial center have more severe disease?** ## The setup There is a clinical trial that we would like to recruit patients for which will be held at a single location. The study population consists of patients who have a disease of interest and live within 90 miles of the trial center. Suppose the following are true: 1. 65% of patients live _near_ (within 20 miles of) the trial center (denoted __N__) 2. 25% of patients have _severe_ disease regardless how far they live from the trial center (denoted __S__) So we have... $$P(N) = 0.65 \hskip{.25in} P(S|N) = 0.25 \hskip{.25in} P(S|\bar{N}) = 0.25$$ Our sampling strategy will be to select patients at random from the entire population until 1000 are enrolled. ```{r} # Set a seed set.seed(321) # Base sample size base_size <- 100000 # Enrollment size n <- 1000 # Make a data frame patients <- tibble( S = rbinom(base_size, 1, .25), # Disease severity N = rbinom(base_size, 1, .65), # Distance from center ) %>% # Add enrollment flag based on mutate( E = # Determine enrollment probability case_when( N == 1 ~ 0.85, S == 1 ~ 0.50, TRUE ~ 0.10 ) %>% # Sample based on probability rbinom( n = base_size, size = 1 ) ) %>% # Filter to enrolled patients filter( E == 1 ) %>% # Sample the desired enrollment slice_sample( n = n ) %>% # Convert to factors mutate( S = S %>% factor() %>% fct_recode( Severe = "1", `Non-severe` = "0" ) %>% fct_relevel("Severe"), N = N %>% factor() %>% fct_recode( Near = "1", Far = "0" ) %>% fct_relevel("Near") ) %>% # Remove the enrollment indicator select( -E ) ``` ## The problem Suppose we reach the desired enrollment size and we're given a data set (called `patients`) to do some preliminary analysis on the sample. Here are the first 5 rows: ```{r} print(patients, n = 5) ``` We have all `r n` patients represented in the data with two variables collected so far: disease severity (__S__) and distance from the trial center (__N__). Let's tabulate these columns: ```{r} patients %>% # Count the rows group_by(N, S) %>% summarise( Count = n(), .groups = "drop" ) %>% # Send over the columns pivot_wider( names_from = N, values_from = Count ) %>% # Rename rename( `Disease severity` = S ) %>% # Make a kable knitr::kable( format = "html", caption = "Tabulation of disease severity and distance from trial center for 1000 sampled patients" ) %>% kableExtra::kable_styling( full_width = FALSE ) %>% kableExtra::add_header_above(c("", "Distance from trial center" = 2)) ``` There is something strange: Of the patients who live near the trial center, `r filter(patients, N == "Near") %>% with(., round(mean(S == "Severe") * 100, 1))`% have severe disease (which is expected), but of those living far from the trial center, `r filter(patients, N == "Far") %>% with(., round(mean(S == "Severe") * 100, 1))`% have severe disease. We know that only 25% of patients have severe disease regardless of where they live. Is this possible? Did this just happen by chance? ```{r} prop.test(sum(patients$S == "Severe" & patients$N == "Far"), sum(patients$N == "Far"), p = 0.25)$p.value ``` Well, the p-value isn't 0 (it will never be), so it _is_ possible. However, there is probably something else going on. What else could be causing this discrepancy? ## An explanation If we think about the patients randomly selected and offered enrollment versus those ultimately deciding to enroll, a reasonable assumption might be that those with less-severe disease, especially those who also live far away, may be more reluctant to join. This is exactly what happened. It turns out that the following are also true about this population with respect to their likelihood to _join_ the trial (denoted __J__): 1. 85% of patients living near the trial center will join if asked regardless of disease severity 2. Of those living far from the trial center, 50% will join if they have severe disease and only 10% will join if they don't. In probability notation: $$P(J|N) = 0.85 \hskip{.25in} P(J|\bar{N} \cap S) = 0.50 \hskip{.25in} P(J|\bar{N} \cap \bar{S}) = 0.10$$ Even though every patient had the same opportunity to enroll in the trial, our realized sample became much more heavily weighted toward those living near the trial center, and much less weighted toward those living far from the trial center without severe disease. Thus, it conditioned on the patients that enrolled. As a result, a correlation between disease severity and the distance from the trial center was induced that isn't there in the general population. The figure below displays the causal diagram for this relationship. ```{mermaid} flowchart LR A[Distance from center] --> C[Decision to join trial] B[Disease severity] --> C ``` ### Does the math show it? We know that 25% of patients in the general population have severe disease regardless of where they live. How does this probability change for patients who live far from the trial center when we condition on only those enrolled? $$ \begin{equation} \begin{split} P(S|\bar{N} \cap J) & = \frac{P(S \cap \bar{N} \cap J)}{P(\bar{N} \cap J)} \\ & = \frac{P(S \cap \bar{N} \cap J)}{P(S \cap \bar{N} \cap J) + P(\bar{S} \cap \bar{N} \cap J)} \\ & = \frac{P(J|\bar{N} \cap S)P(\bar{N} \cap S)}{P(J|\bar{N} \cap S)P(\bar{N} \cap S) + P(J|\bar{N} \cap \bar{S})P(\bar{N} \cap \bar{S})} \\ & = \frac{P(J|\bar{N} \cap S)P(S|\bar{N})P(\bar{N})}{P(J|\bar{N} \cap S)P(S|\bar{N})P(\bar{N}) + P(J|\bar{N} \cap \bar{S})P(\bar{S}|\bar{N})P(\bar{N})} \\ & = \frac{P(J|\bar{N} \cap S)P(S|\bar{N})}{P(J|\bar{N} \cap S)P(S|\bar{N})+ P(J|\bar{N} \cap \bar{S})P(\bar{S}|\bar{N})} \\ & = \frac{0.50 \times 0.25}{0.50 \times 0.25 + 0.10 \times (1 - 0.25)} \\ & = 0.625 \end{split} \end{equation} $$ Ah, much better. Recall that in our sample `r filter(patients, N == "Far") %>% with(., round(mean(S == "Severe") * 100, 1))`% of patients who lived far from the trial center had severe disease. This is much closer to the true probability that we'd expect after conditioning on the correct factors. Intuitively, this result makes sense. If patients who live far away and have less-severe disease are the most unlikely to join the trial, then a patient who is enrolled and lives far away is more likely to have severe disease. # Conclusion Collider bias is a very interesting and important concept to consider when conducting a statistical analysis. In the examples throughout this article, we had the luxury of knowing the true probabilities of various events so we could easily identify and reconcile problems in the estimates. In practice, these are generally unknown quantities which makes this problem much more subtle and unapparent when relying solely on the data. Thus illustrating the importance of incorporating prior knowledge and subject-matter expertise into the modeling process.