A look at collider bias
There has been a lot of buzz around “causal inference” and given how fundamental the name seems to statistics, I picked up The Book of Why: The New Science of Cause and Effect as a starting point. It’s been a great resource for introducing causal concepts and thinking about the importance of bringing subjectivity into the modeling process. This article looks into one of the concepts introduced: collider bias. All code snippets are written in R
.
What is collider bias?
In causal inference, the relationships between a network of events are represented in causal diagrams. In the causal diagram below, event A causes event B which causes event C. Here the term “event” refers to a random variable.
The direction in which arrows point into the events (forming a junction) have major implications as to how the model behaves. One junction in particular, which is the focus of this article, is called a collider. This is shown in the causal diagram below.
A collider occurs when an event is independently caused by two other events. In the example, event C is caused by events A and B.
Now, there is no causal relationship between A and B (since there are no arrows between the two). However, if the model were conditioned on C, a correlation between A and B would be induced that does not exist in the unconditional distribution. This is known as collider bias.
A basic example that the book gives is to repeatedly flip two fair coins simultaneously but only keep samples where at least one of the coins shows heads. Let’s do 10,000 flips and tabulate the results:
Code
# Load some packages
require(tidyverse)
# Set the seed
set.seed(123)
# Number of coin flips
<- 10000
n
# Generate random binomials
<- rbinom(n = n, size = 1, p = 0.5)
coin1 <- rbinom(n = n, size = 1, p = 0.5)
coin2
# Make a tibble
tibble(
`Coin 1` = coin1,
`Coin 2` = coin2
%>%
)
# Convert to factors
mutate_all(
~
%>%
.x factor() %>%
fct_recode(
Heads = "1",
Tails = "0"
%>%
) fct_relevel(
"Heads"
)%>%
)
### Filter to at least one heads
filter(
`Coin 1` == "Heads" |
`Coin 2` == "Heads"
%>%
)
# Count the rows
group_by_all() %>%
summarise(
N = n(),
.groups = "drop"
%>%
)
# Send over columns
pivot_wider(
names_from = `Coin 2`,
values_from = N,
values_fill = 0
%>%
)
# Make a kable
::kable(
knitrformat = "html",
caption = str_c("Tabulation of ", n, " simulataneously coin flips excluding cases when both were tails.")
%>%
) ::kable_styling(
kableExtrafull_width = FALSE
%>%
) ::add_header_above(
kableExtrac("", "Coin 2" = 2)
)
Coin 1 | Heads | Tails |
---|---|---|
Heads | 2403 | 2540 |
Tails | 2499 | 0 |
Now the question: Is the probability of a heads for one coin the same regardless of what the other coin shows? From our data,
\[P(\text{Coin 1 = Heads | Coin 2 = Heads}) = \frac{2403}{2403 + 2499} \approx 0.49\]
\[P(\text{Coin 1 = Heads | Coin 2 = Tails}) = \frac{2540}{2540 + 0} = 1\]
These results suggest that the coins are not independent. When coin 2 is heads, coin 1 is fair. When coin 2 is tails, coin 1 is certain to be heads. However, we flipped the coins independently so we know there is not a relationship between them. This is an example of collider bias. Here is the causal diagram for this model:
The decision to keep the result depends on the values of both coins. Once we condition on the trials that we kept, a correlation is induced.
Example: selecting patients for a trial
Let’s go through a more practical example. Note this is completely made up.
Key question
Do patients who live further from a trial center have more severe disease?
The setup
There is a clinical trial that we would like to recruit patients for which will be held at a single location. The study population consists of patients who have a disease of interest and live within 90 miles of the trial center. Suppose the following are true:
- 65% of patients live near (within 20 miles of) the trial center (denoted N)
- 25% of patients have severe disease regardless how far they live from the trial center (denoted S)
So we have…
\[P(N) = 0.65 \hskip{.25in} P(S|N) = 0.25 \hskip{.25in} P(S|\bar{N}) = 0.25\] Our sampling strategy will be to select patients at random from the entire population until 1000 are enrolled.
Code
# Set a seed
set.seed(321)
# Base sample size
<- 100000
base_size
# Enrollment size
<- 1000
n
# Make a data frame
<-
patients tibble(
S = rbinom(base_size, 1, .25), # Disease severity
N = rbinom(base_size, 1, .65), # Distance from center
%>%
)
# Add enrollment flag based on
mutate(
E =
# Determine enrollment probability
case_when(
== 1 ~ 0.85,
N == 1 ~ 0.50,
S TRUE ~ 0.10
%>%
)
# Sample based on probability
rbinom(
n = base_size,
size = 1
)%>%
)
# Filter to enrolled patients
filter(
== 1
E %>%
)
# Sample the desired enrollment
slice_sample(
n = n
%>%
)
# Convert to factors
mutate(
S =
%>%
S factor() %>%
fct_recode(
Severe = "1",
`Non-severe` = "0"
%>%
) fct_relevel("Severe"),
N =
%>%
N factor() %>%
fct_recode(
Near = "1",
Far = "0"
%>%
) fct_relevel("Near")
%>%
)
# Remove the enrollment indicator
select(
-E
)
The problem
Suppose we reach the desired enrollment size and we’re given a data set (called patients
) to do some preliminary analysis on the sample. Here are the first 5 rows:
Code
print(patients, n = 5)
# A tibble: 1,000 × 2
S N
<fct> <fct>
1 Non-severe Near
2 Severe Far
3 Severe Near
4 Non-severe Near
5 Non-severe Near
# ℹ 995 more rows
We have all 1000 patients represented in the data with two variables collected so far: disease severity (S) and distance from the trial center (N). Let’s tabulate these columns:
Code
%>%
patients
# Count the rows
group_by(N, S) %>%
summarise(
Count = n(),
.groups = "drop"
%>%
)
# Send over the columns
pivot_wider(
names_from = N,
values_from = Count
%>%
)
# Rename
rename(
`Disease severity` = S
%>%
)
# Make a kable
::kable(
knitrformat = "html",
caption = "Tabulation of disease severity and distance from trial center for 1000 sampled patients"
%>%
) ::kable_styling(
kableExtrafull_width = FALSE
%>%
) ::add_header_above(c("", "Distance from trial center" = 2)) kableExtra
Disease severity | Near | Far |
---|---|---|
Severe | 221 | 82 |
Non-severe | 650 | 47 |
There is something strange: Of the patients who live near the trial center, 25.4% have severe disease (which is expected), but of those living far from the trial center, 63.6% have severe disease. We know that only 25% of patients have severe disease regardless of where they live. Is this possible? Did this just happen by chance?
Code
prop.test(sum(patients$S == "Severe" & patients$N == "Far"), sum(patients$N == "Far"), p = 0.25)$p.value
[1] 1.321956e-23
Well, the p-value isn’t 0 (it will never be), so it is possible. However, there is probably something else going on. What else could be causing this discrepancy?
An explanation
If we think about the patients randomly selected and offered enrollment versus those ultimately deciding to enroll, a reasonable assumption might be that those with less-severe disease, especially those who also live far away, may be more reluctant to join. This is exactly what happened.
It turns out that the following are also true about this population with respect to their likelihood to join the trial (denoted J):
- 85% of patients living near the trial center will join if asked regardless of disease severity
- Of those living far from the trial center, 50% will join if they have severe disease and only 10% will join if they don’t.
In probability notation:
\[P(J|N) = 0.85 \hskip{.25in} P(J|\bar{N} \cap S) = 0.50 \hskip{.25in} P(J|\bar{N} \cap \bar{S}) = 0.10\] Even though every patient had the same opportunity to enroll in the trial, our realized sample became much more heavily weighted toward those living near the trial center, and much less weighted toward those living far from the trial center without severe disease. Thus, it conditioned on the patients that enrolled. As a result, a correlation between disease severity and the distance from the trial center was induced that isn’t there in the general population. The figure below displays the causal diagram for this relationship.
Does the math show it?
We know that 25% of patients in the general population have severe disease regardless of where they live. How does this probability change for patients who live far from the trial center when we condition on only those enrolled?
\[ \begin{equation} \begin{split} P(S|\bar{N} \cap J) & = \frac{P(S \cap \bar{N} \cap J)}{P(\bar{N} \cap J)} \\ & = \frac{P(S \cap \bar{N} \cap J)}{P(S \cap \bar{N} \cap J) + P(\bar{S} \cap \bar{N} \cap J)} \\ & = \frac{P(J|\bar{N} \cap S)P(\bar{N} \cap S)}{P(J|\bar{N} \cap S)P(\bar{N} \cap S) + P(J|\bar{N} \cap \bar{S})P(\bar{N} \cap \bar{S})} \\ & = \frac{P(J|\bar{N} \cap S)P(S|\bar{N})P(\bar{N})}{P(J|\bar{N} \cap S)P(S|\bar{N})P(\bar{N}) + P(J|\bar{N} \cap \bar{S})P(\bar{S}|\bar{N})P(\bar{N})} \\ & = \frac{P(J|\bar{N} \cap S)P(S|\bar{N})}{P(J|\bar{N} \cap S)P(S|\bar{N})+ P(J|\bar{N} \cap \bar{S})P(\bar{S}|\bar{N})} \\ & = \frac{0.50 \times 0.25}{0.50 \times 0.25 + 0.10 \times (1 - 0.25)} \\ & = 0.625 \end{split} \end{equation} \] Ah, much better. Recall that in our sample 63.6% of patients who lived far from the trial center had severe disease. This is much closer to the true probability that we’d expect after conditioning on the correct factors.
Intuitively, this result makes sense. If patients who live far away and have less-severe disease are the most unlikely to join the trial, then a patient who is enrolled and lives far away is more likely to have severe disease.
Conclusion
Collider bias is a very interesting and important concept to consider when conducting a statistical analysis. In the examples throughout this article, we had the luxury of knowing the true probabilities of various events so we could easily identify and reconcile problems in the estimates. In practice, these are generally unknown quantities which makes this problem much more subtle and unapparent when relying solely on the data. Thus illustrating the importance of incorporating prior knowledge and subject-matter expertise into the modeling process.