The overlap weight in survival analysis

A simulation review

I was recently introduced to overlap weighting, which is part of a general family of methods for balancing covariates when estimating treatment effects with observational data. Specifically, it focuses on the clinical equipoise; that is, the patients in which the treatment decision is most uncertain. I find it more elegant than matching (which I’ve used in the past), and figured a useful way to better understand the approach is to deconstruct, interpret, and translate a SAS simulation in the context of survival analysis implemented by the original authors. All code is written in (and translated to) R. We’ll start by loading some packages.

library(tidyverse)
library(survival)
library(reactable)

Setting up the potential outcomes

The theoretical underpinnings of overlap weighting live in the context of the potential outcomes paradigm of causal inference. Basically, we wonder what would have happened had we been able to observe each patient under both treatments (known as the counterfactual). If we knew the outcomes from the two worlds, the causal treatment effect would simply be the average difference across all patients. The problem of course is that in reality the outcome can only be observed for the treatment in which the patient was assigned (or chose), and so our goal is to work with the “incomplete” information that we do have to try to estimate what the outcome difference would have been had the counterfactual been observed as well.

Simulate some patients

The first thing we need to do is generate some (fake) patients to facilitate the simulation. Here we’ll focus on age, sex, and income as patient-identifying characteristics.

# Set the sample size
n <- 10000

# Set the seed
set.seed(123)

# Build the population
population <-
  tibble(
    age = rnorm(n, mean = 50, sd = 10),
    zage = (age - 50) / 10,
    male = rbinom(n, size = 1, prob = 0.6),
    income = rnorm(n, mean = 50, sd = 10),
    zincome = (income - 50) / 10
  )
population
## # A tibble: 10,000 × 5
##      age    zage  male income zincome
##    <dbl>   <dbl> <int>  <dbl>   <dbl>
##  1  44.4 -0.560      0   36.5  -1.35 
##  2  47.7 -0.230      1   44.2  -0.579
##  3  65.6  1.56       1   41.4  -0.861
##  4  50.7  0.0705     1   59.7   0.973
##  5  51.3  0.129      0   56.2   0.619
##  6  67.2  1.72       1   63.9   1.39 
##  7  54.6  0.461      1   35.1  -1.49 
##  8  37.3 -1.27       1   56.4   0.639
##  9  43.1 -0.687      1   53.7   0.375
## 10  45.5 -0.446      1   53.7   0.370
## # … with 9,990 more rows

We’ve generated a sample of 10000 patients. We’ll assume age is measured in years, income in thousands of dollars ($), and male is 1 for Male and 0 for Female. The columns zage and zincome are just standardized versions of the originals to avoid scaling nuances (we’ll refer to these as \(age_z\) and \(income_z\)).

Define the treatment propensity

Here’s where the important theory starts to creep in (already). We assume that there is a true propensity score, say \(p_i^A\), that is the true probability that patient i receives treatment A given their specific characteristics (and let’s assume there are possible treatments A & B). Further, we assume (some transformation of) this probability is a linear combination of all the characteristics that confound the crude treatment effect on the outcome. In this simulation, we’ll assume the logit model:

\[log(\frac{p_i^A}{1 - p_i^A}) = 0.41 \times age_z - 0.22 \times male - 0.69 \times income_z - 0.40\]

So let’s add this true propensity score to the population data set:

population <-
  population |>
  
  # Add the true propensity score (unknown quantity)
  mutate(
    log_odds_pA = -0.4 + log(1.5)*zage + log(0.8)*male + log(.50)*zincome,
    pA = 1 / (1 + exp(-log_odds_pA))
  )
population
## # A tibble: 10,000 × 7
##      age    zage  male income zincome log_odds_pA    pA
##    <dbl>   <dbl> <int>  <dbl>   <dbl>       <dbl> <dbl>
##  1  44.4 -0.560      0   36.5  -1.35        0.311 0.577
##  2  47.7 -0.230      1   44.2  -0.579      -0.315 0.422
##  3  65.6  1.56       1   41.4  -0.861       0.606 0.647
##  4  50.7  0.0705     1   59.7   0.973      -1.27  0.219
##  5  51.3  0.129      0   56.2   0.619      -0.777 0.315
##  6  67.2  1.72       1   63.9   1.39       -0.888 0.292
##  7  54.6  0.461      1   35.1  -1.49        0.595 0.644
##  8  37.3 -1.27       1   56.4   0.639      -1.58  0.171
##  9  43.1 -0.687      1   53.7   0.375      -1.16  0.238
## 10  45.5 -0.446      1   53.7   0.370      -1.06  0.257
## # … with 9,990 more rows

Obviously in practice we don’t know what \(p_i^A\) is, so it must be estimated from the treatment assignments we observe in the data. Additionally, we aren’t strictly required to assume the logit model, but when we use it to estimate the propensity score for overlap weighting, it has the advantageous property of perfect balance. That is, the weighted-mean differences for all covariates included in the logistic regression model will be zero across the treatment groups.

Generate the treatment assignment

The treatment assignment is one of the known quantities we would observe in a real life sample, and that is what would be used as the dependent variable for estimating propensity scores. However, since we are running a simulation, we need to generate the treatment assignments from the governing distribution. We assume that the observed treatment for patient i is the result of an (unfair) coin flip that has a probability equal to the true propensity score. In statistical notation,

\[A_i \sim Bernoulli(p_i^A)\]

where \(A_i\) is the indicator of treatment A. Let’s add a realized treatment assignment to the population:

population <-
  population |>
  
  # Add the realized treatment assignment (known quantity)
  mutate(
    A = rbinom(n, size = 1, prob = pA),
  )
population
## # A tibble: 10,000 × 8
##      age    zage  male income zincome log_odds_pA    pA     A
##    <dbl>   <dbl> <int>  <dbl>   <dbl>       <dbl> <dbl> <int>
##  1  44.4 -0.560      0   36.5  -1.35        0.311 0.577     1
##  2  47.7 -0.230      1   44.2  -0.579      -0.315 0.422     0
##  3  65.6  1.56       1   41.4  -0.861       0.606 0.647     1
##  4  50.7  0.0705     1   59.7   0.973      -1.27  0.219     0
##  5  51.3  0.129      0   56.2   0.619      -0.777 0.315     1
##  6  67.2  1.72       1   63.9   1.39       -0.888 0.292     1
##  7  54.6  0.461      1   35.1  -1.49        0.595 0.644     0
##  8  37.3 -1.27       1   56.4   0.639      -1.58  0.171     0
##  9  43.1 -0.687      1   53.7   0.375      -1.16  0.238     0
## 10  45.5 -0.446      1   53.7   0.370      -1.06  0.257     0
## # … with 9,990 more rows

In this case, A=1 represents treatment A and A=0 represents treatment B. Again, this is the actual treatment we would observe the patient receiving in a sample.

Compute the (true) overlap weight

We need to define what the overlap weight is, and it’s actually quite simple: just assign the patient the probability they did not receive their observed treatment.

\[ \begin{equation} OW_i= \begin{cases} 1-p_i^A & \text{if } A_i=1\\ p_i^A & \text{if } A_i=0\\ \end{cases} \end{equation} \]

Notice the notation. Since it depends on the true propensity score, the overlap weight is another unknown quantity that is estimated from the data. It also depends on the realized treatment assignment, so the collection of weights across patients differ depending on the observed treatment distribution. We can add these weights to the population data set:

population <-
  population |>
  
  # Add the true overlap weight for the realized treatment (unknown quantity)
  mutate(
    OW = A * (1-pA) + (1-A) * pA
  )
population
## # A tibble: 10,000 × 9
##      age    zage  male income zincome log_odds_pA    pA     A    OW
##    <dbl>   <dbl> <int>  <dbl>   <dbl>       <dbl> <dbl> <int> <dbl>
##  1  44.4 -0.560      0   36.5  -1.35        0.311 0.577     1 0.423
##  2  47.7 -0.230      1   44.2  -0.579      -0.315 0.422     0 0.422
##  3  65.6  1.56       1   41.4  -0.861       0.606 0.647     1 0.353
##  4  50.7  0.0705     1   59.7   0.973      -1.27  0.219     0 0.219
##  5  51.3  0.129      0   56.2   0.619      -0.777 0.315     1 0.685
##  6  67.2  1.72       1   63.9   1.39       -0.888 0.292     1 0.708
##  7  54.6  0.461      1   35.1  -1.49        0.595 0.644     0 0.644
##  8  37.3 -1.27       1   56.4   0.639      -1.58  0.171     0 0.171
##  9  43.1 -0.687      1   53.7   0.375      -1.16  0.238     0 0.238
## 10  45.5 -0.446      1   53.7   0.370      -1.06  0.257     0 0.257
## # … with 9,990 more rows

In a real analysis, these weights are what we are ultimately after in order to estimate the average causal treatment effect on the outcome of interest while balancing differences in patient characteristics across the groups (i.e., removing the confounding factors).

Target population

Now we do end up normalizing the weights so they sum to one within each treatment group. But the intuition about what’s happening is that the focus of the treatment effect estimation is shifted to patients that are most likely in either treatment group. This is known as the clinical equipoise, and the resulting treatment effect is interpreted as the average treatment effect in the overlap population.

This makes a lot of sense. When we’re comparing treatments, we should focus most heavily on patients that could be candidates for either, and less so on patients that were bound for one. Thus, we up-weight patients who have the most overlap in characteristics with the opposing treatment group. The beautiful thing here is that we do this without throwing away any information; smoothly and proportionately weighting each subject just the amount that we should.

Set the treatment effect

We’ve generated the treatments and established how they are related to patient characteristics, but haven’t talked about the outcome in which we’re ultimately interested in estimating the treatment effect for. Since we’re focusing on time-to-event outcomes, we’ll stay in that context, but the general idea is that we assume there exists a realization of what a patient’s outcome would have been under each treatment scenario. Then if we compare the difference of those outcomes across all patients, the average difference must be caused by the treatment.

Defining the event times

In this simulation, we’ll generate event times from a Weibull distribution. Starting at treatment initiation, this might be the time until cancer recurrence, hospitalization, or something else; we’re just looking to see how long it takes for some event to occur. The PDF for this distribution looks like this:

\[f(t) = \frac{\alpha}{\sigma}\left(\frac{t}{\sigma}\right)^{\alpha-1}e^{-\left(\frac{t}{\sigma}\right)^\alpha}\]

where \(t\) is the event time, \(\alpha\) is the shape parameter, and \(\sigma\) is the scale parameter. In our example, we will say that:

\[T_i^A \sim Weibull(\alpha, \sigma_i^A)\] \[T_i^B \sim Weibull(\alpha, \sigma_i^B)\] where \(T_i^A\) and \(T_i^B\) are the event times under treatments A and B, respectively. That is, the event times for each patient under each treatment follow a Weibull distribution with a common shape parameter (in this case, across both treatments), but a scale parameter that depends on the patient’s specific characteristics (in log-linear form) and differs by treatment, which captures the treatment effect. Specifically,

\[\alpha = 1\] \[\sigma_i^A = \lambda \times e^{-(\phi_i - 0.36)}\] \[\sigma_i^B = \lambda \times e^{-\phi_i}\] \[\phi_i = 1.10 \times age_z - 0.22 \times male - 0.36 \times income_z\] \[\lambda = 4055.56\]

Basically, the \(\phi_i\) term does the baseline adjustment on the outcome risk for the patient’s specific characteristics, and the treatment effect is simply a fixed, additive deviation from that for all patients. The \(\lambda\) term is the baseline hazard function, providing the scale parameter when all covariate values are zero (for treatment B), which in this case is constant over time.

Interpreting the treatment effect

When setting \(\alpha=1\), the mean of a Weibull distribution is equal to its scale parameter. That is,

\[E[T_i^A] = \sigma_i^A\] \[E[T_i^B] = \sigma_i^B\] This gives us a very nice interpretation of the treatment effect. If we compare them relatively, we get:

\[ \begin{equation} \begin{split} \frac{E[T_i^A]}{E[T_i^B]} & = \frac{\sigma_i^A}{\sigma_i^B} \\ & = \frac{\lambda \times e^{-(\phi_i - 0.36)}}{\lambda \times e^{-\phi_i}} \\ & = \frac{e^{-(\phi_i - 0.36)}}{e^{-\phi_i}} \\ & = \frac{e^{0.36}e^{-\phi_i}}{e^{-\phi_i}} \\ & = e^{0.36} \\ & \approx 1.43 \end{split} \end{equation} \]

So we can say that the time to event for patients on treatment A is 1.43 times, or 43%, longer on average than those on treatment B, given a fixed age, sex, and income. That is, treatment A is “better” than treatment B.

Sampling the event times

Next, we need to actually sample the times for our population (note that it was confirmed that SAS and R have the same parameterizations):

population <-
  population |>
  
  # Sample outcome event times under each treatment (counterfactual)
  mutate(
    
    # Baseline hazard (constant over time)
    lambda = 1*365/0.09,
    
    # Define TRUE Weibull linear predictor
    phi = log(3) * zage + log(0.8) * male + log(.70) * zincome,
    sigma_A = lambda * exp(-(phi + log(0.70))),
    sigma_B = lambda * exp(-phi),
    
    # Generate REALIZED survival times under each treatment scenario (same parameterizations in SAS)
    t_A = rweibull(n, shape = 1, scale = sigma_A),
    t_B = rweibull(n, shape = 1, scale = sigma_B), 
  )
population |> select(lambda, phi, sigma_A, sigma_B, t_A, t_B)
## # A tibble: 10,000 × 6
##    lambda     phi sigma_A sigma_B    t_A   t_B
##     <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl>
##  1  4056. -0.133    6617.   4632.  5689. 7760.
##  2  4056. -0.269    7585.   5309.  1831. 9482.
##  3  4056.  1.80      961.    673.   488.  300.
##  4  4056. -0.493    9482.   6637. 23722.  255.
##  5  4056. -0.0788   6269.   4388.  7649. 1800.
##  6  4056.  1.17     1804.   1263.  1953. 1493.
##  7  4056.  0.814    2568.   1797.  5462. 2652.
##  8  4056. -1.84    36514.  25560. 38262. 2385.
##  9  4056. -1.11    17604.  12323.  3577. 6894.
## 10  4056. -0.845   13485.   9439. 38541. 3743.
## # … with 9,990 more rows

Now we can make some density plots comparing the event time distributions across treatments (we’ll use log scaling for visual appeal):

population |>
  
  # Send down the rows
  pivot_longer(
    cols = starts_with("t_"),
    names_to = "Treatment",
    values_to = "EventTime",
    names_prefix = "t_"
  ) |> 
  
  # Make a plot
  ggplot() + 
  geom_density(
    aes(
      x = log(EventTime),
      fill = Treatment
    ),
    alpha = .75
  ) + 
  scale_x_continuous(labels = function(x) round(exp(x))) +
  theme(
    panel.background = element_blank(),
    axis.ticks.y = element_blank(),
    axis.text.y = element_blank(),
    axis.title.y = element_blank(),
    legend.position = "top"
  ) +
  xlab("Event Time")

Notice the shift to the right in the distribution for treatment A. This is the treatment effect. In the hypothetical world where all patients received treatment A, the event times happened later than in the world where all patients received treatment B (and all event times were observed), indicating, on average, a “benefit” to treatment A, which was expected from our prior calculation.

Assign the observed outcome

The final step for simulation setup is to assign the event time as we might observe in a real data set. The event times generated in the previous section capture both treatment scenarios for all patients, but in reality we would only (potentially) observe the event time for the treatment the patient received. Additionally, we likely (or definitely) won’t be following all patients long enough to observe all events occur, meaning some patients will be censored. Let’s add the observed outcomes to the population:

population <-
  population |> 
  
  # Add the observed outcome
  mutate(
    
    # The ACTUAL event time outcome depends on the treatment ACTUALLY observed for the patient
    actual_event_time = t_B * (1 - A) + t_A * A,
    
    # Generate a REALIZED censoring time (completely random)
    censor_time = 500 + 500 * runif(n),
    
    # Calculate the OBSERVED time in the data set (known quantity)
    time = pmin(actual_event_time, censor_time), # They either had the event, or were censored first
    
    # Calculate the event status (TRUE if event observed, FALSE if censored)
    status = as.numeric(actual_event_time < censor_time)
  )
population |> select(actual_event_time, censor_time, time, status)
## # A tibble: 10,000 × 4
##    actual_event_time censor_time  time status
##                <dbl>       <dbl> <dbl>  <dbl>
##  1             5689.        843.  843.      0
##  2             9482.        814.  814.      0
##  3              488.        882.  488.      1
##  4              255.        936.  255.      1
##  5             7649.        653.  653.      0
##  6             1953.        890.  890.      0
##  7             2652.        737.  737.      0
##  8             2385.        703.  703.      0
##  9             6894.        952.  952.      0
## 10             3743.        526.  526.      0
## # … with 9,990 more rows

Our final simulated data set has the following observed outcome summaries:

population |>
  
  # Compute summary metrics
  summarize(
    Patients = n(),
    Time = mean(time),
    .by = c(A, status)
  ) |> 
  
  # Add shares within treatment
  mutate(
    Percent = Patients / sum(Patients),
    .by = A
  ) |>
  
  # Clean/rearrange
  transmute(
    Treatment = 
      case_when(
        A == 1 ~ "A",
        TRUE ~ "B"
      ),
    Status = 
      case_when(
        status == 1 ~ "Event",
        TRUE ~ "Censored"
      ),
    Patients,
    Percent,
    Time
  ) |>
  arrange(Treatment, desc(Status)) |>
  
  # Make a table
  reactable(
    groupBy = "Treatment",
    columns = 
      list(
        Patients = colDef(aggregate = "sum", align = "center"),
        Percent = colDef(name = "Percent of patients in treatment group (%)", aggregate = "sum", align = "center", format = colFormat(digits = 1, percent = TRUE)),
        Time = colDef(name = "Avg. time to outcome", aggregate = zildge::rectbl_agg_wtd("Patients"), align = "center", format = colFormat(digits = 1))
      ),
    striped = TRUE,
    highlight = TRUE,
    bordered = TRUE,
    resizable = TRUE,
    theme = reactablefmtr::sandstone()
  ) |>
  reactablefmtr::add_source("Use arrows to expand table", font_size = 12, font_style = "italic")

Use arrows to expand table


We can also look at the survival curves.

population |>
  
  # Nest by treatment
  group_by(A) |>
  nest() |>
  
  # Esimtate survival curves for each treatment
  mutate(
    Surv = 
      data |>
      map(
        function(.trt) {
          
          # Fit the model
          mod <- survfit(Surv(time, status) ~ 1, data = .trt)
          
          # Extract the elements
          tibble(
            Time = mod$time,
            Survival = mod$surv,
            Lower = mod$lower,
            Upper = mod$upper
          )
          
        }
      )
  ) |> 
  
  # Unnest the curves
  select(-data) |>
  unnest(cols = Surv) |>
  ungroup() |>

  # Clean labels
  mutate(
    Treatment = 
      case_when(
        A == 1 ~ "A",
        TRUE ~ "B"
      )
  ) |>
  
  # Make a plot
  ggplot(
    aes(
      x = Time,
      y = Survival
    )
  ) + 
  geom_line(aes(color = Treatment)) +
  geom_ribbon(
    aes(
      ymin = Lower,
      ymax = Upper,
      fill = Treatment
    ),
    alpha = .25
  ) +
  scale_y_continuous(labels = scales::percent) +
  theme(
    panel.background = element_blank(),
    panel.grid.major.y = element_line(color = "gray"),
    legend.position = "top"
  ) +
  xlab("Time since treatment") +
  ylab("Survival Probability")

If we weren’t doing a simulation, we wouldn’t know how much of these differences are explained by the treatment. And actually, these crude survival curves suggest longer survival from treatment B, even though we know the opposite is true. The goal is to use this known information (along with patient characteristics) to estimate the causal treatment effect that is baked into all of the unknown quantities we have here that would be unavailable to us in a real-life analysis.

Estimating the weights

Now that our simulation is set, we can start using the observed data (i.e., known quantities) to estimate the case-weights needed for treatment effect estimation. This is the process we’d go through in a real-life analysis, but here we have the benefit of knowing the truth, so we can see how close our estimations are to reality.

The intention of these weights is to adjust the observed sample to create a pseudo-population such that the patient characteristics that we believe are confounding the treatment effect are corrected for, or balanced, across the treatment groups. The term “pseudo” here is a little misleading. It’s simply referring to the fact that each patient will not contribute the same weight in estimating the treatment effect. In fact, the patients with the most overlap in characteristics across the treatment groups will contribute the most, with patients being proportionately down-weighted the less overlap they have (I’d argue this is the same thing that is done in matching, it’s just that the weights for patients are either exactly 1 or 0). This is done to tease out the portion of the outcome differences that is explained by the treatment, namely, the causal treatment effect.

Model the propensity scores

We’ve established that the overlap weights are quantities that need to be estimated from the data. In order to calculate those weights, we first need to estimate the propensity scores. We know what the true model is, but let’s assume we are only working with observable data:

population |> select(zage, male, zincome, A, time, status)
## # A tibble: 10,000 × 6
##       zage  male zincome     A  time status
##      <dbl> <int>   <dbl> <int> <dbl>  <dbl>
##  1 -0.560      0  -1.35      1  843.      0
##  2 -0.230      1  -0.579     0  814.      0
##  3  1.56       1  -0.861     1  488.      1
##  4  0.0705     1   0.973     0  255.      1
##  5  0.129      0   0.619     1  653.      0
##  6  1.72       1   1.39      1  890.      0
##  7  0.461      1  -1.49      0  737.      0
##  8 -1.27       1   0.639     0  703.      0
##  9 -0.687      1   0.375     0  952.      0
## 10 -0.446      1   0.370     0  526.      0
## # … with 9,990 more rows

Our goal is to obtain patient-specific probabilities of receiving treatment A. We’ll estimate this with a logistic regression model, but we’ll allow for some flexibility in the shape of the relationships between the continuous variables (age and income) and the outcome with the use of restricted cubic splines (see my other post for another explanation):

# Fit the propensity score model
ps_mod <-
  glm(
    formula = A ~ rms::rcs(zage, 3) + rms::rcs(zincome, 3) + factor(male),
    data = population,
    family = "binomial"
  )
summary(ps_mod)
## 
## Call:
## glm(formula = A ~ rms::rcs(zage, 3) + rms::rcs(zincome, 3) + 
##     factor(male), family = "binomial", data = population)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0098  -0.9636  -0.6644   1.1489   2.3504  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -0.30027    0.06527  -4.600 4.22e-06 ***
## rms::rcs(zage, 3)zage         0.40206    0.05099   7.885 3.16e-15 ***
## rms::rcs(zage, 3)zage'       -0.07337    0.05813  -1.262    0.207    
## rms::rcs(zincome, 3)zincome  -0.63792    0.04950 -12.888  < 2e-16 ***
## rms::rcs(zincome, 3)zincome' -0.04020    0.06434  -0.625    0.532    
## factor(male)1                -0.25432    0.04415  -5.760 8.41e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13359  on 9999  degrees of freedom
## Residual deviance: 12193  on 9994  degrees of freedom
## AIC: 12205
## 
## Number of Fisher Scoring iterations: 4

The model seems to indicate that the non-linear terms for age and income don’t particularly matter (which is expected), but we’ll leave it as-is. Next, let’s attach the estimated propensity scores to the population:

population <-
  population |>
  
  # Add the estimated propensity score
  mutate(
    pA_hat = predict(ps_mod, type = "response") # P(A = 1 | X)
  )

We can take a look at the estimated propensity score distribution across the treatment groups (we’ll also overlay the true distributions for comparison):

population |>
  
  # Send PS scores down the rows
  pivot_longer(
    cols = c(pA, pA_hat),
    names_to = "Type",
    values_to = "Score"
  ) |>

  # Clean labels
  mutate(
    Treatment = 
      case_when(
        A == 1 ~ "A",
        TRUE ~ "B"
      ),
    Type = 
      case_when(
        Type == "pA" ~ "True",
        TRUE ~ "Estimated"
      )
  ) |>
  
  # Make a plot
  ggplot() + 
  geom_density(
    aes(
      x = Score,
      fill = Treatment,
      linetype = Type
    ),
    alpha = .40
  ) +
  scale_x_continuous(labels = scales::percent) +
  theme(
    panel.background = element_blank(),
    axis.ticks.y = element_blank(),
    axis.text.y = element_blank(),
    axis.title.y = element_blank(),
    legend.position = "top"
  ) +
  xlab("P(A=1|X)")

The model does a great job at estimating the true propensity scores. In a real analysis we probably wouldn’t be this close since we likely wouldn’t have only and all true confounders accounted for.

Visualizing propensity score effects

In the same vein, we can explore the modeled relationships between each patient characteristic and the propensity scores.

plots <-
  population |>
  
  # Send PS scores down the rows
  pivot_longer(
    cols = c(pA, pA_hat),
    names_to = "Type",
    values_to = "Score"
  ) |>
  
  # Clean labels
  mutate(
    Treatment = 
      case_when(
        A == 1 ~ "A",
        TRUE ~ "B"
      ),
    Type = 
      case_when(
        Type == "pA" ~ "True",
        TRUE ~ "Estimated"
      )
  ) |>
  
  # Send covariates down the rows
  pivot_longer(
    cols = c(zage, zincome, male),
    names_prefix = "^z"
  ) |>
  
  # Make groups
  mutate(
    Group = 
      case_when(
        name == "male" ~ "categorical",
        TRUE ~ "continuous"
      )
  ) |>
  
  # Make a nested frame
  group_by(Group) |>
  nest() |>
  
  # Make plot depending on type
  mutate(
    plot = 
      data |>
      map(
        function(.group) {
          
          # Check condition
          if(n_distinct(.group$value) == 2) {
            
            .group |>
              summarize(
                Rate = mean(Score),
                .by = c(Treatment, Type, name, value)
              ) |>
              mutate(
                Sex = 
                  case_when(
                    value == 1 ~ "Male",
                    TRUE ~ "Female"
                  ),
                name = "Sex"
              ) |>
              ggplot() +
              geom_point(
                aes(
                  x = Sex,
                  y = Rate,
                  color = Treatment,
                  shape = Type,
                  group = Type
                )
              ) +
              geom_line(
                aes(
                  x = Sex,
                  y = Rate,
                  color = Treatment,
                  linetype = Type,
                  group = interaction(Type, Treatment)
                )
              ) +
              facet_wrap(~name) +
              coord_flip() +
              scale_y_continuous(labels = scales::percent) +
              theme(
                panel.background = element_blank(),
                panel.grid.major.x = element_line(color = "gray"),
                legend.position = "top",
                axis.title.y = element_blank()
              ) +
              ylab("P(A=1|X)")
            
          } else {
            
            .group |> 
              ggplot() + 
              geom_smooth(
                aes(
                  x = value,
                  y = Score,
                  color = Treatment,
                  fill = Treatment,
                  linetype = Type
                ),
                alpha = .25
              ) +
              facet_wrap(~name) +
              scale_y_continuous(labels = scales::percent) +
              theme(
                panel.background = element_blank(),
                panel.grid.major.y = element_line(color = "gray"),
                legend.position = "top"
              ) +
              xlab("Z-Score") +
              ylab("P(A=1|X)")
          }
          
        } 
      )
  ) 

gridExtra::grid.arrange(plots$plot[[1]], plots$plot[[2]])

As known from the true model, patients who received treatment A tend to be older, female patients with lower income. We can also see slight miscalibration in the estimates for patients who are older in that the model tends to underestimate the true propensity score in these patients. We see similar miscalibration for sex.

Calculate the (estimated) overlap weight

We’ve already defined how to calculate the overlap weights. The only difference here is that we’ll do it from the estimated propensity scores instead of the true ones. First, we’ll apply the formula to add the estimated weights to the population:

population <-
  population |>
  
  # Add the estimated overlap weight for the realized treatment (known quantity)
  mutate(
    OW_hat = A * (1-pA_hat) + (1-A) * pA_hat
  )

As we’ve mentioned earlier, we will then normalize the weights within each treatment group so they have the same cumulative contribution for estimating the treatment effect in the outcome model.

population <- 
  population |>
  
  # Add the normalized weights (adding true and estimated)
  mutate(
    OW_norm = OW / sum(OW),
    OW_hat_norm = OW_hat / sum(OW_hat),
    .by = A
  )
population |> select(A, pA, pA_hat, OW, OW_norm, OW_hat, OW_hat_norm)
## # A tibble: 10,000 × 7
##        A    pA pA_hat    OW   OW_norm OW_hat OW_hat_norm
##    <int> <dbl>  <dbl> <dbl>     <dbl>  <dbl>       <dbl>
##  1     1 0.577  0.583 0.423 0.000202   0.417   0.000198 
##  2     0 0.422  0.427 0.422 0.000201   0.427   0.000202 
##  3     1 0.647  0.610 0.353 0.000168   0.390   0.000185 
##  4     0 0.219  0.226 0.219 0.000105   0.226   0.000107 
##  5     1 0.315  0.329 0.685 0.000327   0.671   0.000318 
##  6     1 0.292  0.265 0.708 0.000338   0.735   0.000348 
##  7     0 0.644  0.628 0.644 0.000307   0.628   0.000297 
##  8     0 0.171  0.181 0.171 0.0000814  0.181   0.0000856
##  9     0 0.238  0.250 0.238 0.000114   0.250   0.000118 
## 10     0 0.257  0.268 0.257 0.000123   0.268   0.000127 
## # … with 9,990 more rows

To get a sense of the impact of these weights, let’s first look at their distributions (again, adding the true weights for comparison):

population |>
  
  # Send overlap weight down the rows
  pivot_longer(
    cols = c(OW_norm, OW_hat_norm),
    names_to = "Type",
    values_to = "Weight"
  ) |>
  
  # Clean labels
  mutate(
    Treatment = 
      case_when(
        A == 1 ~ "A",
        TRUE ~ "B"
      ),
    Type = 
      case_when(
        Type == "OW_norm" ~ "True",
        TRUE ~ "Estimated"
      )
  ) |>
  
  # Make a plot
  ggplot() + 
  geom_density(
    aes(
      x = Weight,
      fill = Treatment,
      linetype = Type
    ),
    alpha = .40
  ) +
  theme(
    panel.background = element_blank(),
    axis.ticks.y = element_blank(),
    axis.text.y = element_blank(),
    axis.title.y = element_blank(),
    legend.position = "top"
  ) +
  xlab("Normalized Overlap Weight")

In aggregate, the groups will have equal weight. The distribution shift has to do with the sampled treatment distribution. Only 38.8% of patients have treatment A so at an individual level each patient will contribute more on average.

Cumulative patient contribution

We can also look at the accumulation of patients (i.e., when each patient contributes equally) as a function of the cumulative overlap weight in each group. This allows us to understand how much (or little) the treatment effect estimation will be dominated by smaller concentrations of patients.

population |>
  
  # Send overlap weight down the rows
  pivot_longer(
    cols = c(OW_norm, OW_hat_norm),
    names_to = "Type",
    values_to = "Weight"
  ) |>
  
  # Clean labels
  mutate(
    Treatment = 
      case_when(
        A == 1 ~ "A",
        TRUE ~ "B"
      ),
    Type = 
      case_when(
        Type == "OW_norm" ~ "True",
        TRUE ~ "Estimated"
      ),
    NullWeight = 1
  ) |> 
  
  # Rearrange
  arrange(Treatment, Type, Weight) |>
  
  # Compute the cumulative weight
  mutate(
    Weight = cumsum(Weight) / sum(Weight),
    NullWeight = cumsum(NullWeight) / sum(NullWeight),
    .by = c(Treatment, Type)
  ) |> 
  
  # Make a plot
  ggplot() + 
  geom_line(
    aes(
      x = Weight,
      y = NullWeight,
      color = Treatment,
      linetype = Type
    ),
    linewidth = 1
  ) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  theme(
    panel.background = element_blank(),
    panel.grid.major.x = element_line(color = "gray"),
    panel.grid.major.y = element_line(color = "gray"),
    legend.position = "top"
  ) +
  xlab("Cumulative percent of overlap weight (%)") +
  ylab("Cumulative percent of patients (%)") 

We can see that for treatment B, about 40% of the outcome model weight will be accounted for by only 25% of patients. The weights for treatment A are slightly more evenly spread across the patients.

Weighted-mean differences

It was previously mentioned that the overlap weight methodology leads to perfect balance among the covariates used in the propensity score model. We can verify by looking at the group means for each model factor before and after weighting.

population |> 
  
  # Add uniform weights
  add_column(NullWeight = 1) |>
  
  # Send weights down the rows
  pivot_longer(
    cols = c(NullWeight, OW_hat_norm),
    names_to = "Type",
    values_to = "Weight"
  ) |> 
  
  # Send factors down the rows
  pivot_longer(
    cols = c(age, income, male),
    names_to = "Factor",
    values_to = "Value"
  ) |> 
  
  # For each 
  summarize(
    Patients = n(),
    Mean = sum(Weight * Value) / sum(Weight),
    .by = c(Factor, Type, A)
  ) |>
  
  # Clean up
  transmute(
    Factor = 
      case_when(
        Factor == "age" ~ "Age (years)",
        Factor == "income" ~ "Income ($)",
        Factor == "male" ~ "Male (%)"
      ),
    Treatment = 
      case_when(
        A == 1 ~ "A",
        TRUE ~ "B"
      ),
    Type,
    Patients,
    Mean = 
      case_when(
        Factor == "Male (%)" ~ 100 * Mean,
        TRUE ~ Mean
      )
  ) |>
  
  # Send over the columns
  pivot_wider(
    names_from = Type,
    values_from = c(Patients, Mean)
  ) |> 
  
  # Rearrange
  arrange(Factor, Treatment) |>
  select(Factor, Treatment, ends_with("NullWeight"), Mean_OW_hat_norm) |> 
  
  # Make a table
  reactable(
    groupBy = "Factor",
    columns = 
      list(
        Patients_NullWeight = colDef(name = "Patients", aggregate = "sum", align = "center"),
        Mean_NullWeight = colDef(name = "Before Weighting", aggregate = zildge::rectbl_agg_wtd("Patients_NullWeight"), align = "center", format = colFormat(digits = 1)),
        Mean_OW_hat_norm = colDef(name = "After Weighting", aggregate = zildge::rectbl_agg_wtd("Patients_NullWeight"), align = "center", format = colFormat(digits = 1))
      ),
    columnGroups = 
      list(
        colGroup(
          name = "Mean Value",
          columns = c("Mean_NullWeight", "Mean_OW_hat_norm")
        )
      ),
    striped = TRUE,
    highlight = TRUE,
    bordered = TRUE,
    resizable = TRUE,
    theme = reactablefmtr::sandstone()
  ) |>
  reactablefmtr::add_source("Use arrows to expand table", font_size = 12, font_style = "italic")

Use arrows to expand table

We can see that, overall, the weighted sample is slightly older and female with lower income.

Confounder weight shifts

Similar to looking at weight attributions as a whole, we can explore weight shifts within subgroups of patient characteristics. This helps build intuition about which patients the subsequent treatment effect estimates will focus on most (and least).

population |> 
  
  # Add uniform weights
  add_column(NullWeight = 1) |>
  
  # Convert to quntiles
  mutate(
    across(
      c(age, income),
      \(x) Hmisc::cut2(x, g = 10)
    ),
    Age = age,
    Income = income,
    Sex = 
      case_when(
        male == 1 ~ "Male",
        TRUE ~ "Female"
      ),
    Treatment = 
      case_when(
        A == 1 ~ "A",
        TRUE ~ "B"
      )
  ) |> 
  
  # Send weights down the rows
  pivot_longer(
    cols = c(NullWeight, OW_hat_norm),
    names_to = "Type",
    values_to = "Weight"
  ) |>
  
  # Send factors down the rows
  pivot_longer(
    cols = c(Age, Income, Sex),
    names_to = "Factor",
    values_to = "Level"
  ) |> 
  
  # Compute total weights
  summarize(
    Weight = sum(Weight),
    .by = c(Factor, Level, Type, Treatment)
  ) |>
  
  # Find percent of weight
  mutate(
    Weight = Weight / sum(Weight),
    .by = c(Factor, Type, Treatment)
  ) |>
  
  # Send over columns
  pivot_wider(
    names_from = Type,
    values_from = Weight
  ) |>
  
  # Clean up
  mutate(
    Level = factor(Level) |> fct_rev(),
    Factor = 
      case_when(
        Factor == "Age" ~ "Age (years)",
        Factor == "Income" ~ "Income ($)",
        TRUE ~ Factor
      )
  ) |>
  
  # Make a plot
  ggplot() +
  geom_col(
    aes(
      x = Level,
      y = NullWeight,
      fill = Treatment
    ),
    color = "black",
    linewidth = .75,
    alpha = .40,
    position = "dodge"
  ) + 
  geom_point(
    aes(
      x = Level,
      y = OW_hat_norm,
      color = Treatment,
      shape = Treatment
    ),
    position = position_dodge(width = 1),
    size = 3
  ) +
  geom_line(
    aes(
      x = Level,
      y = OW_hat_norm,
      color = Treatment,
      group = Treatment
    ),
    position = position_dodge(width = 1),
    linewidth = .5
  ) +
  facet_wrap(~Factor, scales = "free") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  theme(
    panel.background = element_blank(),
    panel.grid.major.x = element_line(color = "gray"),
    legend.position = "top"
  ) +
  xlab("Level") +
  ylab("Share of weight within group (%)") +
  labs(
    fill = "Before Weighting",
    shape = "After Weighting",
    color = "After Weighting"
  )

We can see the balancing take place. After overlap weighting, the contribution to treatment effect estimation is reduced in older patients with lower income for treatment A, and vice-versa for treatment B, with some levels in the middle having increased weight for both treatments. This is where the most overlap is between the groups.

It is also useful to look at these plots for factors that were not included in the propensity score model. We may find drastic weight shifts which might indicate significant co-variation among factors already accounted for.

Estimating the treatment effect

We’re now ready to estimate the causal treatment effect. Although we generated our data from a Weibull distribution, we will use a Cox proportional-hazards model for estimation, which is a semi-parametric method, and tends to be the default choice for modeling survival data, especially in healthcare.

A look at the true hazard ratio

The treatment effect will be quantified by a hazard ratio, which is an estimate of the ratio of instantaneous event rates between the treatment groups (this is what is done in the original simulation). This measure is different than the effect we interpreted from the true data-generating process, so, although we won’t expect them to provide the same numerical result, we should still get a similar conclusion, in that treatment A is superior to treatment B.

To start, we can compute the actual hazard ratio had we been able to observe each potential outcome (i.e., the counterfactual). To do this, we need to transform our data such that we have two rows of data per patient–one for each treatment. Then, we fit the Cox model, using the true overlap weights as case weights.

# Fit the model
true_hr_mod <-
  coxph(
    formula = Surv(time, status) ~ factor(A),
    data = 
      population |>
      
      # Keep columns needed
      select(
        OW_norm, # True (normalized) overlap weight (could use un-normalized version here)
        t_A, # True event time under treatment A
        t_B, # True event time under treatment B
        censor_time # When the patient was censored
      ) |>
      
      # Send true event times down the rows
      pivot_longer(
        cols = c(t_A, t_B),
        names_to = "A",
        values_to = "time"
      ) |>
      
      # Check if censored first
      mutate(
        A = case_when(A == "t_A" ~ 1, TRUE ~ 0), # Replace with our notation
        time = pmin(time, censor_time),
        status = as.numeric(time < censor_time)
      ),
    weights = OW_norm
  )

# Extract the result
true_hr <- exp(true_hr_mod$coefficients)[[1]]
true_hr
## [1] 0.7284168

If we could compare the worlds in which we observe all patients under each treatment, the hazard ratio is 0.73, suggesting that the instantaneous event rate is 27% lower with treatment A than with treatment B. This aligns with what we’ve established as the true effect. Our hope is that the estimate from the observable data (in the next section) will contain this value within some margin of sampling error.

Our estimate of the treatment effect

Finally, we can obtain our estimate of the treatment effect using only the observable quantities we have in our data set, which is what we would have in a real-life analysis. Similar to the previous section, we are just fitting a Cox model with the treatment as a covariate. Except this time we only observe one outcome per patient as a result of the treatment they actually received, and use the estimated overlap weight to balance the confounding factors (age, sex, and income). We will also use robust standard error estimates.

# Fit the model
estimated_hr_mod <-
  coxph(
    formula = Surv(time, status) ~ factor(A),
    data = population,
    weights = OW_hat_norm,
    robust = TRUE
  )

# Extract the result
estimated_hr <- exp(estimated_hr_mod$coefficients)[[1]]
estimated_hr
## [1] 0.7423806
# Extract the confidence interval
estimated_hr_ci <- exp(confint(estimated_hr_mod))
estimated_hr_ci
##                2.5 %    97.5 %
## factor(A)1 0.6756889 0.8156548

Our estimate for the hazard ratio is 0.74, with a 95% confidence interval ranging from 0.68 to 0.82, suggesting that the instantaneous event rate for treatment A is between 18% and 32% lower compared to treatment B (note that this captures the true hazard ratio). Under the assumption that we’ve correctly specified all confounding factors (which we did), we can interpret this as the causal treatment effect, and, assuming the magnitude of improvement is of practical importance (when considering things like cost, resources, clinical outcomes, etc.), we can reliably conclude that there is a benefit to treatment A over treatment B, on average.

Resources

Some additional resources I’ve come across while learning the methodology:

  • Original paper for overlap weighting (link)
  • Paper extending overlap weighting to survival analysis (link)
  • Example simulation of overlap weighting in the survival setting in SAS (link)
  • R package for implementing overlap weight methods (link)
  • R functions for overlap weighting in the survival setting (link)
  • Author’s web page dedicated to overlap weighting (link)
Alex Zajichek
Alex Zajichek
Statistician & Data Scientist
comments powered by Disqus

Related