Statistical Rethinking 2023 Class Notes

Bayesian Statistics
Causal Inference
Author

Alex Zajichek

Published

January 1, 2023

This document is intended to be a repository for my (raw, unedited) notes, interpretations, examples, and summaries from the Statistical Rethinking 2023 course (which Richard McElreath has graciously made available for free (!) covering his book). I’m not actually enrolled in the course, but just casually following the lectures and material. I have a strong interest in learning and incorporating Bayesian analysis and causal principles into my work, and this seemed like a great opportunity to build a foundation for that.

Table of Contents

  1. Science Before Statistics
  2. Garden of Forking Data
  3. Geocentric Models
  4. Categories and Curves
Code
# Load some packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

1. Science Before Statistics

| Week # | Lecture # | Chapter(s) | Week End | Notes Taken |
|--------|-----------|------------|----------|-------------|
| 1      | 1         | 1          | 1/6/2023 | 1/3/2023    |

Summary

The focus of this course is scientific modeling via causal inference, which is about identifying causes in observational data. Causal inference requires us to consider the mechanism of a phenomenon, and to think not only about which variables cause which other variables, but in what order; subject matter expertise is of utmost importance, and we don’t really depend on the data at hand until the very end of the inference process. Causal modeling must become the foundation we do analysis by: we can’t just do simple statistics in one project and then think about causal modeling in another. Samples come from populations, and there are causes behind why we observed the sample we did, even if we’re answering very basic questions. Also, using Bayesian modeling as the means of performing causal inference is not due to philosophical reasons (e.g., frequentist vs. Bayesian); it’s because a Bayesian framework provides the most natural tools to employ the specified causal model (i.e., if a frequentist model made sense for answering the causal question, we’d use it). The generative aspect of Bayesian modeling in particular makes it very inviting for representing the causal model in a statistical framework and applying distributions. Finally, coding is not just a means to employ the math; it needs to be treated as part of the product, which means employing software engineering principles, having documentation, and making things reproducible. These things need to be in place if you really want to advance knowledge with confidence.

Notes

Overview

  • Most interested in Causal Inference, focusing on the science before the statistics
  • We must be able to talk about causes to obtain scientific knowledge, why else would we do it?
  • Causes can’t be extracted from data; must come from knowledge, assumptions

What is Causal Inference?

  • It is more than associations; associations are bi-directional, and correlation is only a basic measure of association;
  • It is all about intervention, directionality, and the prediction of the consequence of changing one variable on another (asking what-if?)

Causal Imputation

  • This is about being able to construct counterfactual outcomes
  • Asking the question, what if I had done something else?
  • We only observe a single outcome, but we want to know what would have happened had a certain intervention not taken place

Directed Acyclic Graph (DAG)

  • Nothing more than an abstraction about which variables cause which other variables
  • Shows the direction in which variables cause each other, but doesn’t specify how (i.e., effect shapes, etc.)
  • We can use this to know which things to control for, answer hypothetical interventions, under the assumption that the model is true
  • It provides a tool to answer very specific questions (queries); not necessarily all questions lead to the same statistical model, but the appropriate statistical model can be derived from the causal model depending on the question
  • Intuition Pumps: Gets the researcher to think about mechanism; a great way to problem-solve with SMEs without looking at the data (which is how it should be)

Golems (statistical models)

  • Metaphor for what a statistical model is; it’s a very useful machine that will do what it’s asked very well, but has no wisdom or forethought
  • Does not know the intent of the task
  • Statistical models are just objective tools, but we need causal models to know how and when certain models are actually appropriate

Statistical Models

  • Having a flowchart of tests is not useful, except maybe in the experimental setting (remember we’re talking observational data)
  • Statistical models/tests don’t make a clear relationship between the research and the data; it’s just math

Hypotheses & Models

  • We need generative causal models that are guided by the DAG’s
  • We need estimands that are statistical models justified by the generative models (how do we quantify what we’re after?)
  • Introduce real data at the end–this is the easy part

Justifying Controls

  • Cannot just control for everything in your dataset, as is so often done in current research (e.g., colliders have undesired effects)
  • Need the causal model (DAG) to be able to deduce what should be controlled for based on the specific question that is asked
  • Adjustment Set: The variables determined appropriate to control for in a particular query
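
As a small illustration of what an adjustment set looks like in code (my own toy example, not one from the lecture), the dagitty R package can compute adjustment sets directly from a DAG; here Z is a made-up confounder of the effect of X on Y:

Code
# A toy DAG (my own example): Z confounds the effect of X on Y
library(dagitty)
toy_dag <- dagitty("dag{ X -> Y ; Z -> X ; Z -> Y }")

# The adjustment set for the query "effect of X on Y" should be {Z}
adjustmentSets(toy_dag, exposure = "X", outcome = "Y")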

Why Bayesian?

  • Bayesian happens to be the easiest approach for generative models; it’s not because we’re stuck in a philosophical debate
  • Easiest way to take the scientific structure of the assumed model and generate it, since it naturally has direction (i.e., priors)
  • In most cases, Bayes can be appropriate (sometimes not–cut cake with chainsaw)
    • Measurement error, missing data, latent variables, regularization
  • It is practical, not philosophical

Owls

  • Classic joke: Step 1 = draw two circles, Step 2 = draw the rest of the owl
    • Programming and technical things tend to be taught this way, but we want to avoid this and document all the intermediate steps
  • We need to have an explicit workflow with clear steps
  • We need to treat coding/scripting seriously, not just a means to something (apply software engineering principles, documentation, quality control)
  • Understand what you are doing, document your work and reduce error, have a respectable scientific workflow, be professional and organized to maintain reproducible scientific knowledge, otherwise it’s all bullshit
  • Workflow
    1. Theoretical Estimand (what are we trying to do?)
    2. Scientific (Causal) Model (DAG + Generative)
    3. Use 1 & 2 to build appropriate statistical model
    4. Simulate from 2 to validate that 3 yields 1
    5. Analyze the actual data
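
As a minimal sketch of what this workflow can look like in code (my own toy version, anticipating the globe-tossing example in the next lecture): simulate data from a generative model with a known parameter, build a simple estimator, and confirm the estimator recovers the truth before touching real data.

Code
# Toy version of the workflow; the estimand (step 1) is a proportion p
set.seed(1)

# 2. Generative model: N independent draws, each a "success" with probability p
sim_data <- function(p, N) rbinom(N, size = 1, prob = p)

# 3. Estimator: posterior mode over a grid under a flat prior (log scale to avoid underflow)
estimate_p <- function(x, grid = seq(0, 1, by = .01)) {
  log_post <- dbinom(sum(x), size = length(x), prob = grid, log = TRUE)
  grid[which.max(log_post)]
}

# 4. Validate: simulate with a known p and check that the estimator recovers it
estimate_p(sim_data(p = .7, N = 1e4)) # should land very close to .7

# 5. Only then run the estimator on the actual data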

2. Garden of Forking Data

| Week # | Lecture # | Chapter(s) | Week End | Notes Taken |
|--------|-----------|------------|----------|-------------|
| 1      | 2         | 2, 3       | 1/6/2023 | 1/6/2023    |

Summary

This scientific modeling framework provides an objective process for incorporating subjective (expert, scientific) knowledge into the modeling process, enabling us to account for all of the uncertainty associated with those processes, predicated on the assumption that the causal model is correct. Further, one of the key takeaways was that samples do not need to be representative of the population for us to provide good estimates. This is profound because generally we are taught the opposite, but because of the process, we can explicitly account for how we know/assume the data were generated, and use that information to create a good estimate of the quantity we are interested in. This is much more practical than the assumptions made in a typical frequentist analysis, which tend to be made blindly, which ironically makes them more wrong than the “subjective” information in the generative approach. We can then use sampling from our posterior distribution(s) to answer questions about what might happen if we run another experiment (e.g., what if we take 10 more samples?). Instead of relying on asymptotics for the sampling distribution of a statistic (frequentist), we can just take samples from the posterior for any complex quantity of interest and get the uncertainty surrounding it. This is especially important once we are dealing with analytically intractable posteriors that don’t have closed-form solutions. Instead of needing expert-level calculus knowledge for such problems, we just have to follow the same workflow as in this basic problem. After years of frequentist modeling, always full of limitations and dissatisfaction with the results, this approach should lead to much more rewarding scientific discovery and confidence in the conclusions of research.

A check for understanding

Let’s go through and reproduce some of the content/concepts from slides but using our own explanation, implementation and interpretation along the way.

1. What is the objective?

The main question asked in the demonstration was: what proportion of the globe is water? Thus, we are interested in a single quantity: the true proportion of the globe that is water.

2. What is the sampling strategy?

We want to collect data to try to answer the question of interest. This will be done by spinning the globe and dropping a pin at a random location to indicate if it is either land or water. Some initial assumptions are

  • All points on the globe are equally-likely to be selected
  • Any given point on the globe is either land or water (only two possibilities)
  • There is no measurement error associated with indicating if the selected point was land or water

3. What is the generative model?

We want to consider the different variables at play here as it relates to any observed sample we get as a result of the sampling strategy. First and foremost, the primary unknown parameter is:

\[p=\text{proportion of water on the globe}\]

The other two at play (under this simplistic model) are:

\[N=\text{Number of globe spins} \hskip.5in W=\text{Number of spins resulting in water}\] Note that the number of spins resulting in land is just \(N-W\)

With the variables defined, the next step is to determine how these variables relate to each other. We’ll use the following DAG:

flowchart LR
  A(p) --> C(W)
  B(N) --> C

This assumes that the number of water spins observed in our sample is determined by:

  1. The true proportion of water on the globe
  2. The total number of spins of the globe made (samples)

4. What is the statistical model/estimation procedure?

Let’s suppose we execute the sampling procedure which yields the following response vector:

Code
observed_sample <- c(1, 0, 1, 1, 1, 0, 1, 0, 1) # 1 = water; 0 = land
W <- sum(observed_sample) # Number of water samples
N <- length(observed_sample) # Number of spins
W; N
[1] 6
[1] 9

We just need to count all of the ways that this sample could have arisen across all of the different possibilities of \(p\), and then estimate \(p\) as the value under which the sample was most likely to have occurred.

Basic (incorrect) solution with finite possibilities

We know that there are infinitely many possibilities for \(p\). Let’s first go through this assuming the globe is like a 4-sided die, such that each side is land or water, implying the only possibilities are \(p \in \{0,.25,.50,.75,1\}\). For each possible value of \(p\), what is the number of ways we could have observed our sequence of data? (thinking of the generative process, starting with \(N\) and \(p\)).

First of all, we can set our possible set of parameter values, and the number of “sides” of the globe this implies (i.e., we’re saying that there are only 4 sides and each one is either Water or Land, so we have a limited number of \(p\) values that could occur).

Code
# Set possible values for p
p <- c(0, .25, .5, .75, 1)

# Number of sides of globe
sides <- length(p) - 1

For each of the 5 possible values of \(p\), how many combinations are there that produce our observed sequence of data?

Code
# Number of ways to observe sample for each p (this is the count of the possible sequences of indicators)
ways <- (sides*p)^W * (sides*(1-p))^(N-W)
ways
[1]   0  27 512 729   0

Now, of those possibilities, which was the most likely to occur?

Code
# Posterior probability
posterior_prob <- ways / sum(ways)
cbind(p, ways, posterior_prob)
        p ways posterior_prob
[1,] 0.00    0     0.00000000
[2,] 0.25   27     0.02129338
[3,] 0.50  512     0.40378549
[4,] 0.75  729     0.57492114
[5,] 1.00    0     0.00000000

It looks like \(p=0.75\) was the most likely value of those that are possible.

What is key to note about the posterior probabilities is that they are relative to the total across all values of \(p\). We simply found all of the raw counts associated with each \(p\) and then normalized them by the total to get the posterior probability. But this process was exactly the same thing as finding the likelihood of the data:

\[\text{Likelihood} = \prod_{i=1}^N P(X=x_i|p)\]

where \(X\) is the binary indicator from a single globe spin.

If we just look at all the possible sequences of indicators that could have occurred:

Code
# Total possible sequences of indicators (each one could be a 1 or a 0)
total_possible_sequences <- sides^N
total_possible_sequences
[1] 262144

And then divide our original combination counts by that, we’ll get exactly the likelihood of the data:

Code
# Divide the number of ways we could have seen our sample by the total number of possible sequences
likelihood <- ways / total_possible_sequences
likelihood
[1] 0.0000000000 0.0001029968 0.0019531250 0.0027809143 0.0000000000

However, as stated above, this will not change the resulting posterior distribution because the number we divided by was just a normalizing constant:

Code
likelihood / sum(likelihood)
[1] 0.00000000 0.02129338 0.40378549 0.57492114 0.00000000

So, we could also think of this problem in a different light (although it’s the SAME) and get the same result:

  1. We could think of each observed value as an (unfair) coin flip (according to the value of \(p\)) and calculate the likelihood of the sequence of flips (which is actually what we already did, but this is more of the “traditional” way to think about it):
Code
# Likelihood of sequence of observed sample
likelihood2 <- p^W * (1-p)^(N-W)
likelihood2
[1] 0.0000000000 0.0001029968 0.0019531250 0.0027809143 0.0000000000
Code
# Compute posterior
likelihood2 / sum(likelihood2) # Same as before
[1] 0.00000000 0.02129338 0.40378549 0.57492114 0.00000000
  2. We could also think of this as finding the likelihood of observing the total number of water spins since each flip is independent. This is also the same as before, except we’re accounting for all of the combinations to observe the total number of water flips, not just the particular sequence:
Code
# Make the normalizing constant
normalizing_constant <- factorial(N) / (factorial(W)*factorial(N-W))

# Multiply the likelihood by the normalizing constant to get the true probability of the observed sample for each value of p
probability <- normalizing_constant * likelihood
probability
[1] 0.000000000 0.008651733 0.164062500 0.233596802 0.000000000
Code
# Compute the posterior
probability / sum(probability)
[1] 0.00000000 0.02129338 0.40378549 0.57492114 0.00000000

Note that the normalizing constant had no effect on the posterior, but it did calculate the correct probabilities of the observed sample. In fact, this was just a Binomial distribution:

Code
# What is the probability of observing W water values in a sample of N globe spins for each p?
dbinom(x = W, size = N, prob = p)
[1] 0.000000000 0.008651733 0.164062500 0.233596802 0.000000000

That is, the probability distribution for the number of water samples is:

\[W|p \sim Binomial(N, p)\] \[ \begin{equation} \begin{split} P(W|p) &= \binom{N}{W}p^W(1-p)^{N-W} \\ &= \frac{N!}{W!(N-W)!}p^W(1-p)^{(N-W)} \\ \end{split} \end{equation} \]

So what is going on here? We are after the distribution of probability weights associated with each possible value of \(p\) (which is what the posterior distribution is). In mathematical notation, we’re just applying Bayes’ formula:

\[ \begin{equation} \begin{split} P(p|sample) & = \frac{P(p)P(sample|p)}{P(sample)} \\ & = \frac{P(p)P(W|p)}{P(W)} \\ & = \frac{P(p)P(W|p)}{P(W \cap p = 0) + ... + P(W \cap p = 1)} \\ & = \frac{P(p)P(W|p)}{P(p=0)P(W|p=0) + ... + P(p=1)P(W|p=1)} \\ \end{split} \end{equation} \] Each value of \(p\) is equally-likely to occur (uniform prior), so we can factor out that term:

\[ \begin{equation} \begin{split} \text{(from previous)} & = \frac{P(W|p)}{P(W|p=0) + ... + P(W|p=1)} \\ (binomials) &= \frac{\binom{N}{W}p^W(1-p)^{(N-W)}}{\binom{N}{W}0^W(1-0)^{(N-W)} + ... + \binom{N}{W}1^W(1-1)^{(N-W)}} \\ & = \frac{p^W(1-p)^{(N-W)}}{0^W(1-0)^{(N-W)} + ... + 1^W(1-1)^{(N-W)}} \\ & = \frac{p^W(1-p)^{(N-W)}}{\text{Normalizing constant}} \\ \end{split} \end{equation} \] As you can see, the combination term also factors out, and the basic structure we’re left with is the likelihood piece that was found in all three (3) variations above: \(p^W(1-p)^{(N-W)}\). So when computing the posterior probability, the values are relative only to terms that depend on the parameter of interest, so it doesn’t matter whether we use the counts, the base likelihood, or the probability distribution: they are all the SAME. The counting process and the “forking data” approach are simply a means to break down what’s happening behind the scenes in the math, so instead of just saying “do this integral” or “compute this product of the likelihood”, you’re picking apart each step of that process to gain intuition about what is happening. I’d imagine this is exactly the point of the Owl reference in the prior lecture.
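
As a quick numerical check of that equivalence, using the objects already computed above, all three normalizations should return the identical posterior:

Code
# The counts, the raw likelihood, and the Binomial probabilities all normalize the same way
all.equal(ways / sum(ways), likelihood / sum(likelihood))
all.equal(ways / sum(ways), probability / sum(probability))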

Full solution: p is a continuous value

As mentioned before, the actual proportion of water on the globe can be any number between zero and one (\(p \in [0,1]\)), meaning that there are “infinitely many” sides to the globe. The derivation at the end of the previous section illustrates that the posterior distribution for \(p\) is not restricted to any particular set of values. If we pick up where we left off:

\[ \begin{equation} \begin{split} P(p|data) & = \frac{p^W(1-p)^{(N-W)}}{\text{Normalizing constant}} \\ \end{split} \end{equation} \] All we would need to do for the continuous version of \(p\) to make the posterior a formal probability distribution is to find the normalizing constant such that the integral over all possible values of \(p\) equals 1. Formally, with respect to \(p\),

\[\int_0^1 \frac{p^W(1-p)^{N-W}}{\text{Constant}}\,dp = 1\] This ensures that the probability across all possible values of \(p\) integrates to one. However, it doesn’t actually matter that we find that constant, because the posterior probability is just relative across the range of values of \(p\). So all that really matters is:

\[P(p|data) \propto p^W(1-p)^{N-W}\] We can then plug in our data and graph the resulting distribution to make inferences about \(p\).

\[P(p|data) \propto p^6(1-p)^3\]

Code
tibble(
  p = seq(0, 1, .01), # Approximate the range of p values
  posterior = p^W*(1-p)^(N-W) # Compute the posterior
) %>%
  
  # Make a plot
  ggplot() +
  geom_line(
    aes(
      x = p,
      y = posterior
    )
  ) +
  theme(
    axis.ticks.y = element_blank(),
    axis.text.y = element_blank()
  )

After using our sample of 9, the probability weight for \(p\) tends to concentrate near 0.67 (the sample proportion, 6/9). Note that the scale of the y-axis was removed to emphasize that it doesn’t really matter what it is. We would just need to be able to calculate the area under the curve to be able to assign real probabilities to questions like “what is the probability that the proportion of water is less than 0.5?”.
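
For instance, a rough way to answer that question without finding the exact normalizing constant is to normalize the unnormalized posterior over a fine grid of \(p\) values and sum the resulting weights below 0.5 (a grid approximation, not an exact integral):

Code
# Approximate P(p < 0.5 | data) by normalizing over a fine grid
p_grid <- seq(0, 1, by = .001)
unnormalized <- p_grid^W * (1 - p_grid)^(N - W)
posterior_grid <- unnormalized / sum(unnormalized)
sum(posterior_grid[p_grid < 0.5])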

Note: In some cases, if we use a different prior on \(p\) (e.g., a Beta distribution), the posterior will turn out to be a recognizable distribution for which we know the normalizing constant.

Updating the posterior

So when we talk about “Bayesian updates” or updating the posterior distribution, what does this mean? Since the point is to be able to update a model with new information, my gut used to tell me that we were somehow folding our current knowledge about the parameter into a new prior distribution, and then computing the new posterior with that updated prior and only the new data in the likelihood. While in a way this might be the right way to think about it (i.e., if I have a posterior right now, isn’t that the most current knowledge about the parameter? So if I want to collect more data, wouldn’t I want to use the knowledge up to this point as the prior, instead of reverting back to the original prior and just adding more data to the collective sample?), in these examples we were doing something different: we’re just seeing how the posterior changes as more data is added to the sample (i.e., to the observed sequence of data points).
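
For what it’s worth, with a uniform prior the two views agree: updating observation-by-observation, carrying each posterior forward as the next prior, lands on the same final posterior as using the flat prior with the full sample all at once. A quick check of this (my own, using the objects defined earlier):

Code
# Sequential updating (previous posterior becomes the next prior) vs. all-at-once
p_grid <- seq(0, 1, by = .01)
posterior_seq <- rep(1 / length(p_grid), length(p_grid)) # start from the uniform prior
for (x in observed_sample) {
  single_likelihood <- ifelse(x == 1, p_grid, 1 - p_grid) # likelihood of one observation
  posterior_seq <- posterior_seq * single_likelihood
  posterior_seq <- posterior_seq / sum(posterior_seq)
}

# All-at-once posterior from the full sample under the flat prior
posterior_all <- p_grid^W * (1 - p_grid)^(N - W)
posterior_all <- posterior_all / sum(posterior_all)

all.equal(posterior_seq, posterior_all)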

Let’s start with just focusing on the basic example (i.e., the 4-sided globe) for now. We just need to loop through the observed sample, and calculate the posterior probabilities for each value of \(p\) as each new observation comes in:

Code
# Set the prior probability (uniform over the possible choices)
prior <- rep(1 / length(p), length(p))

# Set the current posterior as the prior (before any data collected)
last_posterior <- prior

# Make result set
results <- tibble()

# For each value in the observed sample 
for(i in 1:N) {
  
  # 1. Get the sub-sample
  sub_sample <- observed_sample[1:i]
  
  # 2. Compute metrics (the number of water samples, and the total number of spins)
  W_temp <- sum(sub_sample)
  N_temp <- length(sub_sample)
  
  # 3. Compute the likelihood for each p
  temp_likelihood <- p^W_temp * (1 - p)^(N_temp - W_temp)
  
  # 4. Posterior
  temp_posterior <- temp_likelihood / sum(temp_likelihood)
  
  # 5. Add to results
  results <-
    results %>%
    bind_rows(
      tibble(
        sample = i,
        sequence = paste(sub_sample, collapse = ","),
        p,
        likelihood = temp_likelihood,
        current = temp_posterior,
        last = last_posterior
      )
    )
  
  # Set the new last posterior
  last_posterior <- temp_posterior

}

results %>%
  
  # Send down the rows
  pivot_longer(
    cols = c(last, current)
  ) %>%
  
  # Make a plot
  ggplot() +
  geom_col(
    aes(
      x = factor(p),
      y = value,
      fill = name
    ),
    color = "black",
    alpha = .75,
    width = .25,
    position = "identity"
  ) +
  facet_wrap(
    ~paste0("Spin: ", factor(sample), " \nSample: ", sequence)
  ) +
  theme(
    legend.position = "top",
    panel.background = element_blank(),
    panel.grid.major.y = element_line(colour = "gray")
  ) +
  xlab("p") +
  ylab("Posterior Probability") +
  labs(
    fill = "Posterior"
  ) +
  scale_fill_manual(
    values = c("blue", "darkgray")
  ) 

The blue bars show the posterior probability for each possible value of \(p\) after the newest observation was made, and the gray bars show it before the newest observation was made. This illustrates the incremental impact of adding more data to the sample on the resulting posterior distribution.

We can apply this same process to the continuous (correct) possible set of values for \(p\) (in fact, we’ll create the curves by performing the exact same procedure to a larger, discrete set of values but make the display appear continuous):

Code
# Approximate the set of infinite p-values by a large set of discrete ones
p_continuous <- seq(0, 1, .01)

# Set the prior probability (uniform over the possible choices)
prior <- rep(1 / length(p_continuous), length(p_continuous))

# Set the current posterior as the prior (before any data collected)
last_posterior <- prior

# Make result set
results <- tibble()

# For each value in the observed sample 
for(i in 1:N) {
  
  # 1. Get the sub-sample
  sub_sample <- observed_sample[1:i]
  
  # 2. Compute metrics (the number of water samples, and the total number of spins)
  W_temp <- sum(sub_sample)
  N_temp <- length(sub_sample)
  
  # 3. Compute the likelihood for each p
  temp_likelihood <- p_continuous^W_temp * (1 - p_continuous)^(N_temp - W_temp)
  
  # 4. Posterior
  temp_posterior <- temp_likelihood / sum(temp_likelihood)
  
  # 5. Add to results
  results <-
    results %>%
    bind_rows(
      tibble(
        sample = i,
        sequence = paste(sub_sample, collapse = ","),
        p_continuous,
        likelihood = temp_likelihood,
        current = temp_posterior,
        last = last_posterior
      )
    )
  
  # Set the new last posterior
  last_posterior <- temp_posterior
  
}

results %>%
  
  # Send down the rows
  pivot_longer(
    cols = c(last, current)
  ) %>%
  
  # Make a plot
  ggplot() +
  geom_area(
    aes(
      x = p_continuous,
      y = value,
      fill = name
    ),
    color = "black",
    alpha = .65,
    position = "identity"
  ) +
  facet_wrap(
    ~paste0("Spin: ", factor(sample), " \nSample: ", sequence)
  ) +
  theme(
    legend.position = "top",
    panel.background = element_blank(),
    panel.grid.major.y = element_line(colour = "gray"),
    axis.ticks.y = element_blank(),
    axis.text.y = element_blank()
  ) +
  xlab("p") +
  ylab("Posterior Probability") +
  labs(
    fill = "Posterior"
  ) +
  scale_fill_manual(
    values = c("blue", "darkgray")
  ) 

Again, these curves are actually approximated here. In practice, we would need to calculate the area underneath the curve to get exact answers about probabilities. Note: We could use the Beta distribution that this posterior corresponds to in order to get the normalizing constant for calculating actual posterior probabilities.
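
For example, under the flat prior the posterior is \(Beta(W+1, N-W+1)\), so exact probability statements can come straight from the Beta CDF rather than from approximating areas (using the original \(W=6\), \(N=9\) sample):

Code
# Exact posterior probabilities from the Beta CDF (flat prior => Beta(W + 1, N - W + 1))
pbeta(0.5, shape1 = W + 1, shape2 = N - W + 1)            # P(p < 0.5 | data)
qbeta(c(.055, .945), shape1 = W + 1, shape2 = N - W + 1)  # an 89% credible interval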

Homework

  1. Suppose the globe tossing data had turned out to be 4 water and 11 land. Construct the posterior distribution.

We’ll cheat a little bit by using a large, discrete list of possible values for \(p\), and plot it as if it is continuous (we could also just use the Beta distribution). With that, all we need to do is change the values of \(W\) and \(N\) and use the same code as above.

Code
W_new <- 4
N_new <- 15
tibble(
  p = seq(0, 1, .01), # Approximate the range of p values
  posterior = p^W_new*(1-p)^(N_new-W_new) # Compute the posterior
) %>%
  
  # Make a plot
  ggplot() +
  geom_line(
    aes(
      x = p,
      y = posterior
    )
  ) +
  theme(
    axis.ticks.y = element_blank(),
    axis.text.y = element_blank()
  )

  2. Using the posterior distribution from (1), compute the posterior predictive distribution for the next 5 tosses of the same globe.

Okay here I’ll finally acknowledge that the posterior can be written as a Beta distribution, which gives us the normalizing constant needed to make it a real probability distribution (i.e., the area sums to 1). It has the following probability density function (PDF):

\[f(x|\alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}\] where \(x \in [0,1]\), \(\alpha, \beta > 0\), and \(\Gamma(n) = (n-1)!\). If we make the following reparameterizations from our set of variables:

\[p=x\] \[W = \alpha - 1\] \[N-W = \beta - 1\] we get:

\[ \begin{equation} \begin{split} f(p|W,N) & = \frac{\Gamma(W+1+N-W+1)}{\Gamma(W+1)\Gamma(N-W+1)}p^W(1-p)^{N-W} \\ & = \frac{\Gamma(N+2)}{\Gamma(W+1)\Gamma(N-W+1)}p^W(1-p)^{N-W} \\ & = \frac{(N+1)!}{W!(N-W)!}p^W(1-p)^{N-W} \\ \end{split} \end{equation} \]

Note that since \(p\) is continuous, this function represents the probability density at any particular value of \(p\). To get any positive probability, we must integrate this function over a range of \(p\) values. Hence the \(f\) notation.
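
A quick numerical check of that derivation (my own, using integrate()): the unnormalized kernel should integrate to \(W!(N-W)!/(N+1)!\), and the normalized density should integrate to 1.

Code
# Check the derived normalizing constant numerically (W_new = 4, N_new = 15 from above)
kernel <- function(p) p^W_new * (1 - p)^(N_new - W_new)
integrate(kernel, lower = 0, upper = 1)$value
factorial(W_new) * factorial(N_new - W_new) / factorial(N_new + 1) # should match

# The normalized density integrates to 1
constant <- factorial(N_new + 1) / (factorial(W_new) * factorial(N_new - W_new))
integrate(function(p) constant * kernel(p), lower = 0, upper = 1)$value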

Let’s quickly plug in \(W=4\) and \(N=15\) to confirm (at least visually) that this function produces the same posterior we made to answer the last question. We’ll do this by evaluating the density of the derived Beta distribution at a range of possible \(p\) values. Note we’ll have to use the parameterization of the Beta distribution that is built into R.

Code
# Reparameterize
alpha <- W_new + 1
beta <- N_new - W_new + 1

# Make a data frame
tibble(
  p = seq(0, 1, .01), # Approximate the range of p values (same as before)
  posterior = dbeta(p, shape1 = alpha, shape2 = beta) # Compute the actual posterior density
) %>%
  
  # Make a plot
  ggplot() +
  geom_line(
    aes(
      x = p,
      y = posterior
    )
  )

From inspection, it appears that they are essentially the same curves. Note that this time I kept the y-axis labels because this is the density for the actual probability distribution.

So back to the question: how do we find the posterior predictive distribution for the next 5 globe spins? Well, if the globe is spun 5 more times, then we can get water on 0, 1, 2, 3, 4, or all 5 spins. Our quest is to figure out the likelihood of each of those possible outcomes based on what we currently know about \(p\) (i.e., the posterior distribution). To do this, we’ll use our posterior distribution to run a simulation of the experiment using the following steps:

  1. Select a random value of \(p\) from the posterior
  2. Draw a random \(binomial\) realization where \(N=5\)
  3. Repeat steps i-ii 10000 times
  4. Graph the results
Code
# Set some parameters
set.seed(123)
n_experiment <- 5 # Number of new spins we're going to conduct
s <- 10000 # Number of simulations

# 1. Draw random values of p from the posterior
p_rand <- rbeta(n = s, shape1 = alpha, shape2 = beta) # Same parameters as before

# 2. For each p, run a binomial experiment (represents samples of W)
w_rand <- rbinom(n = s, size = n_experiment, prob = p_rand)

# Make a data frame
tibble(
  w = w_rand
) %>%
  
  # For each w
  group_by(w) %>%
  
  # Compute the total
  summarise(
    count = n()
  ) %>%
  
  # Add proportion
  mutate(
    proportion = count / sum(count)
  ) %>%
  
  # Make a plot
  ggplot(
    aes(
      x = factor(w)
    )
  ) +
  geom_col(
    aes(
      y = proportion
    )
  ) +
  geom_text(
    aes(
      y = proportion,
      label = paste0(round(proportion*100,1), "%")
    ),
    vjust = -.1
  ) +
  scale_y_continuous(
    labels = scales::percent
  ) +
  xlab("W") +
  ylab("Probability") +
  labs(
    title = paste0("Posterior Predictive Distribution for ", n_experiment, " more spins.")
  )

Another way to think about what we’re doing here is:

  1. Find the collection of density values for all \(p\) in the posterior distribution
  2. Find the probability distribution of the possible outcomes from a Binomial distribution where \(N=5\) for all possible values of \(p\) (i.e., independent of our posterior)
  3. Take the average probability value for each possible outcome in (ii) over all values of \(p\), weighted by the posterior density in (i)
Code
# Make a set of p representative of its domain
p_alt <- seq(0, 1, .001) # Supposed to represent continuous p

# 1. Posterior density values for each p
posterior_alt <- dbeta(p_alt, shape1 = alpha, shape2 = beta)

# 2. Probability distribution for each outcome for each p
possible_outcomes <- 0:n_experiment
likelihood_alt <- 
  possible_outcomes %>% 
  map_df(
    ~
      tibble(
        binomial_probability = dbinom(x = .x, size = n_experiment, prob = p_alt),
        p = p_alt,
        outcome = .x
      )
  )

# 3. Get the posterior predictive distribution
posterior_predictive_distribution <-
  likelihood_alt %>%
  
  # Join to attach the posterior density weight to each value of p
  inner_join(
    y = 
      tibble(
        p = p_alt,
        posterior_density = posterior_alt
      ),
    by = "p"
  ) %>%
  
  # For each possible outcome
  group_by(outcome) %>%
  
  # Compute the weighted-average probability
  summarise(
    posterior_probability = sum(binomial_probability * posterior_density) / sum(posterior_density)
  )

# 4. Make a plot
posterior_predictive_distribution %>%
  ggplot(
    aes(
      x = factor(outcome)
    )
  ) +
  geom_col(
    aes(
      y = posterior_probability
    )
  ) +
  geom_text(
    aes(
      y = posterior_probability,
      label = paste0(round(posterior_probability*100,1), "%")
    ),
    vjust = -.1
  ) +
  scale_y_continuous(
    labels = scales::percent
  ) +
  xlab("W") +
  ylab("Probability") +
  labs(
    title = paste0("Posterior Predictive Distribution for ", n_experiment, " more spins.")
  )

The distributions from the two approaches are nearly identical (slight differences due to simulation variability and inexact integration).

  3. Use the posterior predictive distribution from (2) to calculate the probability of 3 or more water samples in the next 5 tosses.

Using the posterior predictive distribution above, we can look at the percent of simulations that resulted in 3 or more water samples:

Code
mean(w_rand >= 3)
[1] 0.1771

So there is about a 0.177 probability of 3 or more water samples in the next 5 tosses. However, this point estimate is not totally sufficient on its own because we haven’t reported any uncertainty associated with it. Since we know that \(W|p \sim Binomial(n,p)\), \(P(W \geq 3|p,N)\) is already determined for any given \(p\). In this case, we can just calculate it for each random value of \(p\) sampled from the posterior:

Code
# Compute the binomial probability
prob_binom <- 1 - pbinom(q = 2, size = n_experiment, prob = p_rand)

# Make a plot
ggplot() +
  geom_density(
    aes(
      x = prob_binom
    )
  ) +
  theme(
    axis.ticks.y = element_blank(),
    axis.text.y = element_blank()
  ) +
  xlab("P(W>=3|N=5)") 

This one has been stumping me a bit, and I’m not totally confident this is the correct result. Another way I was thinking about it was similar to the alternative in (2), namely that we are finding \(P(W \geq 3|N=5)\) for each value of \(p\), and then weighting that by the posterior density of \(p\). However, it seems like we should be sampling from the predictive distribution in (2) somehow, but if we do that, it seems like the precision of our estimates would then just be determined by the number of simulations we run, not the data, which also doesn’t make sense.
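
One cross-check I can do with the objects already computed above: summing the exact posterior predictive probabilities for \(W \geq 3\) should land very close to the simulation-based estimate, since both already average over the posterior uncertainty in \(p\):

Code
# P(W >= 3 in the next 5 spins) from the exact posterior predictive distribution...
posterior_predictive_distribution %>%
  filter(outcome >= 3) %>%
  summarise(probability = sum(posterior_probability))

# ...and from the simulation draws
mean(w_rand >= 3)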

Notes

Goal: Estimate the percent of the globe that is covered in water

  • Think of spinning the globe and stopping on a point and repeating many times
  • How do we use that collection of points to come up with an estimate? That’s the goal of today’s lecture
  • First thought is to just indicate each time whether the point comes up land or water; however, how does the shape of the globe impact the likelihood that I will come up with land or water on a “random” toss? This has to do with the sampling strategy
  1. Define a generative model
  • Think conceptually about scientifically how the sample was produced (how do variables influence one another)
  • Variables: Things we want to observe/estimate or things we actually do observe

\[\bf{p} = \text{proportion of water}\hskip.5in W=\text{water observations}\] \[N = \text{number of tosses}\hskip.5in L=\text{land observations}\]

  2. Define a specific estimand

We’re interested in the true proportion of water, \(p\)

  3. Design a statistical way to produce the estimate
  • How are these related to each other?
    • N influences W and L (the more tosses leads to change on other variables)
    • p also influences W and L (i.e., the true proportion dictates the number of water observations and land observations)
    • The DAG shows relationships, but not what the relationships are. We can say \(W,L=f(p,N)\); what is \(f\)?
    • Assume a model (e.g., \(p = .25\)), then count how likely the sample was under that model; do that for all possible models
  4. Test (3) using (1)
Code
sim_globe <-
  function(p = .7, N = 9) {
    sample(
      c("W","L"), # Possible observations
      size = N, # Number of tosses
      prob = c(p, 1-p), # The probability of each possible observation
      replace = TRUE)
  }
sim_globe()
[1] "W" "W" "W" "W" "W" "W" "W" "W" "W"
Code
replicate(sim_globe(p =.5, N=9), n=10)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,] "W"  "L"  "W"  "W"  "L"  "W"  "L"  "W"  "L"  "W"  
 [2,] "L"  "L"  "L"  "W"  "L"  "L"  "W"  "L"  "L"  "W"  
 [3,] "L"  "L"  "W"  "W"  "W"  "W"  "W"  "W"  "W"  "W"  
 [4,] "W"  "W"  "W"  "L"  "W"  "L"  "L"  "W"  "W"  "W"  
 [5,] "W"  "L"  "W"  "W"  "L"  "W"  "W"  "W"  "W"  "L"  
 [6,] "W"  "W"  "L"  "W"  "L"  "W"  "L"  "L"  "W"  "W"  
 [7,] "L"  "L"  "W"  "W"  "W"  "W"  "L"  "L"  "L"  "W"  
 [8,] "W"  "L"  "L"  "W"  "L"  "W"  "W"  "W"  "W"  "W"  
 [9,] "W"  "L"  "W"  "L"  "L"  "W"  "L"  "W"  "W"  "L"  
  • Test the intent of the code first
  • If our procedure doesn’t work when we know the answer, it certainly won’t when we don’t know the answer

Infinite sample:

\[p^W(1-p)^L\] Posterior probability:

\[P(p|W,L) = \frac{(W+L+1)!}{W!L!}p^W(1-p)^L\]

  • This is a Beta distribution, and the likelihood was a Binomial.
  • The minimum sample size for Bayesian analysis is 1.
  • The shape of the posterior distribution embodies the sample size
  • No point estimate, we work with the entire posterior distribution
  • The distribution is the estimate; always use the entire distribution, never a single point
  • The fact that an arbitrary interval contains an arbitrary value is not meaningful
  5. Analyze sample, summarize
  • Implications depend on entire posterior
  • Average over the uncertainty of the posterior
  • What can we do with the posterior distribution?
    • We can take samples from it, and then do calculations with the samples
  • Posterior Prediction
    • Given what we’ve learned, what would happen if we took more samples?
    • Sampling distribution (predictive distribution) of draws represents the likelihood of each outcome in a new experiment for a particular value of \(p\)
    • The posterior predictive distribution then represents the entire distribution of the statistic of interest, and contains all the uncertainty around that estimate (analogous to the sampling distribution of a statistic (e.g., mean) in the frequentist paradigm, except this is completely model-driven by the posterior instead of based on asymptotics in the frequentist approach)
    • Sampling turns calculus into a data summary problem; this is important when models get complex and numerically intractable to compute by hand
  • This generative, Bayesian framework is the optimal approach for causal estimation if your model is correct.
  • It honestly carries out the assumptions we put into it, using logical implications
  • Quantitative framework/asset that activates our qualitative knowledge as scientists, subject matter experts, etc. Lets the subjective and objective work together. Subjectivity is expertise.

Misclassification

  • Use circles around variable in DAG to represent unobserved vs. observed variables

  • Imagine the true number of water samples (W) are unobserved (e.g., measurement error, data error, etc.)

  • We observe a contaminated W (called W*) that is the misclassified sample

  • W* is caused by the measurement process M. We can get back to the correct posterior distribution for p if we model the measurement process M when using W*.

  • The posterior is honest about the uncertainty induced by the misclassification process

  • When there is measurement error, model it instead of ignoring it (same for missing data, compliance, inclusion)

  • Key point: Samples do not need to be representative of the population to provide good estimates, since we can correct them through our causal diagram (modeling the source, sampling process, etc.)

  • This concept may also arise if, for example, the globe was not spun equally likely for every point to be selected.
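
To make the misclassification idea concrete, here is a minimal generative sketch of a contaminated sampling process (my own variation on the sim_globe function above; the error rate x is a made-up parameter):

Code
# Sketch: each true observation is recorded incorrectly with probability x
sim_globe_misclassified <-
  function(p = .7, N = 9, x = .1) {
    true_sample <- sample(c("W", "L"), size = N, prob = c(p, 1 - p), replace = TRUE)
    # The observed W* flips the true category with probability x
    ifelse(runif(N) < x, ifelse(true_sample == "W", "L", "W"), true_sample)
  }
sim_globe_misclassified()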

3. Geocentric Models

| Week # | Lecture # | Chapter(s) | Week End  | Notes Taken |
|--------|-----------|------------|-----------|-------------|
| 2      | 3         | 4          | 1/13/2023 | 1/10/2023   |

Summary

We don’t actually need data to construct a model. Our prior distributions, which account for our baseline knowledge about what reasonable values for unknown parameters may be, can produce estimates on their own. A bare minimum strategy is to choose these such that before seeing data, the output of our model produces scientifically reasonable results; there is no reason to allow our model to produce results that we know cannot happen. Then, our data can be introduced to help guide the parameters to an area of focus. In this sense (thinking of the example of points bumping around in parameter space), the data we collect is really just a tool for our model: the model is the central focus, and the data just helps the model go to where it needs to go. Also, the idea that there are no correct priors and that priors are just (normalized) posteriors from previous data makes the idea of Bayesian updating very intuitive. It will be interesting to see in coming lectures how we can extend this linear model framework to more “real life” problems with observational data that have potentially tens or hundreds or thousands of potential drivers, and strategies for accounting for the most important ones. Obviously these basic examples are great to build a foundation, but it seems like a huge (sometimes impossible) hurdle to have the time and resources to be able to fully vet out expert-driven causal diagrams and generative models that fully account for all the things, especially in fast-paced environments when everyone is just so busy and there are so many projects to attend to. I’d imagine this is one of the reasons why frequentist analysis persists so much (at least in medical research): because it’s the way it’s been done and therefore you can get more things done faster, even though in an ideal state a Bayesian approach is the right way to go. Definitely something I’ve thought about time and time again: how can we balance the rigor and detail needed to construct the appropriate models to achieve better inference while still being efficient with people’s time? Part of it probably has to do with proving to stakeholders that the inference gained from the “quicker” way is less informative (or just plain wrong) compared to the more involved approach.

Notes

  • Statistical models can attain arbitrarily accurate predictions without having any explanation or accurate structure (i.e., the model is just plain wrong, but happens to produce accurate predictions at the right time)
    • An example of this is a previous explanation of the orbital pattern of Mars: assuming Earth is at the center (geocentric), Mars orbits around Earth but also follows its own local orbit (epicycles). Using this model, they got very accurate predictions, but the mechanism is completely wrong.
    • Orbits are actually elliptical and around the sun, not Earth
    • Even though the first one predicts accurately, because the structure/mechanism is wrong, it doesn’t extend or generalize to other things. However, the correct mechanism is able to explain orbit patterns of all planets in the solar system.
  • Linear regression is a large class of statistical golems
    • Geocentric: describes associations, makes good predictions; mechanistically always wrong (but useful), very good approximation; meaning doesn’t depend on the model, it depends on an external causal model. Nothing wrong with it unless you actually believe it is the true mechanism.
    • Gaussian: Abstracts away from detail of general error model; mechanistically silent. General argument about symmetry of error.

Gaussian

  • Example: Flip a coin; each person takes a step to the left or right depending on heads/tails; measure distance from center. This makes a normal distribution. Why?
    • There are more ways for a sequence of coin tosses to get you close to the middle than there are to get you to the left or right
    • Many natural processes attract to this behavior because it is adding together small differences
  • Two arguments:
    • Generative: summed fluctuations tend towards normal. Ex. growth–added fluctuations over time, same age weight tends to be gaussian
    • Inferential: estimating mean/variance. Best to use since least informative (maximum entropy)
  • Variable does not need to be normally distributed for normal model to be useful. Machine for estimating mean/variance. Contains the least assumptions. (central limit theorem)
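
A quick simulation of the coin-flip walk described in the bullets above (my own toy version): summing many small left/right steps produces an approximately Gaussian spread of final positions.

Code
# 1000 people each take 16 random +/- 1 steps; the final positions look Gaussian
set.seed(123)
positions <- replicate(1000, sum(sample(c(-1, 1), size = 16, replace = TRUE)))
hist(positions, breaks = 20, xlab = "Distance from center", main = "Sums of small fluctuations")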

Skills/Goals for Lecture

  1. Learn a standardized language for representing models (generative and statistical)
  2. Calculate posteriors with multiple unknown parameters
  3. How to construct and understand linear models; how to construct posterior predictions from them

Reminder of the owl

  1. State a clear question; descriptive, causal, anything; but needs to be clear
  2. Sketch causal assumptions using DAGs; good way for non-theorists to realize they have a lot of subject knowledge and can get it on paper
  3. Define a generative model; generates synthetic observations
  4. Use generative model to build estimator; causal/generative assumptions embedded
  5. Test, analyze
  6. Profit: we realize our model was useful, or terrible; either way we gain something

Describing models

  1. Lists variables
  2. Define each variable as a deterministic or distributional function of other variables

Exercise

  1. Goal: Describe the association between adult weight and height
  2. Height causes weight H–>W<–(U) (unobserved influences on body weight)
  3. Generative/scientific model: \(W=f(H,U)\), \(W=\beta H + U\)
Code
sim_weight <-
  function(H,b,sd) {
    U <- rnorm(length(H),0,sd)
    W<-b*H + U
    return(W)
  }
# Generate height
H <- runif(200,130,170)
W <- sim_weight(H, b=.5, sd= 5)
plot(W~H,col=2, lwd = 3)

\[W_i=\beta H_i + U_i\] \[U_i \sim Normal(0,\sigma)\] \[H_i \sim Uniform(130, 170)\]

  4. Statistical model (estimator)

  • We want to estimate how the average weight changes with height.

\[E(W_i|H_i)=\alpha + \beta H_i\]

  • Posterior distribution

\[P(\alpha, \beta, \sigma|H_i,W_i) = \frac{P(W_i|H_i,\alpha,\beta,\sigma)P(\alpha,\beta,\sigma)}{Z}\]

  • Gives the posterior probability of a specific regression line
    • Likelihood: Number of ways we could produce \(W_i\), given a line
    • Prior: The previous posterior distribution; normalized number of ways previous data could have been produced.

\[W_i \sim Normal(\mu_i, \sigma)\] \[\mu_i = \alpha + \beta H_i\]

  • Generally more useful to look at the lines (parameter implications together), instead of individual parameters

  • Quadratic approximation

    • Approximate the posterior distribution using a multivariate Gaussian distribution
    • Use the quap function in the rethinking package
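
A minimal sketch of fitting that estimator with quap on the simulated H and W from above (the priors here are rough placeholders; choosing them more carefully is the subject of the next section):

Code
# Sketch: quadratic approximation of the posterior for the weight-on-height model
library(rethinking)
m_hw <-
  quap(
    alist(
      W ~ dnorm(mu, sigma),  # likelihood
      mu <- a + b * H,       # linear model for the mean
      a ~ dnorm(0, 10),      # placeholder priors
      b ~ dunif(0, 1),
      sigma ~ dunif(0, 10)
    ),
    data = list(W = W, H = H)
  )
precis(m_hw) # posterior summaries; b should land near the simulated value of .5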

Prior Predictive Distribution

  • Should express scientific knowledge, but softly
  • We can make the model make predictions without using data
  • Not make ranges that represent the data, but rather just those that make sense based on current knowledge
  • Account for basic reasonable constraints: in general, taller people weigh more, and weight (kg) is numerically less than height (cm), so \(\beta\) is probably in \([0,1]\).
  • Use these to define some lines based on the assumptions
Code
n <- 1000
a <- rnorm(n,0,10)
b <- runif(n,0,1)
plot(NULL,xlim=c(130,170),ylim=c(50,90),xlab="height(cm)",ylab="Weight(kg)")
for (j in 1:50) abline(a=a[j],b=b[j],lwd=2,col=2)

  • Some of these are probably not plausible (e.g., high height with low weight). Slopes look good but not intercept
  • We can adjust as needed to create what makes sense
  • There are no correct priors; only scientifically justifiable priors
  5. Validate Model
  • Bare minimum to test statistical model
  • Not because you wrote it, more so to make sure your model works
  6. Analyze data
  • Plug in your data set into your process
  • Parameters are not independent, can’t interpret as such
  • Push out posterior predictions

4. Categories and Curves

| Week # | Lecture # | Chapter(s) | Week End  | Notes Taken |
|--------|-----------|------------|-----------|-------------|
| 2      | 4         | 4          | 1/13/2023 | 1/11/2023   |

Summary

The idea of total vs. direct effects is about specifying the statistical model that will allow you to observe the complete effect (i.e., including differences that could be explained by something else in the model) compared to parsing out differences explained by the variable after adjusting for effects explained through other variables. In the lecture example, the total causal effect of sex on weight was determined by using a (Bayesian) intercept-only model, which showed a considerable difference in mean weights between males/females. However, when assessing the direct causal effect, a parameter was added to fit separate slopes for males/females in order to block out the effect of sex on weight that is observed through other causes (in this case, height), such that the resulting estimator looked at mean differences in weight at each height; the posterior distribution for this difference yielded little to no direct effect, indicating that most of the difference in weight between males/females is due to height differences. Another interesting aspect of this lecture was how to think about which way an arrow should go when drawing the causal diagram. We should think of the interventions we are willing to consider, and which make logical sense. For example, we drew \(H \rightarrow W\) because, given a height, it makes sense to employ interventions (such as a weight loss program, exercise, etc.) that could presumably impact the resulting weight, but it doesn’t make a lot of sense to think of trying to change someone’s height given their weight. Also, when declaring something a cause, generally you first want to think about whether an intervention can be employed; if it can’t, the variable can still make sense as a cause if it is a proxy for something else (e.g., age encapsulates time, among many other things that presumably do cause height). We can use flexible curves to fit things (e.g., splines), but we want to make sure we vet out any erroneous areas where estimates don’t make sense, and add necessary restrictions to alleviate them. So far, these lectures have given great optimism and excitement for how to approach modeling. I want to be confident in the models I produce, and I think the generative framework is the right approach to be able to believe in the results you are producing. I see so much published research from observational data that declares something statistically significant for a given research hypothesis and says “we adjusted for all these confounders”. Even if I feel fine about the math/statistical procedure, I’m always skeptical about the conclusions that are drawn from it, and quite frankly, I don’t feel like it means much at all for really making a decision; there are just too many limitations about all sorts of things. The generative approach gives the tools and rigor to be much more confident in the results, and if we can be more demanding of that rigor, time, and energy, it should yield more benefit in the long run. I’d rather spend more time getting to a confident conclusion than just pumping out results.

A check for understanding

During (and after) the lecture, it took me a while to gain intuition about what was happening in the generative simulation for the model:

flowchart LR
  A(Sex) --> C(Weight)
  A --> B(Height)
  B --> C

The code was written in the following way:

Code
# S = 1 female, S = 2 male
sim_HW <- function(S,b,a) {
  N <- length(S)
  H <- ifelse(S==1,150,160) + rnorm(N,0,5)
  W <- a[S] + b[S]*H + rnorm(N,0,5)
  data.frame(S,H,W)
}

# Generate data
set.seed(123)
S <- rbinom(100,1,.5) + 1
dat <- sim_HW(S, b=c(.5,.6), a=c(0,0))
head(dat)
  S        H        W
1 1 151.2666 79.57199
2 2 159.8573 99.75957
3 1 149.7856 76.55384
4 2 166.8430 95.06392
5 2 158.8711 94.72542
6 1 157.5824 77.38920

First, the indexing used here, b[S], seemed odd because b is a vector of length 2 and S is a vector of length 100. But all it is doing is making a vector of length 100 by looking up the element of b at each spot (since S is either 1 or 2). I didn’t know you could index like that in R, but I guess you learn something every day. Anyway, that was not the real thing that confused me.
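
A tiny illustration of that indexing behavior:

Code
# Indexing a length-2 vector by a vector of 1s and 2s returns one element per index
b_demo <- c(.5, .6)
S_demo <- c(1, 2, 2, 1, 2)
b_demo[S_demo] # .5 .6 .6 .5 .6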

In the lecture, he states that the a term represents the direct effect of sex on weight, and the b term represents the indirect effect (i.e., proportionality/slope for each sex). It’s clear that there are separate lines created for each sex, and you can see the form of an intercept and slope for each one. In my mind, I’m thinking this has to be similar to an interaction in the model, but it wasn’t intuitive to me how this really played out and/or there was something different going on here. After some thought on a notepad, it is exactly what I was thinking–just a linear model with an interaction term between sex and height, though it is reparameterized a little to create the symmetry of effects as discussed in the lecture. Anyway, here is how it translates:

Currently, we have that the sex indicators are as follows:

\[S = 1 (female), 2(male)\] Then, the effect of height on weight for each sex is as follows:

\[b=(b_{S_1},b_{S_2}) = (0.5, 0.6)\] Finally, the intercept within each line is:

\[a=(a_{S_1},a_{S_2})=(0,0)\] This leads to:

\[W_{S_1} = a_{S_1} + b_{S_1}H + \epsilon_i = .5H+\epsilon_i\] \[W_{S_2} = a_{S_2} + b_{S_2}H + \epsilon_i = .6H + \epsilon_i\] We could think of this as a single model equation with four (4) regression coefficients looking like the following:

\[W = \beta_1 S_1 + \beta_2 S_2 + \beta_3 H \times S_1 + \beta_4 H \times S_2 + \epsilon_i\] where

\[S_1 = 1 \text{ if female; 0 otherwise}\] \[S_2 = 1 \text{ if male; 0 otherwise}\] There is no intercept term in the model. Instead, there are symmetric parameterizations for males and females, instead of making the effects relative to one another (which, as mentioned in the lecture, makes it more intuitive to make priors for). The design matrix for this model would then look something like:

Code
tribble(
  ~S1, ~S2, ~H_S1, ~H_S2, ~W,
  1, 0, 150, 0, 80,
  0, 1, 0, 160, 90,
  0, 1, 0, 140, 70,
  1, 0, 165, 0, 75
)
# A tibble: 4 × 5
     S1    S2  H_S1  H_S2     W
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     0   150     0    80
2     0     1     0   160    90
3     0     1     0   140    70
4     1     0   165     0    75

Basically, whenever S1 is 1 (female), S2 is 0 (male), and the corresponding H column for the other sex is 0 as well.

I may have to rethink what I just wrote a bit, as the \(S_1\) and \(S_2\) columns are actually completely redundant, so I’m not sure if this is right yet.

So how would we parameterize this in a more classical regression model? We just need to rewrite a few things. Let:

\[S=(0(female), 1(male))\] Then we assume the model is:

\[W = \beta_0+\beta_1 S+\beta_2 H+\beta_3 S\times H + \epsilon_i\] If we translate parameter values from the example,

\[\beta_0 = 0 \hskip.1in \beta_1 = 0\] \[\beta_2 = .5 \hskip.1in \beta_3 = .6-.5 = .1\] We then get:

\[ \begin{equation} \begin{split} W_{S=0(female)} &= \beta_0 + \beta_1(0) + \beta_2H+\beta_3 0 \times H + \epsilon_i \\ &= 0 + 0(0) + .5H + .1(0 \times H) + \epsilon_i \\ &= 0.5H + \epsilon_i \end{split} \end{equation} \]

\[ \begin{equation} \begin{split} W_{S=1(male)} &= \beta_0 + \beta_1(1) + \beta_2H+\beta_3 1 \times H + \epsilon_i \\ &= 0 + 0(1) + .5H + .1(1 \times H) + \epsilon_i \\ &= 0.5H + 0.1H + \epsilon_i \\ &= 0.6H + \epsilon_i \end{split} \end{equation} \]

This makes it much more clear why the direct effect is zero, since the main effect of sex is zero in this model. We only see an effect from sex through the interaction with height, which is what is known as the indirect effect.
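
As a quick sanity check on that translation (outside the course’s Bayesian workflow), an ordinary least-squares fit with a sex-by-height interaction on the simulated dat should roughly recover those coefficients; the slope terms should land near .5 and .1, while the intercept terms are noisy because H is far from zero.

Code
# Classical interaction parameterization on the simulated data (S = 1 is the reference level)
classical_fit <- lm(W ~ S * H, data = transform(dat, S = factor(S)))
coef(classical_fit) # expect H near .5 and S2:H near .1; intercept terms are noisy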

Homework

  1. From the Howell1 dataset, consider only the people younger than 13 years old. Estimate the causal association between age and weight. Assume age influences weight through two paths. First, age influences height, and height influences weight. Second, age directly influences weight through age-related changes in muscle growth and body proportions. Draw the DAG that represents these causal relationships. And then write a generative simulation that takes age as an input and simulates height and weight, obeying the relationships in the DAG.

First, we’ll import the dataset directly from the package’s Github repository. We will filter to those < 13 years old, and convert height to inches, and weight to lbs:

Code
howell1 <-
  read_delim(
    file = "https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv",
    delim = ";"
  ) %>%
  
  # Keep those younger than 13
  filter(age < 13) %>%
  
  # Convert the units (more intuitive for me)
  mutate(
    height = height / 2.54,
    weight = weight * 2.205
  )
Rows: 544 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
dbl (4): height, weight, age, male

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
howell1
# A tibble: 146 × 4
   height weight   age  male
    <dbl>  <dbl> <dbl> <dbl>
 1   48     43.3  12       1
 2   41.5   30.8   8       0
 3   34     23.1   6.5     0
 4   43     35.3   7       0
 5   45     39.4  11       1
 6   48     45.0   8       1
 7   50.8   51.5  12       0
 8   38.5   29.3   5       0
 9   43.5   34.0   9       0
10   38.5   28.1   5       0
# ℹ 136 more rows

Next, we can write the causal diagram as described:

flowchart LR
  A(Age) --> C(Weight)
  A --> B(Height)
  B --> C

If we let

\[A=Age \hskip.1in H=Height \hskip.1in W=Weight\]

Then, we assume that:

\[H = f_H(A) \hskip.2in W = f_W(A,H)\] Finally, we can write the generative simulation to produce synthetic data governed by the DAG:

Code
# Simulating synthetic children
sim_children <-
  function(N) {
    
    # 1. Generate uniform ages
    A <- runif(N, 0, 13)
    
    # 2. Generate heights as a linear combination of age
    H <- 22 + 2*A + rnorm(N, mean = 0, sd = 1)
    
    # 3. Generate weights as a linear combination of age and height
    W <- .8*H + rnorm(N, mean = 0, sd = .5)
    
    # Make a data frame
    tibble(A, H, W)
  }

We first generate ages uniformly from 0 to 13, so

\[A \sim Uniform(0,13)\]

Then, we generate heights from a normal distribution with means that are linearly related to the age.

\[H \sim Normal(\mu =22 + 2 \times A, \sigma = 1)\] Notice the intercept term, which ensures a positive height for children who are 0 years old. Note: a distribution like the Gamma may be better here to ensure we don’t get negative heights, and for younger ages I would assume the distribution of heights has a little right-skew. In any case, we’ll move forward with the Normal distribution here for simplicity’s sake.
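As an aside (my own sketch, not part of the homework), a Gamma version of the height step could look something like this, matching the same mean while guaranteeing positive, slightly right-skewed heights:

Code
# Sketch: Gamma-distributed heights with the same mean as the Normal version
# (mean = 22 + 2*A) and a chosen sd, so heights can never be negative
sim_height_gamma <- function(A, sd = 1) {
  mu <- 22 + 2 * A
  shape <- (mu / sd)^2   # Gamma mean = shape/rate, variance = shape/rate^2
  rate  <- mu / sd^2
  rgamma(length(A), shape = shape, rate = rate)
}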

Then we generate the weights as a linear function of the observed age and heights.

\[W \sim Normal(\mu=.8H, \sigma = .5)\] We assume a linear relationship between weight and height, and that for any age the increase in mean weight per inch of height is the same. In fact, the direct effect of age is set to 0 here, so in this simulation age influences weight only through height.

Code
# Simulate some children
set.seed(123)
simmed_children <- sim_children(200)
plot(simmed_children)

  2. Use a linear regression to estimate the total causal effect of each year of growth on weight.

  3. Now suppose the causal association between age and weight might be different between boys and girls. Use a single linear regression, with a categorical variable for sex, to estimate the total causal effect of age on weight separately for boys and girls. How do boys and girls differ? Provide one or more posterior contrasts as a summary.
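I haven’t worked these two through yet, but as a rough sketch of one way to get started (my own, using lm() as a quick frequentist stand-in; a full answer would fit the Bayesian models with priors and summarize posterior contrasts):

Code
# 2. Total causal effect of each year of growth on weight, using the simulated
#    children (slope ~ expected change in weight per year of age)
coef(lm(W ~ A, data = simmed_children))

# 3. Stratify by sex in the real data with index-style coding, giving separate
#    intercepts and age slopes for girls and boys; the slope contrast summarizes
#    how the total effect of age differs by sex
howell1_sex <- howell1 %>% mutate(sex = factor(male, labels = c("girl", "boy")))
coef(lm(weight ~ 0 + sex + sex:age, data = howell1_sex))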

Notes

  • The linear regression can approximate anything, so we need to design it with the causal model in mind
  • With generative models, we’ll often have multiple estimands
  • Need post-processing of the posterior distribution to gain inference from the joint distribution
  • We require categories, splines, etc. to build causal estimators
  • Need to stratify by category to get at the estimands we want (separate lines)

Example

  • Extend example above to include patient sex, age
  • Need to determine how height, weight, sex are causally related (add to DAG), and statistically related
  • To determine which way the arrows go, think about the interventions you’re willing to consider
  • Don’t have to draw them, but the unobserved causes of each variable are implied
    • These are ignorable unless shared across variables
    • Ex. temperature is a cause of sex and weight in some species
  • What is the causal effect of S on W?
    • Accounts for direct and indirect effect
    • We can also ask what is the direct causal effect of S on W?
    • These questions require different models
  • Generally want to assign the same prior for parameters for each category level (below)
    • Using indexing is advantageous because you have symmetry such that all parameters can get the same prior, they are all interpreted the same within their levels
    • Using indicators makes parameters relative to other levels, which forces you to put different priors on the adjustment parameters (one parameter is an average, the other is an adjustment to that average)

\[W_i \sim Normal(\mu_i, \sigma) \hskip.1in \mu_i=\alpha_{S[i]}\] \[\alpha = [\alpha_1, \alpha_2] \hskip.1in \alpha_j \sim Normal(60,10)\]

Total Causal Effect

  • Simulate one data set of all males, another of all females, look at the average difference in weight
    • This is the actual causal effect
    • Then you can generate a random data set, run the modeling process, and then ensure that the model provides the expected estimate
  • Look at the posterior distribution of the mean difference, and randomly draw samples from the individual posteriors and compute the differences to answer questions like “what is the probability that a randomly selected male will be heavier than a randomly selected female?”
  • This was basically just an intercept-only model for sex, and the effect due to height would be captured in that difference
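A tiny sketch of the first bullet above (my own toy numbers, reusing the 0.5/0.6 slopes from the earlier example): simulate an all-female and an all-male sample and compare mean weights; the difference is the total causal effect of sex on weight in this toy system, including the part transmitted through height.

Code
# Toy simulation of the total effect: intervene on sex and simulate both worlds
set.seed(42)
n <- 1e4
H_f <- rnorm(n, mean = 150, sd = 7); W_f <- 0.5 * H_f + rnorm(n, sd = 2)
H_m <- rnorm(n, mean = 160, sd = 7); W_m <- 0.6 * H_m + rnorm(n, sd = 2)
mean(W_m) - mean(W_f)   # total effect, including the part transmitted via height
mean(W_m > W_f)         # P(random male heavier than random female) in this toy system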

Direct Effect

  • How do we partial out the indirect effect of Height (block it)?
    • Stratify by height to block the association between S and W that is transmitted through H
  • The direct effect shows up as the difference in intercepts; the indirect effect shows up as the differences in slopes
  • Here, the model allows for separate slopes by sex, so we can tease out the impact of height
    • Center the height to make the interpretation of the intercept be the average
    • Makes priors more intuitive, and computation easier

\[W_i \sim Normal(\mu_i, \sigma) \hskip.1in \mu_i=\alpha_{S[i]} + \beta_{S[i]}(H_i-\bar{H})\] \[\alpha=[\alpha_1,\alpha_2] \hskip.1in \beta=[\beta_1,\beta_2]\]

  • In this case, nearly all of the total effect of sex on weight is explained through height: the direct effect (the posterior difference in weight between the sexes at each height) is nearly 0 at all heights
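A quick sketch of the mechanics (my own, using lm() on the children’s subset just to illustrate the coding; the lecture’s conclusion above is about the adult sample):

Code
# Stratify by (centered) height with index-style coding: the sex-specific
# intercepts now compare weights at the average height (the direct effect
# there), while the slopes capture the height relationship within each sex
d <- howell1 %>%
  mutate(
    sex = factor(male, labels = c("girl", "boy")),
    H_c = height - mean(height)
  )
coef(lm(weight ~ 0 + sex + sex:H_c, data = d))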

Curve Fitting

  • We use linear models to do this; i.e., it’s not mechanistic, but we use it wisely
  • Strategies
    • Polynomials: Don’t do it; no local smoothing, only global; they learn too much from data in regions that lie far away
      • It’s not worth having a model that looks OK for most of the data but that we know is flat-out wrong in places (e.g., at some point the parabola says babies get heavier as their height decreases, which we know is false); even though this affects only a small portion of observations, it’s still knowingly wrong, so why use it?
    • Splines & GAMs: Not as bad as polynomials; they add together many locally trained terms

Splines

  • Flexible curve that will find trends
  • B-splines are linear models containing additive terms with synthetic variables
    • Think of it as a collection of individual curves (basis functions); each basis function is non-zero only over a particular region of x, and the spline is the weighted sum of the curves at each point

\[\mu_i = \alpha + w_1B_{i,1} + w_2B_{i,2} + ...\]
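A small sketch of that idea (my own, using the splines package that ships with R and the simulated children from the homework): the basis columns are just synthetic predictors in an ordinary linear model.

Code
# Build a B-spline basis over age and fit it as a plain linear model; each
# basis column is non-zero only over part of the age range
library(splines)
fit_spline <- lm(W ~ bs(A, df = 5), data = simmed_children)
coef(fit_spline)   # alpha plus the five basis weights w_1..w_5

# Visualize the basis functions themselves
B <- bs(simmed_children$A, df = 5)
ord <- order(simmed_children$A)
matplot(simmed_children$A[ord], B[ord, ], type = "l",
        xlab = "Age", ylab = "Basis value")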

  • The ideal model for age/height would account for what we know about human biology: infant, toddler, adolescent, adult. In the first three stages we expect only upward growth, so we should constrain the curve accordingly.

Full Luxury Bayes

  • An equivalent approach is to use one model for the entire causal system
  • Then run simulations from the overall system to get answers to specific queries