Nesting with tidyr

I really love the elegance of the nest functionality with the tidyr package. It really allows you to abstract the meaning of a data frame to not just contain rectangular data with scalars, but rather a generalization that has rectangular data of objects. The most intriguing part of it to me is the way we can continue to use typical join operations even with complex objects in some of the columns, which makes it so smooth and intuitive to do complex data operations.

# Load packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.0
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.2.1     ✔ stringr 1.4.1
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

For example, lets say we have a dataset.

dat <- cheese::heart_disease
dat
## # A tibble: 303 × 9
##      Age Sex    ChestPain           BP Cholest…¹ Blood…² Maxim…³ Exerc…⁴ Heart…⁵
##    <dbl> <fct>  <fct>            <dbl>     <dbl> <lgl>     <dbl> <fct>   <fct>  
##  1    63 Male   Typical angina     145       233 TRUE        150 No      No     
##  2    67 Male   Asymptomatic       160       286 FALSE       108 Yes     Yes    
##  3    67 Male   Asymptomatic       120       229 FALSE       129 Yes     Yes    
##  4    37 Male   Non-anginal pain   130       250 FALSE       187 No      No     
##  5    41 Female Atypical angina    130       204 FALSE       172 No      No     
##  6    56 Male   Atypical angina    120       236 FALSE       178 No      No     
##  7    62 Female Asymptomatic       140       268 FALSE       160 No      Yes    
##  8    57 Female Asymptomatic       120       354 FALSE       163 Yes     No     
##  9    63 Male   Asymptomatic       130       254 FALSE       147 No      Yes    
## 10    53 Male   Asymptomatic       140       203 TRUE        155 Yes     Yes    
## # … with 293 more rows, and abbreviated variable names ¹​Cholesterol,
## #   ²​BloodSugar, ³​MaximumHR, ⁴​ExerciseInducedAngina, ⁵​HeartDisease

And we want to compute age percentiles by sex for those who do and don’t have heart disease.

# Nest data frames within sex, heart disease
nested1 <-
  dat %>%
  group_by(Sex, HeartDisease) %>%
  nest()
nested1
## # A tibble: 4 × 3
## # Groups:   Sex, HeartDisease [4]
##   Sex    HeartDisease data              
##   <fct>  <fct>        <list>            
## 1 Male   No           <tibble [92 × 7]> 
## 2 Male   Yes          <tibble [114 × 7]>
## 3 Female No           <tibble [72 × 7]> 
## 4 Female Yes          <tibble [25 × 7]>

We can see that there is now a separate dataset available within each combination of sex and heart disease status in the form of a list column.

# Get the empirical cumulative density function for age
nested2 <-
  nested1 %>%
  mutate(
    ecdf_col = data %>% map(~ecdf(.x$Age))
  )
nested2
## # A tibble: 4 × 4
## # Groups:   Sex, HeartDisease [4]
##   Sex    HeartDisease data               ecdf_col
##   <fct>  <fct>        <list>             <list>  
## 1 Male   No           <tibble [92 × 7]>  <ecdf>  
## 2 Male   Yes          <tibble [114 × 7]> <ecdf>  
## 3 Female No           <tibble [72 × 7]>  <ecdf>  
## 4 Female Yes          <tibble [25 × 7]>  <ecdf>

We then apply list operations as we normally would. In this case, we use purrr::map to create an empirical cumulative density function for age within each group. The result is then just a list of ecdf functions.

# Make an age grid
age_grid <-
  dat %>%
  select(Sex, HeartDisease) %>%
  distinct() %>%
  inner_join(
    y = tibble(Age = c(40, 50, 60, 70)),
    by = character()
  )
## Warning: Using `by = character()` to perform a cross join was deprecated in dplyr 1.1.0.
## ℹ Please use `cross_join()` instead.
age_grid
## # A tibble: 16 × 3
##    Sex    HeartDisease   Age
##    <fct>  <fct>        <dbl>
##  1 Male   No              40
##  2 Male   No              50
##  3 Male   No              60
##  4 Male   No              70
##  5 Male   Yes             40
##  6 Male   Yes             50
##  7 Male   Yes             60
##  8 Male   Yes             70
##  9 Female No              40
## 10 Female No              50
## 11 Female No              60
## 12 Female No              70
## 13 Female Yes             40
## 14 Female Yes             50
## 15 Female Yes             60
## 16 Female Yes             70

We then made an age grid for each sex/heart disease combination to evaluate the percentiles of each age in the respective groups. Now, we can compute the percentiles by joining to get the ecdf for the respective group, and plugging in each age into the function.

age_grid %>% 
  
  # Join to get the ecdf for the group
  inner_join(
    y = nested2 %>% select(-data),
    by = c("Sex", "HeartDisease")
  ) %>%
  
  # Compute the percentile for the given age
  mutate(
    Percentile = map2(.x = ecdf_col, .y = as.list(Age), .f = ~.x(.y)) %>% flatten_dbl()
  )
## # A tibble: 16 × 5
##    Sex    HeartDisease   Age ecdf_col Percentile
##    <fct>  <fct>        <dbl> <list>        <dbl>
##  1 Male   No              40 <ecdf>       0.0761
##  2 Male   No              50 <ecdf>       0.424 
##  3 Male   No              60 <ecdf>       0.859 
##  4 Male   No              70 <ecdf>       1     
##  5 Male   Yes             40 <ecdf>       0.0526
##  6 Male   Yes             50 <ecdf>       0.246 
##  7 Male   Yes             60 <ecdf>       0.719 
##  8 Male   Yes             70 <ecdf>       0.991 
##  9 Female No              40 <ecdf>       0.0694
## 10 Female No              50 <ecdf>       0.361 
## 11 Female No              60 <ecdf>       0.694 
## 12 Female No              70 <ecdf>       0.931 
## 13 Female Yes             40 <ecdf>       0     
## 14 Female Yes             50 <ecdf>       0.04  
## 15 Female Yes             60 <ecdf>       0.52  
## 16 Female Yes             70 <ecdf>       1

We can see, for example, that a 60 year old male is at the \(86^{th}\) percentile for those without heart disease, but at the \(72^{nd}\) for those who due, suggesting that the age distribution tends to be higher in patients with heart disease.

Alex Zajichek
Alex Zajichek
Statistician & Data Scientist
comments powered by Disqus

Related