Nesting with {tidyr}

Data Wrangling

Author

Alex Zajichek

Published

February 2, 2023

I really love the elegance of the nest functionality with the tidyr package. It really allows you to abstract the meaning of a data frame to not just contain rectangular data with scalars, but rather a generalization that has rectangular data of objects. The most intriguing part of it to me is the way we can continue to use typical join operations even with complex objects in some of the columns, which makes it so smooth and intuitive to do complex data operations.

Code

# Load packages
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

For example, lets say we have a dataset.

Code

dat <- cheese::heart_disease
dat

# A tibble: 303 × 9
     Age Sex    ChestPain           BP Cholesterol BloodSugar MaximumHR
   <dbl> <fct>  <fct>            <dbl>       <dbl> <lgl>          <dbl>
 1    63 Male   Typical angina     145         233 TRUE             150
 2    67 Male   Asymptomatic       160         286 FALSE            108
 3    67 Male   Asymptomatic       120         229 FALSE            129
 4    37 Male   Non-anginal pain   130         250 FALSE            187
 5    41 Female Atypical angina    130         204 FALSE            172
 6    56 Male   Atypical angina    120         236 FALSE            178
 7    62 Female Asymptomatic       140         268 FALSE            160
 8    57 Female Asymptomatic       120         354 FALSE            163
 9    63 Male   Asymptomatic       130         254 FALSE            147
10    53 Male   Asymptomatic       140         203 TRUE             155
# ℹ 293 more rows
# ℹ 2 more variables: ExerciseInducedAngina <fct>, HeartDisease <fct>

And we want to compute age percentiles by sex for those who do and don’t have heart disease.

Code

# Nest data frames within sex, heart disease
nested1 <-
  dat %>%
  group_by(Sex, HeartDisease) %>%
  nest()
nested1

# A tibble: 4 × 3
# Groups:   Sex, HeartDisease [4]
  Sex    HeartDisease data              
  <fct>  <fct>        <list>            
1 Male   No           <tibble [92 × 7]> 
2 Male   Yes          <tibble [114 × 7]>
3 Female No           <tibble [72 × 7]> 
4 Female Yes          <tibble [25 × 7]>

We can see that there is now a separate dataset available within each combination of sex and heart disease status in the form of a list column.

Code

# Get the empirical cumulative density function for age
nested2 <-
  nested1 %>%
  mutate(
    ecdf_col = data %>% map(~ecdf(.x$Age))
  )
nested2

# A tibble: 4 × 4
# Groups:   Sex, HeartDisease [4]
  Sex    HeartDisease data               ecdf_col
  <fct>  <fct>        <list>             <list>  
1 Male   No           <tibble [92 × 7]>  <ecdf>  
2 Male   Yes          <tibble [114 × 7]> <ecdf>  
3 Female No           <tibble [72 × 7]>  <ecdf>  
4 Female Yes          <tibble [25 × 7]>  <ecdf>

We then apply list operations as we normally would. In this case, we use purrr::map to create an empirical cumulative density function for age within each group. The result is then just a list of ecdf functions.

Code

# Make an age grid
age_grid <-
  dat %>%
  select(Sex, HeartDisease) %>%
  distinct() %>%
  inner_join(
    y = tibble(Age = c(40, 50, 60, 70)),
    by = character()
  )

Warning: Using `by = character()` to perform a cross join was deprecated in dplyr 1.1.0.
ℹ Please use `cross_join()` instead.

Code

age_grid

# A tibble: 16 × 3
   Sex    HeartDisease   Age
   <fct>  <fct>        <dbl>
 1 Male   No              40
 2 Male   No              50
 3 Male   No              60
 4 Male   No              70
 5 Male   Yes             40
 6 Male   Yes             50
 7 Male   Yes             60
 8 Male   Yes             70
 9 Female No              40
10 Female No              50
11 Female No              60
12 Female No              70
13 Female Yes             40
14 Female Yes             50
15 Female Yes             60
16 Female Yes             70

We then made an age grid for each sex/heart disease combination to evaluate the percentiles of each age in the respective groups. Now, we can compute the percentiles by joining to get the ecdf for the respective group, and plugging in each age into the function.

Code

age_grid %>% 
  
  # Join to get the ecdf for the group
  inner_join(
    y = nested2 %>% select(-data),
    by = c("Sex", "HeartDisease")
  ) %>%
  
  # Compute the percentile for the given age
  mutate(
    Percentile = map2(.x = ecdf_col, .y = as.list(Age), .f = ~.x(.y)) %>% flatten_dbl()
  )

# A tibble: 16 × 5
   Sex    HeartDisease   Age ecdf_col Percentile
   <fct>  <fct>        <dbl> <list>        <dbl>
 1 Male   No              40 <ecdf>       0.0761
 2 Male   No              50 <ecdf>       0.424 
 3 Male   No              60 <ecdf>       0.859 
 4 Male   No              70 <ecdf>       1     
 5 Male   Yes             40 <ecdf>       0.0526
 6 Male   Yes             50 <ecdf>       0.246 
 7 Male   Yes             60 <ecdf>       0.719 
 8 Male   Yes             70 <ecdf>       0.991 
 9 Female No              40 <ecdf>       0.0694
10 Female No              50 <ecdf>       0.361 
11 Female No              60 <ecdf>       0.694 
12 Female No              70 <ecdf>       0.931 
13 Female Yes             40 <ecdf>       0     
14 Female Yes             50 <ecdf>       0.04  
15 Female Yes             60 <ecdf>       0.52  
16 Female Yes             70 <ecdf>       1

We can see, for example, that a 60 year old male is at the \(86^{th}\) percentile for those without heart disease, but at the \(72^{nd}\) for those who due, suggesting that the age distribution tends to be higher in patients with heart disease.