Nesting with tidyr
I really love the elegance of the nest
functionality with the tidyr
package. It really allows you to abstract the meaning of a data frame to not just contain rectangular data with scalars, but rather a generalization that has rectangular data of objects. The most intriguing part of it to me is the way we can continue to use typical join
operations even with complex objects in some of the columns, which makes it so smooth and intuitive to do complex data operations.
# Load packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
For example, lets say we have a dataset.
dat <- cheese::heart_disease
dat
## # A tibble: 303 × 9
## Age Sex ChestPain BP Cholest…¹ Blood…² Maxim…³ Exerc…⁴ Heart…⁵
## <dbl> <fct> <fct> <dbl> <dbl> <lgl> <dbl> <fct> <fct>
## 1 63 Male Typical angina 145 233 TRUE 150 No No
## 2 67 Male Asymptomatic 160 286 FALSE 108 Yes Yes
## 3 67 Male Asymptomatic 120 229 FALSE 129 Yes Yes
## 4 37 Male Non-anginal pain 130 250 FALSE 187 No No
## 5 41 Female Atypical angina 130 204 FALSE 172 No No
## 6 56 Male Atypical angina 120 236 FALSE 178 No No
## 7 62 Female Asymptomatic 140 268 FALSE 160 No Yes
## 8 57 Female Asymptomatic 120 354 FALSE 163 Yes No
## 9 63 Male Asymptomatic 130 254 FALSE 147 No Yes
## 10 53 Male Asymptomatic 140 203 TRUE 155 Yes Yes
## # … with 293 more rows, and abbreviated variable names ¹Cholesterol,
## # ²BloodSugar, ³MaximumHR, ⁴ExerciseInducedAngina, ⁵HeartDisease
And we want to compute age percentiles by sex for those who do and don’t have heart disease.
# Nest data frames within sex, heart disease
nested1 <-
dat %>%
group_by(Sex, HeartDisease) %>%
nest()
nested1
## # A tibble: 4 × 3
## # Groups: Sex, HeartDisease [4]
## Sex HeartDisease data
## <fct> <fct> <list>
## 1 Male No <tibble [92 × 7]>
## 2 Male Yes <tibble [114 × 7]>
## 3 Female No <tibble [72 × 7]>
## 4 Female Yes <tibble [25 × 7]>
We can see that there is now a separate dataset available within each combination of sex and heart disease status in the form of a list
column.
# Get the empirical cumulative density function for age
nested2 <-
nested1 %>%
mutate(
ecdf_col = data %>% map(~ecdf(.x$Age))
)
nested2
## # A tibble: 4 × 4
## # Groups: Sex, HeartDisease [4]
## Sex HeartDisease data ecdf_col
## <fct> <fct> <list> <list>
## 1 Male No <tibble [92 × 7]> <ecdf>
## 2 Male Yes <tibble [114 × 7]> <ecdf>
## 3 Female No <tibble [72 × 7]> <ecdf>
## 4 Female Yes <tibble [25 × 7]> <ecdf>
We then apply list
operations as we normally would. In this case, we use purrr::map
to create an empirical cumulative density function for age within each group. The result is then just a list
of ecdf
functions.
# Make an age grid
age_grid <-
dat %>%
select(Sex, HeartDisease) %>%
distinct() %>%
inner_join(
y = tibble(Age = c(40, 50, 60, 70)),
by = character()
)
## Warning: Using `by = character()` to perform a cross join was deprecated in dplyr 1.1.0.
## ℹ Please use `cross_join()` instead.
age_grid
## # A tibble: 16 × 3
## Sex HeartDisease Age
## <fct> <fct> <dbl>
## 1 Male No 40
## 2 Male No 50
## 3 Male No 60
## 4 Male No 70
## 5 Male Yes 40
## 6 Male Yes 50
## 7 Male Yes 60
## 8 Male Yes 70
## 9 Female No 40
## 10 Female No 50
## 11 Female No 60
## 12 Female No 70
## 13 Female Yes 40
## 14 Female Yes 50
## 15 Female Yes 60
## 16 Female Yes 70
We then made an age grid for each sex/heart disease combination to evaluate the percentiles of each age in the respective groups. Now, we can compute the percentiles by joining to get the ecdf
for the respective group, and plugging in each age into the function.
age_grid %>%
# Join to get the ecdf for the group
inner_join(
y = nested2 %>% select(-data),
by = c("Sex", "HeartDisease")
) %>%
# Compute the percentile for the given age
mutate(
Percentile = map2(.x = ecdf_col, .y = as.list(Age), .f = ~.x(.y)) %>% flatten_dbl()
)
## # A tibble: 16 × 5
## Sex HeartDisease Age ecdf_col Percentile
## <fct> <fct> <dbl> <list> <dbl>
## 1 Male No 40 <ecdf> 0.0761
## 2 Male No 50 <ecdf> 0.424
## 3 Male No 60 <ecdf> 0.859
## 4 Male No 70 <ecdf> 1
## 5 Male Yes 40 <ecdf> 0.0526
## 6 Male Yes 50 <ecdf> 0.246
## 7 Male Yes 60 <ecdf> 0.719
## 8 Male Yes 70 <ecdf> 0.991
## 9 Female No 40 <ecdf> 0.0694
## 10 Female No 50 <ecdf> 0.361
## 11 Female No 60 <ecdf> 0.694
## 12 Female No 70 <ecdf> 0.931
## 13 Female Yes 40 <ecdf> 0
## 14 Female Yes 50 <ecdf> 0.04
## 15 Female Yes 60 <ecdf> 0.52
## 16 Female Yes 70 <ecdf> 1
We can see, for example, that a 60 year old male is at the \(86^{th}\) percentile for those without heart disease, but at the \(72^{nd}\) for those who due, suggesting that the age distribution tends to be higher in patients with heart disease.