I really love the elegance of the nest functionality with the tidyr package. It really allows you to abstract the meaning of a data frame to not just contain rectangular data with scalars, but rather a generalization that has rectangular data of objects. The most intriguing part of it to me is the way we can continue to use typical join operations even with complex objects in some of the columns, which makes it so smooth and intuitive to do complex data operations.
Code
# Load packageslibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
For example, lets say we have a dataset.
Code
dat <- cheese::heart_diseasedat
# A tibble: 303 × 9
Age Sex ChestPain BP Cholesterol BloodSugar MaximumHR
<dbl> <fct> <fct> <dbl> <dbl> <lgl> <dbl>
1 63 Male Typical angina 145 233 TRUE 150
2 67 Male Asymptomatic 160 286 FALSE 108
3 67 Male Asymptomatic 120 229 FALSE 129
4 37 Male Non-anginal pain 130 250 FALSE 187
5 41 Female Atypical angina 130 204 FALSE 172
6 56 Male Atypical angina 120 236 FALSE 178
7 62 Female Asymptomatic 140 268 FALSE 160
8 57 Female Asymptomatic 120 354 FALSE 163
9 63 Male Asymptomatic 130 254 FALSE 147
10 53 Male Asymptomatic 140 203 TRUE 155
# ℹ 293 more rows
# ℹ 2 more variables: ExerciseInducedAngina <fct>, HeartDisease <fct>
And we want to compute age percentiles by sex for those who do and don’t have heart disease.
Code
# Nest data frames within sex, heart diseasenested1 <- dat %>%group_by(Sex, HeartDisease) %>%nest()nested1
# A tibble: 4 × 3
# Groups: Sex, HeartDisease [4]
Sex HeartDisease data
<fct> <fct> <list>
1 Male No <tibble [92 × 7]>
2 Male Yes <tibble [114 × 7]>
3 Female No <tibble [72 × 7]>
4 Female Yes <tibble [25 × 7]>
We can see that there is now a separate dataset available within each combination of sex and heart disease status in the form of a list column.
Code
# Get the empirical cumulative density function for agenested2 <- nested1 %>%mutate(ecdf_col = data %>%map(~ecdf(.x$Age)) )nested2
# A tibble: 4 × 4
# Groups: Sex, HeartDisease [4]
Sex HeartDisease data ecdf_col
<fct> <fct> <list> <list>
1 Male No <tibble [92 × 7]> <ecdf>
2 Male Yes <tibble [114 × 7]> <ecdf>
3 Female No <tibble [72 × 7]> <ecdf>
4 Female Yes <tibble [25 × 7]> <ecdf>
We then apply list operations as we normally would. In this case, we use purrr::map to create an empirical cumulative density function for age within each group. The result is then just a list of ecdf functions.
Code
# Make an age gridage_grid <- dat %>%select(Sex, HeartDisease) %>%distinct() %>%inner_join(y =tibble(Age =c(40, 50, 60, 70)),by =character() )
Warning: Using `by = character()` to perform a cross join was deprecated in dplyr 1.1.0.
ℹ Please use `cross_join()` instead.
Code
age_grid
# A tibble: 16 × 3
Sex HeartDisease Age
<fct> <fct> <dbl>
1 Male No 40
2 Male No 50
3 Male No 60
4 Male No 70
5 Male Yes 40
6 Male Yes 50
7 Male Yes 60
8 Male Yes 70
9 Female No 40
10 Female No 50
11 Female No 60
12 Female No 70
13 Female Yes 40
14 Female Yes 50
15 Female Yes 60
16 Female Yes 70
We then made an age grid for each sex/heart disease combination to evaluate the percentiles of each age in the respective groups. Now, we can compute the percentiles by joining to get the ecdf for the respective group, and plugging in each age into the function.
Code
age_grid %>%# Join to get the ecdf for the groupinner_join(y = nested2 %>%select(-data),by =c("Sex", "HeartDisease") ) %>%# Compute the percentile for the given agemutate(Percentile =map2(.x = ecdf_col, .y =as.list(Age), .f =~.x(.y)) %>%flatten_dbl() )
# A tibble: 16 × 5
Sex HeartDisease Age ecdf_col Percentile
<fct> <fct> <dbl> <list> <dbl>
1 Male No 40 <ecdf> 0.0761
2 Male No 50 <ecdf> 0.424
3 Male No 60 <ecdf> 0.859
4 Male No 70 <ecdf> 1
5 Male Yes 40 <ecdf> 0.0526
6 Male Yes 50 <ecdf> 0.246
7 Male Yes 60 <ecdf> 0.719
8 Male Yes 70 <ecdf> 0.991
9 Female No 40 <ecdf> 0.0694
10 Female No 50 <ecdf> 0.361
11 Female No 60 <ecdf> 0.694
12 Female No 70 <ecdf> 0.931
13 Female Yes 40 <ecdf> 0
14 Female Yes 50 <ecdf> 0.04
15 Female Yes 60 <ecdf> 0.52
16 Female Yes 70 <ecdf> 1
We can see, for example, that a 60 year old male is at the \(86^{th}\) percentile for those without heart disease, but at the \(72^{nd}\) for those who due, suggesting that the age distribution tends to be higher in patients with heart disease.
Source Code
---title: "Nesting with {tidyr}"author: "Alex Zajichek"date: "2/2/2023"image: "feature.png"categories: - Data Wranglingformat: html: code-fold: true code-tools: true---I really love the elegance of the [`nest`](https://tidyr.tidyverse.org/reference/nest.html) functionality with the [`tidyr`](https://tidyr.tidyverse.org/) package. It really allows you to abstract the meaning of a data frame to not just contain rectangular data with scalars, but rather a generalization that has rectangular data of _objects_. The most intriguing part of it to me is the way we can continue to use typical [`join`](https://dplyr.tidyverse.org/reference/mutate-joins.html) operations even with complex objects in some of the columns, which makes it so smooth and intuitive to do complex data operations.```{r}# Load packageslibrary(tidyverse)```For example, lets say we have a dataset.```{r}dat <- cheese::heart_diseasedat```And we want to compute age percentiles by sex for those who do and don't have heart disease.```{r}# Nest data frames within sex, heart diseasenested1 <- dat %>%group_by(Sex, HeartDisease) %>%nest()nested1```We can see that there is now a separate dataset available within each combination of sex and heart disease status in the form of a `list` column.```{r}# Get the empirical cumulative density function for agenested2 <- nested1 %>%mutate(ecdf_col = data %>%map(~ecdf(.x$Age)) )nested2```We then apply `list` operations as we normally would. In this case, we use `purrr::map` to create an empirical cumulative density function for age within each group. The result is then just a `list` of `ecdf` functions.```{r}# Make an age gridage_grid <- dat %>%select(Sex, HeartDisease) %>%distinct() %>%inner_join(y =tibble(Age =c(40, 50, 60, 70)),by =character() )age_grid```We then made an age grid for each sex/heart disease combination to evaluate the percentiles of each age in the respective groups. Now, we can compute the percentiles by joining to get the `ecdf` for the respective group, and plugging in each age into the function.```{r}age_grid %>%# Join to get the ecdf for the groupinner_join(y = nested2 %>%select(-data),by =c("Sex", "HeartDisease") ) %>%# Compute the percentile for the given agemutate(Percentile =map2(.x = ecdf_col, .y =as.list(Age), .f =~.x(.y)) %>%flatten_dbl() )```We can see, for example, that a 60 year old male is at the $86^{th}$ percentile for those without heart disease, but at the $72^{nd}$ for those who due, suggesting that the age distribution tends to be higher in patients with heart disease.