TL;DR
It is relatively easy to use dplyr::group_by
and summarise
to find items that you might want to keep or remove based on a part_of the item or group in question. I used to use split
and iterate, but group_by
is much easier.
Motivation
I have some relatively large sets of data that fall naturally into groups of items. Often, I find that I want to remove a group that contains either any of or all of particular items. Let’s create some data as an example.
library(dplyr)
set.seed(1234)
groups = as.character(seq(1, 1000))
grouped_data = data.frame(items = sample(letters, 10000, replace = TRUE),
groups = sample(groups, 10000, replace = TRUE),
stringsAsFactors = FALSE)
knitr::kable(head(grouped_data))
items | groups |
---|---|
c | 205 |
q | 485 |
p | 636 |
q | 960 |
w | 179 |
q | 289 |
In this example, we have the 26 lowercase letters, that are part of one of groups 1-1000. Now, we might want to keep any groups that contain at least one “a”, for example.
I would have previously used a split
on the groups, and then purrr::map_lgl
returning TRUE or FALSE to check if what we wanted to filter on was present, and then filter out the split groups, and finally put back together the full thing.
Group By
What I’ve found instead is that I can use a combination of group_by
, summarise
and then filter
to same effect, without splitting and iterating (yes, I know dplyr
is doing it under the hood for me).
# use group_by and summarize to find things we want
groups_to_keep = grouped_data %>%
group_by(groups) %>%
summarise(has_a = sum(items %in% "a") > 0) %>%
filter(has_a)
# filter on original based on above
grouped_data2 = grouped_data %>%
filter(groups %in% groups_to_keep$groups)
This was a game changer for me in my thinking. As I’ve used group_by
combined with summarise
more and more, I’ve become amazed at what can be done without having to fully split the data apart to operate on it.
This combined with the use of dplyr::join_
in place of splits (see this other post for an example) is making my code faster, and often easier to reason over. I hope it helps you too!