It is relatively easy to use `summarise` to find items that you might want to keep or remove based on part of the item or group in question. I used to use `split` and iterate, but `group_by` is much easier.
I have some relatively large sets of data that fall naturally into groups of items. Often, I find that I want to remove a group that contains any of, or all of, particular items. Let’s create some data as an example.
```r
library(dplyr)
set.seed(1234)

groups = as.character(seq(1, 1000))
grouped_data = data.frame(items = sample(letters, 10000, replace = TRUE),
                          groups = sample(groups, 10000, replace = TRUE),
                          stringsAsFactors = FALSE)
knitr::kable(head(grouped_data))
```
In this example, we have the 26 lowercase letters, each assigned to one of groups 1-1000. Now, we might want to keep any groups that contain at least one “a”, for example.
Previously, I would have used `split` on the groups, then `purrr::map_lgl` to return `TRUE` or `FALSE` for whether the item we wanted to filter on was present in each piece, then dropped the pieces that failed the check, and finally put the whole thing back together.
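For reference, a minimal sketch of that older split-and-iterate workflow might look like this (the `split_data` and `keep` names here are just illustrative):

```r
library(purrr)

# split the data into one data.frame per group
split_data = split(grouped_data, grouped_data$groups)

# check each piece for at least one "a"
keep = map_lgl(split_data, function(piece) any(piece$items %in% "a"))

# keep the passing pieces and reassemble
grouped_data2 = dplyr::bind_rows(split_data[keep])
```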
What I’ve found instead is that I can use a combination of `summarise` and then `filter` to the same effect, without splitting and iterating myself (yes, I know `dplyr` is doing it under the hood for me).
```r
# use group_by and summarise to find the groups we want to keep
groups_to_keep = grouped_data %>%
  group_by(groups) %>%
  summarise(has_a = sum(items %in% "a") > 0) %>%
  filter(has_a)

# filter the original data based on the groups found above
grouped_data2 = grouped_data %>%
  filter(groups %in% groups_to_keep$groups)
```
This was a game changer for my thinking. As I’ve used `group_by` combined with `summarise` more and more, I’ve been amazed at what can be done without having to fully split the data apart to operate on it.
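The same pattern also handles the “all of” case mentioned earlier. As a sketch, assuming we want only groups containing every one of “a”, “b”, and “c” (the `has_all` and `grouped_data3` names are illustrative):

```r
# keep only groups that contain all of "a", "b", and "c"
groups_with_all = grouped_data %>%
  group_by(groups) %>%
  summarise(has_all = all(c("a", "b", "c") %in% items)) %>%
  filter(has_all)

grouped_data3 = grouped_data %>%
  filter(groups %in% groups_with_all$groups)
```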
This, combined with the use of the `dplyr` `*_join` functions in place of splits (see this other post for an example), is making my code faster, and often easier to reason over. I hope it helps you too!
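P.S. As a quick illustration of the join idea, the final filtering step above can be written with `semi_join` instead of `%in%` (a sketch of the idea, not necessarily what the linked post does):

```r
# semi_join keeps only the rows of grouped_data whose "groups" value
# also appears in groups_to_keep
grouped_data2 = grouped_data %>%
  semi_join(groups_to_keep, by = "groups")
```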