Using group_by Instead of Splits

R dplyr split group-by programming development

How to use group_by instead of splits to summarize things.

Robert M Flight
2020-02-25

TL;DR

It is relatively easy to use dplyr::group_by and summarise to find items that you might want to keep or remove based on a part of the item or group in question. I used to split and iterate, but group_by is much easier.

Motivation

I have some relatively large sets of data that fall naturally into groups of items. Often, I want to remove a group that contains any of, or all of, a particular set of items. Let’s create some data as an example.

library(dplyr)
set.seed(1234)
groups = as.character(seq(1, 1000))
grouped_data = data.frame(items = sample(letters, 10000, replace = TRUE),
                          groups = sample(groups, 10000, replace = TRUE),
                          stringsAsFactors = FALSE)

knitr::kable(head(grouped_data))
items  groups
p      891
z      646
v      795
e      49
l      19
o      796

In this example, each row is one of the 26 lowercase letters, assigned to one of the groups 1-1000. Now, we might want to keep any groups that contain at least one “a”, for example.

Previously, I would have split the data on the groups, used purrr::map_lgl to return TRUE or FALSE for whether each group contained what we wanted to filter on, kept only the matching pieces of the split, and finally bound everything back together.
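For comparison, here is a minimal sketch of what that split-and-iterate approach might look like. It runs on a small made-up data frame (toy_data is hypothetical and only stands in for grouped_data), and the variable names are my own choices for illustration rather than code from the original workflow.

```r
library(dplyr)
library(purrr)

# small made-up data set (hypothetical, stands in for grouped_data)
toy_data = data.frame(items  = c("a", "b", "c", "d", "a", "e"),
                      groups = c("1", "1", "2", "2", "3", "3"),
                      stringsAsFactors = FALSE)

# split into a list of data.frames, one per group
split_data = split(toy_data, toy_data$groups)

# one TRUE / FALSE per group: does it contain at least one "a"?
has_a = map_lgl(split_data, function(in_group) any(in_group$items == "a"))

# keep only the matching groups, then bind them back together
kept_data = bind_rows(split_data[has_a])
```

Note that the rows come back ordered by group rather than in their original order, which is one more thing to keep track of with this approach.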

Group By

What I’ve found instead is that I can use a combination of group_by, summarise and then filter to the same effect, without splitting and iterating (yes, I know dplyr is doing it under the hood for me).

# use group_by and summarize to find things we want
groups_to_keep = grouped_data %>% 
  group_by(groups) %>%
  summarise(has_a = any(items %in% "a")) %>%
  filter(has_a)

# filter on original based on above
grouped_data2 = grouped_data %>%
  filter(groups %in% groups_to_keep$groups)
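As a variation on the two-step version above, the same decision can probably be made directly inside a grouped filter, skipping the intermediate summary table. This is a sketch on a small made-up data frame (toy_data is hypothetical), not something taken from the original analysis. It also shows the "all of" case by removing groups made up entirely of "a".

```r
library(dplyr)

# small made-up data set (hypothetical): group 1 has one "a",
# group 2 has none, group 3 is entirely "a"
toy_data = data.frame(items  = c("a", "b", "c", "d", "a", "a"),
                      groups = c("1", "1", "2", "2", "3", "3"),
                      stringsAsFactors = FALSE)

# keep every row of any group containing at least one "a"
any_a = toy_data %>%
  group_by(groups) %>%
  filter(any(items == "a")) %>%
  ungroup()

# remove groups where every item is "a"
not_all_a = toy_data %>%
  group_by(groups) %>%
  filter(!all(items == "a")) %>%
  ungroup()
```

Here the condition evaluates to a single TRUE or FALSE per group, which filter applies to all rows of that group. The two-step version is still handy when the list of groups to keep is needed on its own.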

This was a game changer for me in my thinking. As I’ve used group_by combined with summarise more and more, I’ve become amazed at what can be done without having to fully split the data apart to operate on it.

This, combined with the use of dplyr’s *_join functions in place of splits (see this other post for an example), is making my code faster, and often easier to reason about. I hope it helps you too!
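As one possible illustration of joins in this setting (a sketch under my own assumptions, and not necessarily what the other post does), the final filter step can be written with dplyr's filtering joins, semi_join and anti_join. The toy_data frame is again hypothetical.

```r
library(dplyr)

# small made-up data set (hypothetical, stands in for grouped_data)
toy_data = data.frame(items  = c("a", "b", "c", "d", "a", "e"),
                      groups = c("1", "1", "2", "2", "3", "3"),
                      stringsAsFactors = FALSE)

# one row per group that contains at least one "a"
groups_to_keep = toy_data %>%
  group_by(groups) %>%
  summarise(has_a = any(items == "a")) %>%
  filter(has_a)

# semi_join keeps rows of toy_data whose group appears in groups_to_keep
kept_data = semi_join(toy_data, groups_to_keep, by = "groups")

# anti_join keeps the rows whose group does not appear there
dropped_data = anti_join(toy_data, groups_to_keep, by = "groups")
```

Both joins return only the columns of the first data frame, so has_a does not leak into the result.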

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/rmflight/researchBlog_distill, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Flight (2020, Feb. 25). Deciphering Life: One Bit at a Time: Using group_by Instead of Splits. Retrieved from https://rmflight.github.io/posts/2020-02-25-using-group-by-instead-of-splitting/

BibTeX citation

@misc{flight2020using,
  author = {Flight, Robert M},
  title = {Deciphering Life: One Bit at a Time: Using group_by Instead of Splits},
  url = {https://rmflight.github.io/posts/2020-02-25-using-group-by-instead-of-splitting/},
  year = {2020}
}