Using group_by Instead of Splits

2020-02-25 2 min read

TL;DR

It is relatively easy to use dplyr::group_by and summarise to find items that you might want to keep or remove based on a part_of the item or group in question. I used to use split and iterate, but group_by is much easier.

Motivation

I have some relatively large sets of data that fall naturally into groups of items. Often, I find that I want to remove a group that contains either any of or all of particular items. Let’s create some data as an example.

library(dplyr)
set.seed(1234)
groups = as.character(seq(1, 1000))
grouped_data = data.frame(items = sample(letters, 10000, replace = TRUE),
                          groups = sample(groups, 10000, replace = TRUE),
                          stringsAsFactors = FALSE)

knitr::kable(head(grouped_data))

items	groups
c	205
q	485
p	636
q	960
w	179
q	289

In this example, we have the 26 lowercase letters, that are part of one of groups 1-1000. Now, we might want to keep any groups that contain at least one “a”, for example.

I would have previously used a split on the groups, and then purrr::map_lgl returning TRUE or FALSE to check if what we wanted to filter on was present, and then filter out the split groups, and finally put back together the full thing.

Group By

What I’ve found instead is that I can use a combination of group_by, summarise and then filter to same effect, without splitting and iterating (yes, I know dplyr is doing it under the hood for me).

# use group_by and summarize to find things we want
groups_to_keep = grouped_data %>% 
  group_by(groups) %>%
  summarise(has_a = sum(items %in% "a") > 0) %>%
  filter(has_a)

# filter on original based on above
grouped_data2 = grouped_data %>%
  filter(groups %in% groups_to_keep$groups)

This was a game changer for me in my thinking. As I’ve used group_by combined with summarise more and more, I’ve become amazed at what can be done without having to fully split the data apart to operate on it.

This combined with the use of dplyr::join_ in place of splits (see this other post for an example) is making my code faster, and often easier to reason over. I hope it helps you too!

R dplyr

Using group_by Instead of Splits

TL;DR

Motivation

Group By

Related