How to use group_by instead of split’s to summarize things.
R
dplyr
split
group-by
programming
development
Author
Robert M Flight
Published
February 25, 2020
TL;DR
It is relatively easy to use dplyr::group_by and summarise to find items that you might want to keep or remove based on a part_of the item or group in question. I used to use split and iterate, but group_by is much easier.
Motivation
I have some relatively large sets of data that fall naturally into groups of items. Often, I find that I want to remove a group that contains either any of or all of particular items. Let’s create some data as an example.
In this example, we have the 26 lowercase letters, that are part of one of groups 1-1000. Now, we might want to keep any groups that contain at least one “a”, for example.
I would have previously used a split on the groups, and then purrr::map_lgl returning TRUE or FALSE to check if what we wanted to filter on was present, and then filter out the split groups, and finally put back together the full thing.
Group By
What I’ve found instead is that I can use a combination of group_by, summarise and then filter to same effect, without splitting and iterating (yes, I know dplyr is doing it under the hood for me).
# use group_by and summarize to find things we wantgroups_to_keep = grouped_data %>%group_by(groups) %>%summarise(has_a =sum(items %in%"a") >0) %>%filter(has_a)# filter on original based on abovegrouped_data2 = grouped_data %>%filter(groups %in% groups_to_keep$groups)
This was a game changer for me in my thinking. As I’ve used group_by combined with summarise more and more, I’ve become amazed at what can be done without having to fully split the data apart to operate on it.
This combined with the use of dplyr::join_ in place of splits (see this other post for an example) is making my code faster, and often easier to reason over. I hope it helps you too!