Comparisons using for loops vs split

R for-loop split purrr development

for loops often hide the actual logic of your code behind the boilerplate needed to run the loop. Splitting your data with split can often be clearer and faster.

Robert M Flight
2019-02-13

TL;DR

Sometimes for loops are useful, and sometimes they are best avoided: they can obscure what you are actually doing with your data, and even a carefully written loop may still be slower than other ways of doing things.

Comparing Groups

I have some code that tries to find duplicated groups of things. The data looks something like this:

create_random_sets = function(n_sets = 1000){
  set.seed(1234)
  
  sets = purrr::map(seq(5, n_sets), ~ sample(seq(1, .x), 5))
  
  item_sets = sample(seq(1, length(sets)), 10000, replace = TRUE)
  item_mapping = purrr::map2_df(item_sets, seq(1, length(item_sets)), function(.x, .y){
    data.frame(v1 = as.character(.y), v2 = sets[[.x]], stringsAsFactors = FALSE)
  })
  item_mapping
}
library(dplyr)
mapped_items = create_random_sets()

head(mapped_items, 20)
   v1  v2
1   1 375
2   1 255
3   1 268
4   1  52
5   1 241
6   2 143
7   2 401
8   2 127
9   2 372
10  2 100
11  3  62
12  3 109
13  3  72
14  3 390
15  3  94
16  4  57
17  4  55
18  4 147
19  4 236
20  4 120

Looping

In this case, every item in v1 has 5 things in v2. What I really want is to group the items of v1 that have the same combination of things in v2. My initial function splits everything in v2 by v1 and then compares the splits to each other, saving each distinct set as it goes and removing the items that have been compared and found to be the same. This takes two loops: while there is still data to check (the while), compare everything left in the list against the current item (the for). The list of results is pre-initialized so we don’t take a hit on allocation, and items that have been checked or found identical are deleted. Although the variable names are changed, the code for that function is below.

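loop_function = function(item_mapping){
  # one list element per v1 item, holding its v2 values
  split_items = split(item_mapping$v2, item_mapping$v1)
  
  # pre-allocate the output to the largest possible number of distinct sets
  matched_list = vector("list", length(split_items))
  
  save_item = 1
  save_index = 1
  
  while (length(split_items) > 0) {
    curr_item = names(split_items)[save_item]
    curr_set = split_items[[save_item]]
    
    # compare every remaining item (including the current one) to curr_set
    for (i_item in seq_along(split_items)) {
      if (sum(split_items[[i_item]] %in% curr_set) == length(curr_set)) {
        # note: matching_items is overwritten on each match and is not returned
        matching_items = unique(c(curr_item, names(split_items)[i_item]))
        save_item = unique(c(save_item, i_item))
      }
    }
    # store the distinct set, then drop every item that matched it
    matched_list[[save_index]] = curr_set
    split_items = split_items[-save_item]
    save_index = save_index + 1
    save_item = 1
  }
  
  # drop the pre-allocated slots that were never filled
  n_in_set = purrr::map_int(matched_list, length)
  matched_list = matched_list[n_in_set > 0]
  n_in_set = n_in_set[n_in_set > 0]
  matched_list
}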
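
The comparison inside the for loop can be tried in isolation. The sketch below uses made-up toy sets (the names same_set and other_set are illustrative only) and assumes, as in the simulated data, that every set has the same length and no repeated values. Under those assumptions the test behaves like a set-equality check.

```r
# Toy sets (values chosen only for illustration).
curr_set  = c(375, 255, 268, 52, 241)
same_set  = c(52, 241, 375, 255, 268)   # same values, different order
other_set = c(143, 401, 127, 372, 100)  # no values in common

# The test used inside the for loop: count how many values of the
# candidate appear in curr_set and compare to the size of curr_set.
is_same  = sum(same_set %in% curr_set) == length(curr_set)
is_other = sum(other_set %in% curr_set) == length(curr_set)
is_same   # TRUE
is_other  # FALSE

# Base R's setequal() asks the same question directly.
setequal(same_set, curr_set)  # TRUE
```

Because %in% only checks membership, the order of the values does not affect the result.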

The loop version works, but it doesn’t really make me think about what it’s doing: the two loops hide the fact that what is really going on is comparing things to one another. Miles McBain recently posted on this point: loops can be necessary, but it is worth asking whether they really are, whether they hide something about the data, and whether there is a different way to do the same thing.

This made me realize that what I really wanted to do was split the items in v1 by the unique combinations of things in v2, because split will group things together nicely for you, without any extra work. But I don’t have those combinations in a form that split can use. So my solution is to iterate over the splits using purrr, create a representation of each group as a character value, and then call split again at the very end on that character representation.
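
To make the idea concrete, here is a minimal sketch on toy data. The names toy_sets, toy_keys and toy_groups are made up for illustration, and the sketch uses base R's vapply rather than purrr so that it runs without extra packages.

```r
# Toy data (hypothetical): four items, each mapped to a set of values.
toy_sets = list(
  a = c(1, 2, 3),
  b = c(4, 5, 6),
  c = c(1, 2, 3),
  d = c(4, 5, 6)
)

# Collapse each set into a single character key.
toy_keys = vapply(toy_sets, function(x) paste(unique(x), collapse = ","), character(1))

# split() groups the item names by their key, with no explicit loop.
toy_groups = split(names(toy_keys), toy_keys)
toy_groups
# $`1,2,3`
# [1] "a" "c"
#
# $`4,5,6`
# [1] "b" "d"
```

The full function applies the same idea to the real data, keeping the set itself alongside its character key.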

split_function = function(item_mapping){
  # one row per v1 item: its name, a character key for its set, and the set itself
  mapped_data = split(item_mapping$v2, item_mapping$v1) %>%
    purrr::map2_dfr(., names(.), function(.x, .y){
      set = unique(.x)
      tmp_frame = data.frame(item = .y, set_chr = paste(set, collapse = ","), stringsAsFactors = FALSE)
      tmp_frame$set = list(set)
      tmp_frame
    })
  # group the rows that share a key, and return the result visibly
  matched_list = split(mapped_data, mapped_data$set_chr)
  matched_list
}

Not only is the code cleaner, the grouping is explicit (as long as you know how split works), and it’s also about 3x faster (a median of roughly 2.6 seconds versus 7.9 seconds)!

microbenchmark::microbenchmark(
  loop_function(mapped_items),
  split_function(mapped_items),
  times = 5
)
Unit: seconds
                         expr      min       lq     mean   median       uq      max neval
  loop_function(mapped_items) 7.679313 7.897748 7.956645 7.907971 8.117534 8.180659     5
 split_function(mapped_items) 2.518144 2.597200 2.654516 2.623375 2.722928 2.810932     5
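
One difference between the two functions is worth keeping in mind. The loop's membership test ignores the order of the values, while a key built with paste depends on it. In the simulated data every item inherits its set in the same order, so this does not matter here. If order could vary, a sketch like the following might help. The helper set_key is hypothetical and not part of the original code.

```r
# Hypothetical helper: build a key that ignores element order
# by sorting the unique values before pasting them together.
set_key = function(x) paste(sort(unique(x)), collapse = ",")

# Same values in a different order give the same key.
sorted_match = set_key(c(3, 1, 2)) == set_key(c(1, 2, 3))
sorted_match  # TRUE

# Without sorting, the keys differ ("3,1,2" vs "1,2,3").
unsorted_match = paste(c(3, 1, 2), collapse = ",") == paste(c(1, 2, 3), collapse = ",")
unsorted_match  # FALSE
```

Sorting adds a little work per item, so it is only worth doing when the order of values within a set is not guaranteed.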

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/rmflight/researchBlog_distill, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Flight (2019, Feb. 13). Deciphering Life: One Bit at a Time: Comparisons using for loops vs split. Retrieved from https://rmflight.github.io/posts/2019-02-13-for-loops-vs-split/

BibTeX citation

@misc{flight2019comparisons,
  author = {Flight, Robert M},
  title = {Deciphering Life: One Bit at a Time: Comparisons using for loops vs split},
  url = {https://rmflight.github.io/posts/2019-02-13-for-loops-vs-split/},
  year = {2019}
}