Comparisons using for loops vs split

TL;DR

Sometimes for loops are useful, and sometimes they shouldn't really be used: they can hide what you are actually doing to your data, and even a carefully written loop may still be slower than other approaches.

Comparing Groups

I have some code where I am trying to find groups of items that are duplicates of each other. The data looks something like this:

create_random_sets = function(n_sets = 1000){
  set.seed(1234)
  
  # generate random sets of 5 values, sampled from progressively larger ranges
  sets = purrr::map(seq(5, n_sets), ~ sample(seq(1, .x), 5))
  
  # assign each of 10000 items one of the sets at random,
  # expanding to one row per item / value pair
  item_sets = sample(seq(1, length(sets)), 10000, replace = TRUE)
  item_mapping = purrr::map2_df(item_sets, seq(1, length(item_sets)), function(.x, .y){
    data.frame(v1 = as.character(.y), v2 = sets[[.x]], stringsAsFactors = FALSE)
  })
  item_mapping
}
library(dplyr)
mapped_items = create_random_sets()

head(mapped_items, 20)
##    v1  v2
## 1   1 496
## 2   1 243
## 3   1 201
## 4   1 514
## 5   1 416
## 6   2  28
## 7   2 189
## 8   2   5
## 9   2 115
## 10  2  20
## 11  3 310
## 12  3 470
## 13  3 667
## 14  3 775
## 15  3 680
## 16  4  89
## 17  4 182
## 18  4 688
## 19  4  31
## 20  4 432
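
Every item in v1 should end up with exactly five values in v2; a quick base R sanity check on the generated data confirms this:

# each v1 item should have exactly 5 rows
all(table(mapped_items$v1) == 5)
## [1] TRUE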

Looping

In this case, every item in v1 has 5 values in v2, and I want to group together the items in v1 that share the same combination of values in v2. My initial function splits v2 by v1 and then compares each split against all of the others, removing entries once they have been compared and found to be the same, and saving the result as it goes. This requires two loops: a while loop that runs as long as there is data left to check, and a for loop that checks everything remaining in the list against the current entry. The list of identical entries is pre-allocated so we don't take a hit on allocation, and anything that has been checked or noted as identical is deleted. Although the variable names are changed, the code for that function is below.

loop_function = function(item_mapping){
  # all of the v2 values belonging to each v1 item
  split_items = split(item_mapping$v2, item_mapping$v1)
  
  # pre-allocate the result list so we don't pay for growing it
  matched_list = vector("list", length(split_items))
  
  save_item = 1
  save_index = 1
  
  # while there is anything left to check ...
  while (length(split_items) > 0) {
    curr_item = names(split_items)[save_item]
    curr_set = split_items[[save_item]]
    
    # ... compare everything still in the list against the current set
    for (i_item in seq_along(split_items)) {
      if (sum(split_items[[i_item]] %in% curr_set) == length(curr_set)) {
        matching_items = unique(c(curr_item, names(split_items)[i_item]))
        save_item = unique(c(save_item, i_item))
      }
    }
    # save the current set, then drop it and everything that matched it
    matched_list[[save_index]] = curr_set
    split_items = split_items[-save_item]
    save_index = save_index + 1
    save_item = 1
  }
  
  # remove the pre-allocated entries that were never filled
  n_in_set = purrr::map_int(matched_list, length)
  matched_list = matched_list[n_in_set > 0]
  n_in_set = n_in_set[n_in_set > 0]
  matched_list
}
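
The test inside the for loop is the %in% comparison. A tiny standalone example (with made-up values) shows that it treats two sets holding the same values in a different order as the same, which is exactly what we want when comparing combinations:

# when both vectors contain the same number of distinct values, this is TRUE
# exactly when they hold the same values, regardless of order
a = c(12, 7, 99, 3, 45)
b = c(3, 99, 45, 12, 7)
sum(a %in% b) == length(b)
## [1] TRUE
setequal(a, b)
## [1] TRUE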

The code works, but it doesn't really make me think about what it's doing: the two loops hide the fact that what is really going on is comparing sets of values to one another. Miles McBain recently posted about exactly this, arguing that loops are sometimes necessary, but that we should ask whether they really are, whether they hide something about the data, and whether there is a different way to do the same thing.

This made me realize that what I really wanted to do was split the items in v1 by the unique combinations of values in v2, because split will group things together nicely for you, without any extra work. But I don't have those combinations in a form that split can use. So my solution is to iterate over the splits using purrr, collapse each combination into a single character value, and then call split again at the very end on that character representation.

split_function = function(item_mapping){
  mapped_data = split(item_mapping$v2, item_mapping$v1) %>%
    purrr::map2_dfr(., names(.), function(.x, .y){
      # collapse each item's set of v2 values into a single character key
      # (the key keeps the original value order; sort(unique(.x)) would make
      # it order-independent)
      set = unique(.x)
      tmp_frame = data.frame(item = .y, set_chr = paste(set, collapse = ","), stringsAsFactors = FALSE)
      tmp_frame$set = list(set)
      tmp_frame
    })
  # items sharing the same key end up in the same list element
  matched_list = split(mapped_data, mapped_data$set_chr)
  matched_list
}
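
As a quick sketch of using the result (the variable names here are just for illustration), the list elements with more than one row are the groups of v1 items that share the same combination of v2 values:

matched = split_function(mapped_items)
n_in_group = purrr::map_int(matched, nrow)
# keep only the groups where several v1 items share a set
duplicate_groups = matched[n_in_group > 1]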

Not only is the code cleaner and the grouping explicit (as long as you know how split works), it's also more than 3x faster, as the benchmark below shows!
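
For anyone less familiar with split, a toy example (made-up data, not from the sets above) shows the grouping it relies on: rows that share the same key end up in the same list element.

toy = data.frame(item = c("a", "b", "c"),
                 set_chr = c("1,2", "3,4", "1,2"),
                 stringsAsFactors = FALSE)
# "a" and "c" share the key "1,2", so they land in the same element of the list
split(toy, toy$set_chr)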

microbenchmark::microbenchmark(
  loop_function(mapped_items),
  split_function(mapped_items),
  times = 5
)
## Unit: seconds
##                          expr      min       lq     mean   median       uq      max neval
##   loop_function(mapped_items) 6.625086 6.755657 6.799823 6.764101 6.784886 7.069387     5
##  split_function(mapped_items) 2.074691 2.077382 2.119430 2.106470 2.146626 2.191984     5
