Deciphering Life: One Bit at a Time - Comparisons using for loops vs split

TL;DR

Sometimes for loops are useful, and sometimes they shouldn’t really be used, because they don’t really help you understand your data, and even if you try, they might still be slow(er) than other ways of doing things.

Comparing Groups

I have some code where I am trying to determine duplicates of a group of things. This data looks something like this:

create_random_sets = function(n_sets = 1000){
  set.seed(1234)
  
  sets = purrr::map(seq(5, n_sets), ~ sample(seq(1, .x), 5))
  
  item_sets = sample(seq(1, length(sets)), 10000, replace = TRUE)
  item_mapping = purrr::map2_df(item_sets, seq(1, length(item_sets)), function(.x, .y){
    data.frame(v1 = as.character(.y), v2 = sets[[.x]], stringsAsFactors = FALSE)
  })
  item_mapping
}
library(dplyr)
mapped_items = create_random_sets()

head(mapped_items, 20)

Looping

In this case, every item in v1 has 5 things in v2. I really want to group multiple things of v1 that have the same combination of things in v2. My initial function to do this splits everything in v2 by v1, and then compares all the splits to each other, removing things that have been compared and found to be the same, and saving them as we go. This required two loops, basically while there was data to check, check all the other things left in the list against it (the for). Pre-initialize the list of things that are identical to each other so we don’t take a hit on allocation, and delete the things that have been checked or noted as identical. Although the variable names are changed, the code for that function is below.

loop_function = function(item_mapping){
  split_items = split(item_mapping$v2, item_mapping$v1)
  
  matched_list = vector("list", length(split_items))
  
  save_item = 1
  save_index = 1
  
  while (length(split_items) > 0) {
    curr_item = names(split_items)[save_item]
    curr_set = split_items[[save_item]]
    
    for (i_item in seq_along(split_items)) {
      if (sum(split_items[[i_item]] %in% curr_set) == length(curr_set)) {
        matching_items = unique(c(curr_item, names(split_items)[i_item]))
        save_item = unique(c(save_item, i_item))
      }
    }
    matched_list[[save_index]] = curr_set
    split_items = split_items[-save_item]
    save_index = save_index + 1
    save_item = 1
  }
  
  n_in_set = purrr::map_int(matched_list, length)
  matched_list = matched_list[n_in_set > 0]
  n_in_set = n_in_set[n_in_set > 0]
  matched_list
}

The code works, but it doesn’t really make me think about what it’s doing, the two loops hide the fact that what is really going on is comparing things to one another. Miles McBain recently posted on this fact, that loops can be necessary, but one should really think about whether they are really necessary, or do they hide something about the data, and can we think about different ways to do the same thing.

This made me realize that what I really wanted to do was split the items in v1 by the unique combinations of things in v2, because split will group things together nicely for you, without any extra work. But I don’t have those combinations in a way that split can use them. So my solution is to iterate over the splits using purrr, create a representation of the group as a character value, and then call split again at the very end based on the character representation.

split_function = function(item_mapping){
  mapped_data = split(item_mapping$v2, item_mapping$v1) %>%
    purrr::map2_dfr(., names(.), function(.x, .y){
      set = unique(.x)
      tmp_frame = data.frame(item = .y, set_chr = paste(set, collapse = ","), stringsAsFactors = FALSE)
      tmp_frame$set = list(set)
      tmp_frame
    })
  matched_list = split(mapped_data, mapped_data$set_chr)
}

Not only is the code cleaner, the grouping is explicit (as long as you know how split works), and its also 4x faster!

microbenchmark::microbenchmark(
  loop_function(mapped_items),
  split_function(mapped_items),
  times = 5
)

Unit: seconds
                         expr       min        lq      mean    median        uq
  loop_function(mapped_items) 10.672729 10.694685 10.914005 10.872513 10.984026
 split_function(mapped_items)  3.011281  3.013769  3.082842  3.089458  3.131173
      max neval
 11.34607     5
  3.16853     5

Reuse

https://creativecommons.org/licenses/by/4.0/

Citation

BibTeX citation:

@online{mflight2019,
  author = {Robert M Flight},
  title = {Comparisons Using for Loops Vs Split},
  date = {2019-02-13},
  url = {https://rmflight.github.io/posts/2019-02-13-for-loops-vs-split},
  langid = {en}
}

For attribution, please cite this work as:

Robert M Flight. 2019. “Comparisons Using for Loops Vs Split.” February 13, 2019. https://rmflight.github.io/posts/2019-02-13-for-loops-vs-split.