TL;DR
Sometimes for
loops are useful, and sometimes they shouldn’t really be used, because they don’t really help you understand your data, and even if you try, they
might still be slow(er) than other ways of doing things.
Comparing Groups
I have some code where I am trying to determine duplicates of a group of things. This data looks something like this:
create_random_sets = function(n_sets = 1000){
set.seed(1234)
sets = purrr::map(seq(5, n_sets), ~ sample(seq(1, .x), 5))
item_sets = sample(seq(1, length(sets)), 10000, replace = TRUE)
item_mapping = purrr::map2_df(item_sets, seq(1, length(item_sets)), function(.x, .y){
data.frame(v1 = as.character(.y), v2 = sets[[.x]], stringsAsFactors = FALSE)
})
item_mapping
}
library(dplyr)
mapped_items = create_random_sets()
head(mapped_items, 20)
## v1 v2
## 1 1 496
## 2 1 243
## 3 1 201
## 4 1 514
## 5 1 416
## 6 2 28
## 7 2 189
## 8 2 5
## 9 2 115
## 10 2 20
## 11 3 310
## 12 3 470
## 13 3 667
## 14 3 775
## 15 3 680
## 16 4 89
## 17 4 182
## 18 4 688
## 19 4 31
## 20 4 432
Looping
In this case, every item
in v1
has 5 things in v2
. I really want to group multiple things of v1
that have the same combination of things in v2
. My initial function
to do this split
s everything in v2
by v1
, and then compares all the splits to each other, removing things that have been compared and found to be the same, and saving them
as we go. This required two loops, basically while
there was data to check, check all the other things left in the list against it (the for
).
Pre-initialize the list of things that are identical to each other so we don’t take a hit on allocation, and delete the things that have been checked or noted as identical. Although the variable names are changed, the code for that function is below.
loop_function = function(item_mapping){
split_items = split(item_mapping$v2, item_mapping$v1)
matched_list = vector("list", length(split_items))
save_item = 1
save_index = 1
while (length(split_items) > 0) {
curr_item = names(split_items)[save_item]
curr_set = split_items[[save_item]]
for (i_item in seq_along(split_items)) {
if (sum(split_items[[i_item]] %in% curr_set) == length(curr_set)) {
matching_items = unique(c(curr_item, names(split_items)[i_item]))
save_item = unique(c(save_item, i_item))
}
}
matched_list[[save_index]] = curr_set
split_items = split_items[-save_item]
save_index = save_index + 1
save_item = 1
}
n_in_set = purrr::map_int(matched_list, length)
matched_list = matched_list[n_in_set > 0]
n_in_set = n_in_set[n_in_set > 0]
matched_list
}
The code works, but it doesn’t really make me think about what it’s doing, the two loops hide the fact that what is really going on is comparing things to one another. Miles McBain recently posted on this fact, that loops can be necessary, but one should really think about whether they are really necessary, or do they hide something about the data, and can we think about different ways to do the same thing.
This made me realize that what I really wanted to do was split
the items in v1
by the unique combinations of things in v2
, because split
will group things together nicely
for you, without any extra work. But I don’t have those combinations in a way that split
can use them. So my solution is to iterate over the splits using purrr
,
create a representation of the group as a character value, and then call split
again at the very end based on the character
representation.
split_function = function(item_mapping){
mapped_data = split(item_mapping$v2, item_mapping$v1) %>%
purrr::map2_dfr(., names(.), function(.x, .y){
set = unique(.x)
tmp_frame = data.frame(item = .y, set_chr = paste(set, collapse = ","), stringsAsFactors = FALSE)
tmp_frame$set = list(set)
tmp_frame
})
matched_list = split(mapped_data, mapped_data$set_chr)
}
Not only is the code cleaner, the grouping is explicit (as long as you know how split
works), and its also 4x faster!
microbenchmark::microbenchmark(
loop_function(mapped_items),
split_function(mapped_items),
times = 5
)
## Unit: seconds
## expr min lq mean median uq
## loop_function(mapped_items) 6.625086 6.755657 6.799823 6.764101 6.784886
## split_function(mapped_items) 2.074691 2.077382 2.119430 2.106470 2.146626
## max neval
## 7.069387 5
## 2.191984 5