Split - Unsplit Anti-Pattern

Getting more speed from dplyr joins than from my more intuitive split -> unsplit pattern.

Robert M Flight
2018-07-17

TL;DR

If you notice yourself using split -> unsplit / rbind on two objects to match items up, maybe you should be using a dplyr join (for example dplyr::left_join or dplyr::right_join) instead. Read below for concrete examples.

Motivation

I have had a lot of calculations lately that involve normalizing or scaling groups of related values, each group by a different factor.

Let's set up an example where we have 1e5 values in 10 groups, with each group of values normalized by its own factor.

library(microbenchmark)
library(profvis)
set.seed(1234)
n_point <- 1e5
to_normalize <- data.frame(value = rnorm(n_point), group = sample(seq_len(10), n_point, replace = TRUE))

normalization <- data.frame(group = seq_len(10), normalization = rnorm(10))

For each group in to_normalize, we want to apply the normalization factor in normalization. In this case, I’m going to do a simple subtraction.
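To make the target operation concrete, here is a minimal sketch on toy data. The numbers and the names `toy_values`, `toy_factors`, and `row_factor` are made up for illustration and are not from the benchmark; the lookup uses base R's `match()`, which is just one way to line up each row with its group's factor.

```r
# Toy data: five values in two groups (hypothetical numbers)
toy_values <- data.frame(value = c(10, 20, 30, 40, 50),
                         group = c(1, 2, 1, 2, 1))
toy_factors <- data.frame(group = c(1, 2), normalization = c(1, 5))

# For every row, find the position of its group in toy_factors,
# then pull out the corresponding normalization factor
row_factor <- toy_factors$normalization[match(toy_values$group, toy_factors$group)]

# The normalization itself is a simple subtraction
toy_values$normalized <- toy_values$value - row_factor
toy_values$normalized
# 9 15 29 35 49
```

Whatever implementation is used, this is the result it should reproduce: group 1 values have 1 subtracted, group 2 values have 5 subtracted.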

Match Them!

My initial implementation was to iterate over the groups, and use %in% to match each group from the normalization factors and the data to be normalized, and modify in place. Don’t do this!! It was the slowest method I’ve used in my real package code!

match_normalization <- function(normalize_data, normalization_factors){
  use_groups <- normalization_factors$group

  # for each group, subtract that group's factor from the matching rows
  for (igroup in use_groups) {
    normalize_data[normalize_data$group %in% igroup, "value"] <-
      normalize_data[normalize_data$group %in% igroup, "value"] - normalization_factors[normalization_factors$group %in% igroup, "normalization"]
  }
  normalize_data
}
micro_results <- summary(microbenchmark(match_normalization(to_normalize, normalization)))
knitr::kable(micro_results)
| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| match_normalization(to_normalize, normalization) | 33.52854 | 35.79094 | 36.64787 | 36.27017 | 36.78996 | 65.96424 | 100 |

Not bad for the test data. But can we do better?

Split Them!

My next thought was to split both objects by their groups, iterate over the matched pieces using purrr::map2, and then bind the pieces back together with rbind.

split_normalization <- function(normalize_data, normalization_factors){
  # one normalization factor per group
  split_norm <- split(normalization_factors$normalization, normalization_factors$group)

  # one data.frame per group
  split_data <- split(normalize_data, normalize_data$group)

  out_data <- purrr::map2(split_data, split_norm, function(.x, .y){
    .x$value <- .x$value - .y
    .x
  })
  do.call(rbind, out_data)
}
micro_results2 <- summary(microbenchmark(match_normalization(to_normalize, normalization),
split_normalization(to_normalize, normalization)))
knitr::kable(micro_results2)
| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| match_normalization(to_normalize, normalization) | 33.71692 | 41.09938 | 43.17703 | 42.23761 | 43.82695 | 91.57123 | 100 |
| split_normalization(to_normalize, normalization) | 77.33925 | 85.10808 | 91.23737 | 91.13441 | 94.35732 | 161.79138 | 100 |
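One side effect of the split approach is worth seeing on a small example: binding the pieces with rbind returns the rows grouped by group, not in their original order. The sketch below uses made-up toy data and base R's `Map()` in place of `purrr::map2` (a substitution, purely to keep the example dependency-free), and shows that base R's `unsplit()` puts the rows back where they started.

```r
# Toy data: five values in two groups (hypothetical numbers)
toy_values <- data.frame(value = c(10, 20, 30, 40, 50),
                         group = c(1, 2, 1, 2, 1))
toy_factors <- data.frame(group = c(1, 2), normalization = c(1, 5))

# Split the factors and the data by group, as in split_normalization
split_norm <- split(toy_factors$normalization, toy_factors$group)
split_data <- split(toy_values, toy_values$group)

# Base R Map() stands in for purrr::map2 here
out_data <- Map(function(.x, .y) {
  .x$value <- .x$value - .y
  .x
}, split_data, split_norm)

# rbind stacks the groups one after another: all of group 1, then group 2
bound <- do.call(rbind, out_data)
bound$value
# 9 29 49 15 35

# unsplit() reverses the split and restores the original row order
restored <- unsplit(out_data, toy_values$group)
restored$value
# 9 15 29 35 49
```

If downstream code relies on row order, this difference matters when comparing the split implementation against the others.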

Join Them!

My final thought was to join the two data.frames together using dplyr, so that each value is automatically matched up with its group's normalization factor.

join_normalization <- function(normalize_data, normalization_factors){
  # after the join, every row carries its group's normalization factor
  normalize_data <- dplyr::right_join(normalize_data, normalization_factors,
                                      by = "group")

  normalize_data$value <- normalize_data$value - normalize_data$normalization
  normalize_data[, c("value", "group")]
}
micro_results3 <- summary(microbenchmark(match_normalization(to_normalize, normalization),
split_normalization(to_normalize, normalization),
join_normalization(to_normalize, normalization)))
knitr::kable(micro_results3)
| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| match_normalization(to_normalize, normalization) | 33.974007 | 41.813316 | 45.47936 | 43.089628 | 46.33666 | 104.7553 | 100 |
| split_normalization(to_normalize, normalization) | 76.452060 | 85.275248 | 90.35035 | 88.315268 | 92.88813 | 152.2833 | 100 |
| join_normalization(to_normalize, normalization) | 7.743028 | 8.386193 | 13.59934 | 8.650142 | 14.31905 | 292.2638 | 100 |
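To see what the join is doing, here is a small sketch on made-up toy data. Note that it uses `dplyr::left_join` rather than the `right_join` used above: that is a deliberate swap for this illustration, since a left join keeps every row of the data in its original order even when a group has no factor. The names `toy_values`, `toy_factors`, and `joined` are hypothetical.

```r
library(dplyr)

# Toy data: five values in two groups (hypothetical numbers)
toy_values <- data.frame(value = c(10, 20, 30, 40, 50),
                         group = c(1, 2, 1, 2, 1))
toy_factors <- data.frame(group = c(1, 2), normalization = c(1, 5))

# The join adds a normalization column, matched to each row by group
joined <- left_join(toy_values, toy_factors, by = "group")
joined$normalization
# 1 5 1 5 1

# With the factor sitting next to each value, the subtraction is one line
joined$value <- joined$value - joined$normalization
joined[, c("value", "group")]$value
# 9 15 29 35 49
```

The matching that the loop and split versions do by hand happens once inside the join, and the subtraction becomes a single vectorized operation over whole columns.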

Conclusions

So on my computer, with this test data, the match implementation is actually about 2X faster than the split implementation by median time, although on my motivating real world example I got a 3X speedup by using the split method. That may be because of issues related to DataFrame and matching elements within that structure. The join method is the fastest here, roughly 5X faster than match and 10X faster than split by median time, and in my motivating work I've seen speedups of 10-14X. I also think it makes the code easier to read and reason about, because you can see directly in the code what is being subtracted from what.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/rmflight/researchBlog_distill, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".