Getting some speed using `dplyr` joins rather than my more intuitive split -> unsplit pattern.

If you notice yourself using `split` -> `unsplit` / `rbind` on two objects to match items up, maybe you should be using one of `dplyr`'s `*_join` functions instead. Read below for concrete examples.

I have had a lot of calculations lately that involve some sort of normalization or scaling of groups of related values, each group by a different factor. Let's set up an example where we will have `1e5` values in `10` groups, each group of values being normalized by its own factor.

```
library(microbenchmark)
library(profvis)
set.seed(1234)

# 1e5 values assigned randomly to 10 groups, plus one normalization factor per group
n_point <- 1e5
to_normalize <- data.frame(value = rnorm(n_point),
                           group = sample(seq_len(10), n_point, replace = TRUE))
normalization <- data.frame(group = seq_len(10), normalization = rnorm(10))
```
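
To make the structures concrete, we can peek at the first few rows of each data.frame:

```
# inspect the data to be normalized and the per-group factors
head(to_normalize)
head(normalization)
```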

For each `group` in `to_normalize`, we want to apply the normalization factor in `normalization`. In this case, I'm going to do a simple subtraction.

My initial implementation was to iterate over the groups, use `%in%` to match each `group` between the normalization factors and the data to be normalized, and modify the values in place. **Don't do this!!** It was the slowest method I've used in my real package code!

```
match_normalization <- function(normalize_data, normalization_factors){
  use_groups <- normalization_factors$group
  for (igroup in use_groups) {
    # subset the data by group on every iteration and modify it in place
    normalize_data[normalize_data$group %in% igroup, "value"] <-
      normalize_data[normalize_data$group %in% igroup, "value"] -
      normalization_factors[normalization_factors$group %in% igroup, "normalization"]
  }
  normalize_data
}
```

```
micro_results <- summary(microbenchmark(match_normalization(to_normalize, normalization)))
knitr::kable(micro_results)
```

All times reported by `microbenchmark` below are in milliseconds.

expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
match_normalization(to_normalize, normalization) | 33.52854 | 35.79094 | 36.64787 | 36.27017 | 36.78996 | 65.96424 | 100 |

Not bad for the test data. But can we do better?

My next thought was to `split` both objects by their `group`s, iterate over the paired groups using `purrr::map2`, and then recombine the results with `rbind`.

```
split_normalization <- function(normalize_data, normalization_factors){
  # split the factors and the data into per-group lists
  split_norm <- split(normalization_factors$normalization, normalization_factors$group)
  split_data <- split(normalize_data, normalize_data$group)
  # walk the two lists in parallel, subtracting each group's factor
  out_data <- purrr::map2(split_data, split_norm, function(.x, .y){
    .x$value <- .x$value - .y
    .x
  })
  # recombine the per-group pieces into a single data.frame
  do.call(rbind, out_data)
}
```

```
micro_results2 <- summary(microbenchmark(match_normalization(to_normalize, normalization),
                                         split_normalization(to_normalize, normalization)))
knitr::kable(micro_results2)
```

expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
match_normalization(to_normalize, normalization) | 33.71692 | 41.09938 | 43.17703 | 42.23761 | 43.82695 | 91.57123 | 100 |
split_normalization(to_normalize, normalization) | 77.33925 | 85.10808 | 91.23737 | 91.13441 | 94.35732 | 161.79138 | 100 |

My final thought was to join the two data.frames together using `dplyr`, so that each value and its normalization factor are automatically matched up in the same row.

```
join_normalization <- function(normalize_data, normalization_factors){
  # a single join matches every row of the data to its group's factor
  normalize_data <- dplyr::right_join(normalize_data, normalization_factors,
                                      by = "group")
  # the subtraction is then one vectorized operation over the whole data.frame
  normalize_data$value <- normalize_data$value - normalize_data$normalization
  normalize_data[, c("value", "group")]
}
```
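
Before benchmarking all three, it is worth a quick sanity check that they return the same values. A minimal sketch: the methods can return rows in different orders, so each result is sorted into a common order before comparing.

```
res_match <- match_normalization(to_normalize, normalization)
res_split <- split_normalization(to_normalize, normalization)
res_join  <- join_normalization(to_normalize, normalization)

# put each result into a common row order, with a common column
# order and default rownames, before comparing
sort_df <- function(df) {
  df <- df[order(df$group, df$value), c("value", "group")]
  rownames(df) <- NULL
  df
}
all.equal(sort_df(res_match), sort_df(res_split))
all.equal(sort_df(res_match), sort_df(res_join))
```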

```
micro_results3 <- summary(microbenchmark(match_normalization(to_normalize, normalization),
                                         split_normalization(to_normalize, normalization),
                                         join_normalization(to_normalize, normalization)))
knitr::kable(micro_results3)
```

expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
match_normalization(to_normalize, normalization) | 33.974007 | 41.813316 | 45.47936 | 43.089628 | 46.33666 | 104.7553 | 100 |
split_normalization(to_normalize, normalization) | 76.452060 | 85.275248 | 90.35035 | 88.315268 | 92.88813 | 152.2833 | 100 |
join_normalization(to_normalize, normalization) | 7.743028 | 8.386193 | 13.59934 | 8.650142 | 14.31905 | 292.2638 | 100 |

So on my computer, the `split` and `match` implementations are mostly comparable, although on my motivating real-world example I actually got a 3X speedup by using the `split` method. That may be because of issues related to `DataFrame` and matching elements within that structure. The `join` method is 10-14X faster than the others, which is what I've seen in my motivating work. I also think it makes the code easier to read and reason about, because you can see directly in the code what is being subtracted from what.
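
As an aside, if you stay inside `dplyr`, the whole normalization can be written as one pipeline. Here is one possible sketch using `left_join` and `mutate` (I have not benchmarked this version):

```
library(dplyr)

normalized <- to_normalize %>%
  left_join(normalization, by = "group") %>%   # match each value to its group's factor
  mutate(value = value - normalization) %>%    # vectorized subtraction
  select(value, group)                         # drop the factor column again
```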
