# Split - Unsplit Anti-Pattern

Getting more speed from a `dplyr` join than from my more intuitive split -> unsplit pattern.

R
development
programming
purrr
dplyr
join

## TL;DR

If you notice yourself using `split` -> `unsplit` / `rbind` on two objects to match items up, maybe you should be using one of the `dplyr` join functions (such as `dplyr::right_join`) instead. Read below for concrete examples.

## Motivation

Lately I have had a lot of calculations that involve some sort of normalization or scaling of groups of related values, each group by a different factor.

Let's set up an example where we have `1e5` values in `10` groups, with each group of values being normalized by its own value.

```r
library(microbenchmark)
library(profvis)
set.seed(1234)
n_point <- 1e5
to_normalize <- data.frame(value = rnorm(n_point),
                           group = sample(seq_len(10), n_point, replace = TRUE))

normalization <- data.frame(group = seq_len(10), normalization = rnorm(10))
```

For each `group` in `to_normalize`, we want to apply the normalization factor in `normalization`. In this case, I’m going to do a simple subtraction.
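
To make the target concrete, here is a minimal sketch on a toy data set. The `toy_data` and `toy_norm` names and their numbers are made up for illustration, and the `match()` lookup is only a compact way to show the expected result, not one of the methods benchmarked below.

```r
# Toy data: four values in two groups (hypothetical numbers for illustration)
toy_data <- data.frame(value = c(10, 20, 30, 40), group = c(1L, 2L, 1L, 2L))
toy_norm <- data.frame(group = c(1L, 2L), normalization = c(1, 5))

# Look up each row's normalization factor by its group, then subtract it
row_factor <- toy_norm$normalization[match(toy_data$group, toy_norm$group)]
toy_data$normalized <- toy_data$value - row_factor

# Rows in group 1 lose 1, rows in group 2 lose 5
toy_data$normalized
# 9 15 29 35
```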

## Match Them!

My initial implementation was to iterate over the groups, and use `%in%` to `match` each `group` from the normalization factors and the data to be normalized, and modify in place. Don’t do this!! It was the slowest method I’ve used in my real package code!

```r
match_normalization <- function(normalize_data, normalization_factors){
  use_groups <- normalization_factors$group

  for (igroup in use_groups) {
    normalize_data[normalize_data$group %in% igroup, "value"] <-
      normalize_data[normalize_data$group %in% igroup, "value"] -
      normalization_factors[normalization_factors$group %in% igroup, "normalization"]
  }
  normalize_data
}
```

```r
micro_results <- summary(microbenchmark(match_normalization(to_normalize, normalization)))
knitr::kable(micro_results)
```

| expr | min | lq | mean | median | uq | max | neval |
|:--|--:|--:|--:|--:|--:|--:|--:|
| match_normalization(to_normalize, normalization) | 49.75035 | 53.45793 | 57.42852 | 54.73885 | 57.50818 | 115.9432 | 100 |

Not bad for the test data. But can we do better?

## Split Them!

My next thought was to split them by their `group`s, iterate over the groups using `purrr::map2`, and then bind the pieces back together with `rbind`.

```r
split_normalization <- function(normalize_data, normalization_factors){
  split_norm <- split(normalization_factors$normalization, normalization_factors$group)

  split_data <- split(normalize_data, normalize_data$group)

  out_data <- purrr::map2(split_data, split_norm, function(.x, .y){
    .x$value <- .x$value - .y
    .x
  })
  do.call(rbind, out_data)
}
```

```r
micro_results2 <- summary(microbenchmark(match_normalization(to_normalize, normalization),
                                         split_normalization(to_normalize, normalization)))
knitr::kable(micro_results2)
```

| expr | min | lq | mean | median | uq | max | neval |
|:--|--:|--:|--:|--:|--:|--:|--:|
| match_normalization(to_normalize, normalization) | 54.09506 | 69.0412 | 75.46899 | 73.44315 | 78.03191 | 160.8331 | 100 |
| split_normalization(to_normalize, normalization) | 103.18837 | 133.9091 | 141.51288 | 142.00616 | 148.73655 | 197.9781 | 100 |

## Join Them!

My final thought was to join the two data.frames together using `dplyr`, so that the rows are automatically matched up.

```r
join_normalization <- function(normalize_data, normalization_factors){
  normalize_data <- dplyr::right_join(normalize_data, normalization_factors,
                                      by = "group")

  normalize_data$value <- normalize_data$value - normalize_data$normalization
  normalize_data[, c("value", "group")]
}
```

```r
micro_results3 <- summary(microbenchmark(match_normalization(to_normalize, normalization),
                                         split_normalization(to_normalize, normalization),
                                         join_normalization(to_normalize, normalization)))
knitr::kable(micro_results3)
```
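
One thing worth keeping in mind about the choice of join: the sketch below, on made-up toy data (`data_3groups` and `norm_2groups` are hypothetical names, and `dplyr` is assumed to be installed), illustrates how `right_join` and `left_join` differ when a group in the data has no normalization factor. In the benchmark data every group has a factor, so this does not affect the timings.

```r
# Hypothetical toy data: group 3 has no normalization factor
data_3groups <- data.frame(value = c(10, 20, 30), group = c(1L, 2L, 3L))
norm_2groups <- data.frame(group = c(1L, 2L), normalization = c(1, 5))

# right_join keeps every row of norm_2groups, so the group 3 row is dropped
right_joined <- dplyr::right_join(data_3groups, norm_2groups, by = "group")
nrow(right_joined)

# left_join keeps every row of data_3groups; group 3 gets an NA factor
left_joined <- dplyr::left_join(data_3groups, norm_2groups, by = "group")
nrow(left_joined)
is.na(left_joined$normalization)
```

Which behavior is preferable depends on whether unmatched rows should vanish or show up as `NA` after the subtraction.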

| expr | min | lq | mean | median | uq | max | neval |
|:--|--:|--:|--:|--:|--:|--:|--:|
| match_normalization(to_normalize, normalization) | 56.04426 | 68.50138 | 77.23205 | 74.45665 | 82.25683 | 184.9358 | 100 |
| split_normalization(to_normalize, normalization) | 106.96651 | 135.82308 | 148.88644 | 145.27503 | 154.41702 | 284.8173 | 100 |
| join_normalization(to_normalize, normalization) | 11.60424 | 13.26513 | 21.29059 | 14.18384 | 15.38526 | 488.3713 | 100 |
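
Timing aside, it is worth checking that the three approaches actually produce the same values. The following is a minimal, self-contained sketch on a small made-up data set (`small_data`, `small_norm`, and `sort_rows` are hypothetical names introduced here, and `dplyr` and `purrr` are assumed to be installed). It inlines the same three strategies rather than calling the functions above, and sorts the rows before comparing because the methods do not all return rows in the input order.

```r
set.seed(1234)
# Small hypothetical data set; rep() guarantees every group appears
small_data <- data.frame(value = rnorm(20),
                         group = sample(rep(seq_len(4), times = 5)))
small_norm <- data.frame(group = seq_len(4), normalization = rnorm(4))

# match-style: loop over the groups and modify in place
by_match <- small_data
for (igroup in small_norm$group) {
  in_group <- by_match$group %in% igroup
  by_match$value[in_group] <- by_match$value[in_group] -
    small_norm$normalization[small_norm$group %in% igroup]
}

# split-style: split by group, subtract per group, bind back together
by_split <- do.call(rbind, purrr::map2(
  split(small_data, small_data$group),
  split(small_norm$normalization, small_norm$group),
  function(.x, .y) {
    .x$value <- .x$value - .y
    .x
  }
))

# join-style: attach the factor to every row, then subtract
by_join <- dplyr::right_join(small_data, small_norm, by = "group")
by_join$value <- by_join$value - by_join$normalization
by_join <- by_join[, c("value", "group")]

# Row order and row names differ between methods, so normalize both first
sort_rows <- function(df) {
  df <- df[order(df$group, df$value), c("value", "group")]
  rownames(df) <- NULL
  df
}

all_agree <- isTRUE(all.equal(sort_rows(by_match), sort_rows(by_split))) &&
  isTRUE(all.equal(sort_rows(by_match), sort_rows(by_join)))
all_agree
```

If `all_agree` comes back `TRUE`, the three strategies differ only in row order and row names, so any code that relies on the original row order would need to carry an explicit row identifier through the `split` and `join` versions.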

## Conclusions

So on my computer, the `match` and `split` implementations are within about a factor of two of each other on this test data (with `split` the slower of the two), although on my motivating real-world example I actually got a 3X speedup by using the `split` method. That may be because of issues related to `DataFrame` and matching elements within that structure. By median time in the last table, the `join` method is roughly 5X faster than `match` and 10X faster than `split`; in my motivating work I've seen it come out 10-14X faster. I also think it makes the code easier to read and reason about, because you can see what is being subtracted from what directly in the code.

## Citation

BibTeX citation:
```bibtex
@online{mflight2018,
  author = {Robert M Flight},
  title = {Split - {Unsplit} {Anti-Pattern}},
  date = {2018-07-17},
  url = {https://rmflight.github.io/posts/2018-07-17-split-unsplit-anti-pattern},
  langid = {en}
}
```