# Split - Unsplit Anti-Pattern

## TL;DR

If you notice yourself using `split` -> `unsplit` / `rbind` on two objects to match items up, you should probably be using one of the `dplyr::*_join` functions instead. Read below for concrete examples.

## Motivation

I have had a lot of calculations lately that involve some sort of *normalization*, or scaling groups of related values, each group by its own factor.

Let's set up an example where we have `1e5` values in `10` groups, each group of values being normalized by its own value.

```r
library(microbenchmark)
library(profvis)
set.seed(1234)
n_point <- 1e5
to_normalize <- data.frame(value = rnorm(n_point),
                           group = sample(seq_len(10), n_point, replace = TRUE))

normalization <- data.frame(group = seq_len(10), normalization = rnorm(10))
```

For each `group` in `to_normalize`, we want to apply the normalization factor in `normalization`. In this case, I’m going to do a simple subtraction.
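On a toy scale, the intended operation looks like this (the values here are made up purely for illustration):

```r
# Toy illustration of the group-wise subtraction we want (values are made up).
vals <- data.frame(value = c(1, 2, 3, 4), group = c(1, 1, 2, 2))
norm <- data.frame(group = c(1, 2), normalization = c(10, 100))
# For group 1 subtract 10, for group 2 subtract 100:
expected <- c(1 - 10, 2 - 10, 3 - 100, 4 - 100)  # -9 -8 -97 -96
```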

## Match Them!

My initial implementation was to iterate over the groups, use `%in%` to match each `group` between the normalization factors and the data to be normalized, and modify in place. Don't do this!! It was the slowest method in my real package code!

```r
match_normalization <- function(normalize_data, normalization_factors) {
  use_groups <- normalization_factors$group

  for (igroup in use_groups) {
    normalize_data[normalize_data$group %in% igroup, "value"] <-
      normalize_data[normalize_data$group %in% igroup, "value"] -
      normalization_factors[normalization_factors$group %in% igroup, "normalization"]
  }
  normalize_data
}
```
```r
micro_results <- summary(microbenchmark(match_normalization(to_normalize, normalization)))
knitr::kable(micro_results)
```
| expr | min | lq | mean | median | uq | max | neval |
|:-----|----:|---:|-----:|-------:|---:|----:|------:|
| match_normalization(to_normalize, normalization) | 37.38068 | 39.51295 | 41.10837 | 40.21366 | 41.46579 | 71.33775 | 100 |

Not bad for the test data. But can we do better?

## Split Them!

My next thought was to split both objects by their `group`s, iterate over the pairs of groups using `purrr::map2`, and then `rbind` the pieces back together.

```r
split_normalization <- function(normalize_data, normalization_factors) {
  split_norm <- split(normalization_factors$normalization, normalization_factors$group)

  split_data <- split(normalize_data, normalize_data$group)

  out_data <- purrr::map2(split_data, split_norm, function(.x, .y) {
    .x$value <- .x$value - .y
    .x
  })
  do.call(rbind, out_data)
}
```
```r
micro_results2 <- summary(microbenchmark(match_normalization(to_normalize, normalization),
                                         split_normalization(to_normalize, normalization)))
knitr::kable(micro_results2)
```
| expr | min | lq | mean | median | uq | max | neval |
|:-----|----:|---:|-----:|-------:|---:|----:|------:|
| match_normalization(to_normalize, normalization) | 37.43252 | 45.11638 | 48.32913 | 47.12768 | 50.36758 | 94.19223 | 100 |
| split_normalization(to_normalize, normalization) | 77.30754 | 83.09115 | 88.22919 | 86.74504 | 90.65582 | 142.00019 | 100 |

## Join Them!

My final thought was to join the two data.frames together using `dplyr`, so the rows are automatically matched up.

```r
join_normalization <- function(normalize_data, normalization_factors) {
  normalize_data <- dplyr::right_join(normalize_data, normalization_factors,
                                      by = "group")

  normalize_data$value <- normalize_data$value - normalize_data$normalization
  normalize_data[, c("value", "group")]
}
```
```r
micro_results3 <- summary(microbenchmark(match_normalization(to_normalize, normalization),
                                         split_normalization(to_normalize, normalization),
                                         join_normalization(to_normalize, normalization)))
knitr::kable(micro_results3)
```
| expr | min | lq | mean | median | uq | max | neval |
|:-----|----:|---:|-----:|-------:|---:|----:|------:|
| match_normalization(to_normalize, normalization) | 37.293244 | 45.476034 | 50.465097 | 47.374326 | 52.58932 | 109.4402 | 100 |
| split_normalization(to_normalize, normalization) | 70.600139 | 82.395249 | 87.115287 | 86.656357 | 91.43579 | 130.6727 | 100 |
| join_normalization(to_normalize, normalization) | 4.168829 | 4.525986 | 7.020724 | 4.722218 | 5.17179 | 171.3386 | 100 |

## Conclusions

So on my computer, the `split` and `match` implementations are mostly comparable, although in my motivating real-world example I actually got a 3X speedup by using the `split` method. That may be because of issues related to `DataFrame` and matching elements within that structure. The `join` method is 10-14X faster than the others, which matches what I've seen in my motivating work. I also think it makes the code easier to read and reason about, because you can see directly in the code what is being subtracted from what.
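For completeness, the same join can also be done without `dplyr` using base R's `merge` (the function name `merge_normalization` below is mine, not from the post; note that `merge` does not guarantee the original row order, while `dplyr` joins preserve it):

```r
# A base-R sketch of the join approach, using merge() instead of dplyr.
# merge_normalization is a hypothetical name for this illustration.
merge_normalization <- function(normalize_data, normalization_factors) {
  merged <- merge(normalize_data, normalization_factors, by = "group")
  merged$value <- merged$value - merged$normalization
  merged[, c("value", "group")]
}

set.seed(1234)
n_point <- 1e5
to_normalize <- data.frame(value = rnorm(n_point),
                           group = sample(seq_len(10), n_point, replace = TRUE))
normalization <- data.frame(group = seq_len(10), normalization = rnorm(10))

res <- merge_normalization(to_normalize, normalization)
```

This keeps the same readability benefit (the subtraction is a single visible line), though in my experience `dplyr` joins tend to be faster than `merge` on larger data.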