```
library(microbenchmark)
library(profvis)
set.seed(1234)
n_point <- 1e5
to_normalize <- data.frame(value = rnorm(n_point),
                           group = sample(seq_len(10), n_point, replace = TRUE))
normalization <- data.frame(group = seq_len(10), normalization = rnorm(10))
```

## TL;DR

If you notice yourself using `split` -> `unsplit` / `rbind` on two objects to match items up, maybe you should be using one of the `dplyr::*_join` functions instead. Read below for concrete examples.

## Motivation

I have had a lot of calculations lately that involve some sort of normalization or scaling of a group of related values, each group by a different factor.

Let's set up an example where we have `1e5` values in `10` groups, each group of values being normalized by its own factor.

For each `group` in `to_normalize`, we want to apply the matching normalization factor in `normalization`. In this case, I'm going to do a simple subtraction.
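A quick look at the two tables shows the shape of the problem:

```
# 1e5 values spread across 10 groups, one normalization factor per group
head(to_normalize)
head(normalization)
```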

## Match Them!

My initial implementation was to iterate over the groups, using `%in%` to match each `group` between the normalization factors and the data to be normalized, and modify the values in place. **Don't do this!!** It was the slowest method I've used in my real package code!

```
match_normalization <- function(normalize_data, normalization_factors){
  use_groups <- normalization_factors$group

  # for each group, find its rows via %in% and subtract that
  # group's normalization factor in place
  for (igroup in use_groups) {
    normalize_data[normalize_data$group %in% igroup, "value"] <-
      normalize_data[normalize_data$group %in% igroup, "value"] -
      normalization_factors[normalization_factors$group %in% igroup, "normalization"]
  }
  normalize_data
}
```

```
micro_results <- summary(microbenchmark(match_normalization(to_normalize, normalization)))
knitr::kable(micro_results)
```

|expr | min| lq| mean| median| uq| max| neval|
|:---|---:|---:|---:|---:|---:|---:|---:|
|match_normalization(to_normalize, normalization) | 49.75035| 53.45793| 57.42852| 54.73885| 57.50818| 115.9432| 100|

Not bad for the test data. But can we do better?

## Split Them!

My next thought was to `split` both the data and the factors by their `group`s, iterate over the pairs of pieces with `purrr::map2`, and then `rbind` the results back together.

```
split_normalization <- function(normalize_data, normalization_factors){
  # split the factors and the data into per-group lists
  split_norm <- split(normalization_factors$normalization, normalization_factors$group)
  split_data <- split(normalize_data, normalize_data$group)

  # subtract each group's factor, then stitch the pieces back together
  out_data <- purrr::map2(split_data, split_norm, function(.x, .y){
    .x$value <- .x$value - .y
    .x
  })
  do.call(rbind, out_data)
}
```
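One detail that makes the `map2` pairing work: `split` returns a list named by group, in sorted group order, so `split_data` and `split_norm` line up element by element (it also means the rows come back sorted by `group` after the `rbind`):

```
# both splits produce lists in the same sorted group order
names(split(normalization$normalization, normalization$group))
```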

```
micro_results2 <- summary(microbenchmark(match_normalization(to_normalize, normalization),
                                         split_normalization(to_normalize, normalization)))
knitr::kable(micro_results2)
```

|expr | min| lq| mean| median| uq| max| neval|
|:---|---:|---:|---:|---:|---:|---:|---:|
|match_normalization(to_normalize, normalization) | 54.09506| 69.0412| 75.46899| 73.44315| 78.03191| 160.8331| 100|
|split_normalization(to_normalize, normalization) | 103.18837| 133.9091| 141.51288| 142.00616| 148.73655| 197.9781| 100|

## Join Them!

My final thought was to join the two data.frames together using `dplyr`, so that the values and their normalization factors are matched up automatically.

```
join_normalization <- function(normalize_data, normalization_factors){
  # right_join lines each value up with its group's normalization factor
  normalize_data <- dplyr::right_join(normalize_data, normalization_factors,
                                      by = "group")
  normalize_data$value <- normalize_data$value - normalize_data$normalization
  normalize_data[, c("value", "group")]
}
```

```
micro_results3 <- summary(microbenchmark(match_normalization(to_normalize, normalization),
                                         split_normalization(to_normalize, normalization),
                                         join_normalization(to_normalize, normalization)))
knitr::kable(micro_results3)
```

|expr | min| lq| mean| median| uq| max| neval|
|:---|---:|---:|---:|---:|---:|---:|---:|
|match_normalization(to_normalize, normalization) | 56.04426| 68.50138| 77.23205| 74.45665| 82.25683| 184.9358| 100|
|split_normalization(to_normalize, normalization) | 106.96651| 135.82308| 148.88644| 145.27503| 154.41702| 284.8173| 100|
|join_normalization(to_normalize, normalization) | 11.60424| 13.26513| 21.29059| 14.18384| 15.38526| 488.3713| 100|
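Before trusting the timings, it's worth checking that the three implementations actually agree. This check is my own addition rather than part of the benchmarks above; since `split_normalization` hands back its rows reordered by `group`, everything gets sorted the same way before comparing:

```
# helper (my addition): put rows in a common order so results
# from the three implementations can be compared directly
sort_by_group <- function(df) {
  df <- df[order(df$group, df$value), c("value", "group")]
  rownames(df) <- NULL
  df
}

all.equal(sort_by_group(match_normalization(to_normalize, normalization)),
          sort_by_group(split_normalization(to_normalize, normalization)))
all.equal(sort_by_group(match_normalization(to_normalize, normalization)),
          sort_by_group(join_normalization(to_normalize, normalization)))
```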

## Conclusions

So on my computer, the `split` and `match` implementations are mostly comparable, although on my motivating real-world example I actually got a 3X speedup by using the `split` method. That may be because of issues related to `DataFrame` and matching elements within that structure. The `join` method is 10-14X faster than the others, which is what I've seen in my motivating work. I also think it makes the code easier to read and reason about, because you can see directly in the code what is being subtracted from what.
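To make that readability point concrete, the same join-then-subtract idea reads naturally as an ordinary `dplyr` pipeline. This is a sketch of how I'd write it inline in an analysis, not a fourth benchmarked method (it uses `left_join`, which behaves the same as the `right_join` above here because every group appears in both tables):

```
suppressMessages(library(dplyr))

normalized <- to_normalize %>%
  left_join(normalization, by = "group") %>%  # pair each value with its factor
  mutate(value = value - normalization) %>%   # the actual normalization
  select(value, group)
```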


## Citation

```
@online{mflight2018,
  author = {Robert M Flight},
  title = {Split - {Unsplit} {Anti-Pattern}},
  date = {2018-07-17},
  url = {https://rmflight.github.io/posts/2018-07-17-split-unsplit-anti-pattern},
  langid = {en}
}
```