## TL;DR

If you notice yourself using `split` -> `unsplit` / `rbind` on two objects to match items up, maybe you should be using one of the `dplyr::*_join` functions instead. Read below for concrete examples.

## Motivation

I have had a lot of calculations lately that involve some sort of normalization or scaling of a group of related values, with each group scaled by a different factor.

Let's set up an example where we will have `1e5` values in `10` groups, each group of values being normalized by its own factor.

```
library(microbenchmark)
library(profvis)
set.seed(1234)

# 1e5 values, assigned randomly to 10 groups
n_point <- 1e5
to_normalize <- data.frame(value = rnorm(n_point),
                           group = sample(seq_len(10), n_point, replace = TRUE))

# one normalization factor per group
normalization <- data.frame(group = seq_len(10), normalization = rnorm(10))
```

For each `group` in `to_normalize`, we want to apply the normalization factor in `normalization`. In this case, I'm going to do a simple subtraction.
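To make the goal concrete, here is the operation spelled out for a single group (a throwaway sketch using the objects defined above; the `g1_*` names are purely illustrative):

```
# for group 1: subtract that group's factor from every value in the group
g1_values <- to_normalize$value[to_normalize$group == 1]
g1_factor <- normalization$normalization[normalization$group == 1]
g1_normalized <- g1_values - g1_factor
```

Each of the implementations below does this for all `10` groups.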

## Match Them!

My initial implementation was to iterate over the groups, use `%in%` to `match` each `group` between the normalization factors and the data to be normalized, and modify the values in place. **Don't do this!!** It was the slowest method I tried in my real package code!

```
match_normalization <- function(normalize_data, normalization_factors){
  use_groups <- normalization_factors$group
  # for each group, find its rows via %in% and subtract that group's factor
  for (igroup in use_groups) {
    normalize_data[normalize_data$group %in% igroup, "value"] <-
      normalize_data[normalize_data$group %in% igroup, "value"] -
      normalization_factors[normalization_factors$group %in% igroup, "normalization"]
  }
  normalize_data
}
```
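Since `profvis` is already loaded, a quick profile shows where this version spends its time (a minimal sketch; the repeat count of 20 is arbitrary, just enough to collect samples):

```
# profile repeated calls; expect most of the time in the repeated
# data.frame subsetting and sub-assignment inside the loop
profvis::profvis({
  for (i in seq_len(20)) match_normalization(to_normalize, normalization)
})
```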

```
micro_results <- summary(microbenchmark(match_normalization(to_normalize, normalization)))
knitr::kable(micro_results)
```

All times are in milliseconds.

expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
match_normalization(to_normalize, normalization) | 37.38068 | 39.51295 | 41.10837 | 40.21366 | 41.46579 | 71.33775 | 100 |

Not bad for the test data. But can we do better?

## Split Them!

My next thought was to split each of them by `group`, iterate over the pairs of groups using `purrr::map2`, and then `rbind` the pieces back together.

```
split_normalization <- function(normalize_data, normalization_factors){
  # split both objects by group; split() orders each list by factor level,
  # so the elements pair up positionally
  split_norm <- split(normalization_factors$normalization, normalization_factors$group)
  split_data <- split(normalize_data, normalize_data$group)
  out_data <- purrr::map2(split_data, split_norm, function(.x, .y){
    .x$value <- .x$value - .y
    .x
  })
  # stack the per-group data.frames back together
  do.call(rbind, out_data)
}
```
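Note that the pairing in `map2` works only because `split()` orders both lists by the same factor levels. A standalone check (using the objects from the setup above) makes that assumption explicit:

```
# both lists are named by group level, in the same order
split_norm <- split(normalization$normalization, normalization$group)
split_data <- split(to_normalize, to_normalize$group)
stopifnot(identical(names(split_data), names(split_norm)))
```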

```
micro_results2 <- summary(microbenchmark(
  match_normalization(to_normalize, normalization),
  split_normalization(to_normalize, normalization)))
knitr::kable(micro_results2)
```

expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
match_normalization(to_normalize, normalization) | 37.43252 | 45.11638 | 48.32913 | 47.12768 | 50.36758 | 94.19223 | 100 |
split_normalization(to_normalize, normalization) | 77.30754 | 83.09115 | 88.22919 | 86.74504 | 90.65582 | 142.00019 | 100 |

## Join Them!

My final thought was to join the two data.frames together using `dplyr`, so that the values are automatically matched up.

```
join_normalization <- function(normalize_data, normalization_factors){
  # one join matches every row to its group's normalization factor
  normalize_data <- dplyr::right_join(normalize_data, normalization_factors,
                                      by = "group")
  normalize_data$value <- normalize_data$value - normalize_data$normalization
  # drop the normalization column again
  normalize_data[, c("value", "group")]
}
```
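Before timing all three, it's worth checking that they agree. A quick sanity check (a sketch; `ord` is a throwaway helper, and rows are sorted first because `split` and `join` may change the row order):

```
res_match <- match_normalization(to_normalize, normalization)
res_split <- split_normalization(to_normalize, normalization)
res_join  <- join_normalization(to_normalize, normalization)

# compare after sorting rows, ignoring row names
ord <- function(df) df[order(df$group, df$value), ]
all.equal(ord(res_match), ord(res_split), check.attributes = FALSE)
all.equal(ord(res_match), ord(res_join), check.attributes = FALSE)
```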

```
micro_results3 <- summary(microbenchmark(
  match_normalization(to_normalize, normalization),
  split_normalization(to_normalize, normalization),
  join_normalization(to_normalize, normalization)))
knitr::kable(micro_results3)
```

expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
match_normalization(to_normalize, normalization) | 37.293244 | 45.476034 | 50.465097 | 47.374326 | 52.58932 | 109.4402 | 100 |
split_normalization(to_normalize, normalization) | 70.600139 | 82.395249 | 87.115287 | 86.656357 | 91.43579 | 130.6727 | 100 |
join_normalization(to_normalize, normalization) | 4.168829 | 4.525986 | 7.020724 | 4.722218 | 5.17179 | 171.3386 | 100 |

## Conclusions

So on my computer, the `split` and `match` implementations are mostly comparable, although on my motivating real-world example I actually got a 3x speedup by using the `split` method. That may be because of issues related to `DataFrame` and matching elements within that structure. The `join` method is 10-14x faster than the others, which is what I've seen in my motivating work. I also think it makes the code easier to read and reason about, because you can see directly in the code what is being subtracted from what.