library(microbenchmark)
library(profvis)
set.seed(1234)
n_point <- 1e5
to_normalize <- data.frame(value = rnorm(n_point), group = sample(seq_len(10), n_point, replace = TRUE))
normalization <- data.frame(group = seq_len(10), normalization = rnorm(10))
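If you want to see what we are working with before any benchmarking, a quick look at the two data.frames (this inspection is my addition and is not used anywhere below):

# Quick peek at the generated data; not part of the benchmarks
head(to_normalize)   # 1e5 rows: a random value plus the group it belongs to
normalization        # 10 rows: one normalization factor per group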
TL;DR

If you notice yourself using split -> unsplit / rbind on two objects to match items up, maybe you should be using dplyr::join_ instead. Read below for concrete examples.
Motivation

I have had a lot of calculations lately that involve some sort of normalization or scaling of groups of related values, each group by a different factor. Let's set up an example where we have 1e5 values in 10 groups, each group of values being normalized by its own factor; the setup code at the top of this post does exactly that. For each group in to_normalize, we want to apply the normalization factor in normalization. In this case, I'm going to do a simple subtraction.
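To make the goal concrete, here is a tiny hand-worked illustration (my addition, with made-up toy values; the toy_data and toy_norm names exist only for this sketch):

# Toy illustration of the desired operation, independent of the benchmark data
toy_data <- data.frame(value = c(5, 7, 3), group = c(1, 1, 2))
toy_norm <- data.frame(group = c(1, 2), normalization = c(2, 10))
# Group 1 values should have 2 subtracted, group 2 values should have 10 subtracted,
# giving c(3, 5, -7) while keeping each value paired with its group.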
Match Them!

My initial implementation was to iterate over the groups, use %in% to match each group in both the normalization factors and the data to be normalized, and modify the data in place. Don't do this!! It was the slowest method I've used in my real package code!
match_normalization <- function(normalize_data, normalization_factors){
  use_groups <- normalization_factors$group

  for (igroup in use_groups) {
    normalize_data[normalize_data$group %in% igroup, "value"] <-
      normalize_data[normalize_data$group %in% igroup, "value"] -
      normalization_factors[normalization_factors$group %in% igroup, "normalization"]
  }
  normalize_data
}
micro_results <- summary(microbenchmark(match_normalization(to_normalize, normalization)))
knitr::kable(micro_results)
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
match_normalization(to_normalize, normalization) | 49.75035 | 53.45793 | 57.42852 | 54.73885 | 57.50818 | 115.9432 | 100 |
Not bad for the test data. But can we do better?
Split Them!

My next thought was to split both objects by their groups, iterate over the pairs of groups using purrr::map2, and then put the pieces back together with rbind.
split_normalization <- function(normalize_data, normalization_factors){
  split_norm <- split(normalization_factors$normalization, normalization_factors$group)

  split_data <- split(normalize_data, normalize_data$group)

  out_data <- purrr::map2(split_data, split_norm, function(.x, .y){
    .x$value <- .x$value - .y
    .x
  })
  do.call(rbind, out_data)
}
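One caveat worth noting about the split approach (my observation, not from the original post): purrr::map2 pairs the two lists purely by position, so this only works because split() orders both lists by the same group labels. A quick check along these lines guards against a silent mismatch; the *_check names exist only for this sketch:

# Sketch: confirm the two split() results line up group-for-group before mapping
split_norm_check <- split(normalization$normalization, normalization$group)
split_data_check <- split(to_normalize, to_normalize$group)
stopifnot(identical(names(split_data_check), names(split_norm_check)))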
micro_results2 <- summary(microbenchmark(match_normalization(to_normalize, normalization),
                                         split_normalization(to_normalize, normalization)))
knitr::kable(micro_results2)
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
match_normalization(to_normalize, normalization) | 54.09506 | 69.0412 | 75.46899 | 73.44315 | 78.03191 | 160.8331 | 100 |
split_normalization(to_normalize, normalization) | 103.18837 | 133.9091 | 141.51288 | 142.00616 | 148.73655 | 197.9781 | 100 |
Join Them!

My final thought was to join the two data.frames together using dplyr, so the values and their normalization factors are matched up automatically.
join_normalization <- function(normalize_data, normalization_factors){
  normalize_data <- dplyr::right_join(normalize_data, normalization_factors,
                                      by = "group")
  normalize_data$value <- normalize_data$value - normalize_data$normalization
  normalize_data[, c("value", "group")]
}
micro_results3 <- summary(microbenchmark(match_normalization(to_normalize, normalization),
                                         split_normalization(to_normalize, normalization),
                                         join_normalization(to_normalize, normalization)))
knitr::kable(micro_results3)
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
match_normalization(to_normalize, normalization) | 56.04426 | 68.50138 | 77.23205 | 74.45665 | 82.25683 | 184.9358 | 100 |
split_normalization(to_normalize, normalization) | 106.96651 | 135.82308 | 148.88644 | 145.27503 | 154.41702 | 284.8173 | 100 |
join_normalization(to_normalize, normalization) | 11.60424 | 13.26513 | 21.29059 | 14.18384 | 15.38526 | 488.3713 | 100 |
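Before trusting the timings, it is worth confirming that the three implementations agree; this check is my addition and was not part of the original benchmarks. The values are sorted first because the join can return rows in a different order:

# Sanity check: all three methods should produce the same set of normalized values
res_match <- match_normalization(to_normalize, normalization)
res_split <- split_normalization(to_normalize, normalization)
res_join  <- join_normalization(to_normalize, normalization)
all.equal(sort(res_match$value), sort(res_split$value))
all.equal(sort(res_match$value), sort(res_join$value))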
Conclusions

So on my computer, the split and match implementations are mostly comparable, although on my motivating real-world example I actually got a 3x speedup by using the split method. That may be because of issues related to DataFrame objects and matching elements within that structure. The join method is 10-14x faster than the others, which matches what I've seen in my motivating work. I also think it makes the code easier to read and reason about, because you can see what is being subtracted from what directly in the code.
Citation
@online{mflight2018,
author = {Robert M Flight},
title = {Split - {Unsplit} {Anti-Pattern}},
date = {2018-07-17},
url = {https://rmflight.github.io/posts/2018-07-17-split-unsplit-anti-pattern},
langid = {en}
}