Getting some speed using `dplyr` joins rather than my more intuitive split -> unsplit pattern.

If you notice yourself using `split` -> `unsplit` / `rbind` on two objects to match items up, maybe you should be using one of `dplyr`'s `*_join` functions instead. Read below for concrete examples.

I have had a lot of calculations lately that involve some sort of normalization or scaling of groups of related values, each group by a different factor. Let's set up an example where we will have `1e5` values in `10` groups, each group of values being normalized by its own factor.

```
library(microbenchmark)
library(profvis)
set.seed(1234)

# 1e5 values assigned randomly to 10 groups, plus one normalization factor per group
n_point <- 1e5
to_normalize <- data.frame(value = rnorm(n_point),
                           group = sample(seq_len(10), n_point, replace = TRUE))
normalization <- data.frame(group = seq_len(10), normalization = rnorm(10))
```
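
To make the structures concrete, we can peek at the first few rows of each data.frame:

```
# inspect the data to be normalized and the per-group factors
head(to_normalize)
head(normalization)
```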

For each `group` in `to_normalize`, we want to apply the normalization factor in `normalization`. In this case, I'm going to do a simple subtraction.

My initial implementation was to iterate over the groups, use `%in%` to match each `group` between the normalization factors and the data to be normalized, and modify the values in place. **Don't do this!!** It was the slowest method I've used in my real package code!

```
match_normalization <- function(normalize_data, normalization_factors){
  use_groups <- normalization_factors$group
  for (igroup in use_groups) {
    # subset the data by group on every iteration and modify it in place
    normalize_data[normalize_data$group %in% igroup, "value"] <-
      normalize_data[normalize_data$group %in% igroup, "value"] -
      normalization_factors[normalization_factors$group %in% igroup, "normalization"]
  }
  normalize_data
}
```

```
micro_results <- summary(microbenchmark(match_normalization(to_normalize, normalization)))
knitr::kable(micro_results)
```

All times reported by `microbenchmark` below are in milliseconds.

expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
match_normalization(to_normalize, normalization) | 33.52854 | 35.79094 | 36.64787 | 36.27017 | 36.78996 | 65.96424 | 100 |

Not bad for the test data. But can we do better?

My next thought was to `split` both objects by their `group`s, iterate over the paired groups using `purrr::map2`, and then recombine the results with `rbind`.

```
split_normalization <- function(normalize_data, normalization_factors){
  # split the factors and the data into per-group lists
  split_norm <- split(normalization_factors$normalization, normalization_factors$group)
  split_data <- split(normalize_data, normalize_data$group)
  # walk the two lists in parallel, subtracting each group's factor
  out_data <- purrr::map2(split_data, split_norm, function(.x, .y){
    .x$value <- .x$value - .y
    .x
  })
  # recombine the per-group pieces into a single data.frame
  do.call(rbind, out_data)
}
```

```
micro_results2 <- summary(microbenchmark(match_normalization(to_normalize, normalization),
                                         split_normalization(to_normalize, normalization)))
knitr::kable(micro_results2)
```

expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
match_normalization(to_normalize, normalization) | 33.71692 | 41.09938 | 43.17703 | 42.23761 | 43.82695 | 91.57123 | 100 |
split_normalization(to_normalize, normalization) | 77.33925 | 85.10808 | 91.23737 | 91.13441 | 94.35732 | 161.79138 | 100 |

My final thought was to join the two data.frames together using `dplyr`, so that each value and its normalization factor are automatically matched up in the same row.

```
join_normalization <- function(normalize_data, normalization_factors){
  # a single join matches every row of the data to its group's factor
  normalize_data <- dplyr::right_join(normalize_data, normalization_factors,
                                      by = "group")
  # the subtraction is then one vectorized operation over the whole data.frame
  normalize_data$value <- normalize_data$value - normalize_data$normalization
  normalize_data[, c("value", "group")]
}
```
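
Before benchmarking all three, it is worth a quick sanity check that they return the same values. A minimal sketch: the methods can return rows in different orders, so each result is sorted into a common order before comparing.

```
res_match <- match_normalization(to_normalize, normalization)
res_split <- split_normalization(to_normalize, normalization)
res_join  <- join_normalization(to_normalize, normalization)

# put each result into a common row order, with a common column
# order and default rownames, before comparing
sort_df <- function(df) {
  df <- df[order(df$group, df$value), c("value", "group")]
  rownames(df) <- NULL
  df
}
all.equal(sort_df(res_match), sort_df(res_split))
all.equal(sort_df(res_match), sort_df(res_join))
```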

```
micro_results3 <- summary(microbenchmark(match_normalization(to_normalize, normalization),
                                         split_normalization(to_normalize, normalization),
                                         join_normalization(to_normalize, normalization)))
knitr::kable(micro_results3)
```

expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
match_normalization(to_normalize, normalization) | 33.974007 | 41.813316 | 45.47936 | 43.089628 | 46.33666 | 104.7553 | 100 |
split_normalization(to_normalize, normalization) | 76.452060 | 85.275248 | 90.35035 | 88.315268 | 92.88813 | 152.2833 | 100 |
join_normalization(to_normalize, normalization) | 7.743028 | 8.386193 | 13.59934 | 8.650142 | 14.31905 | 292.2638 | 100 |

So on my computer, the `split` and `match` implementations are mostly comparable, although on my motivating real-world example I actually got a 3X speedup by using the `split` method. That may be because of issues related to `DataFrame` and matching elements within that structure. The `join` method is 10-14X faster than the others, which is what I've seen in my motivating work. I also think it makes the code easier to read and reason about, because you can see directly in the code what is being subtracted from what.
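
As an aside, if you stay inside `dplyr`, the whole normalization can be written as one pipeline. Here is one possible sketch using `left_join` and `mutate` (I have not benchmarked this version):

```
library(dplyr)

normalized <- to_normalize %>%
  left_join(normalization, by = "group") %>%   # match each value to its group's factor
  mutate(value = value - normalization) %>%    # vectorized subtraction
  select(value, group)                         # drop the factor column again
```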
