For loops often hide much of the actual logic of your code behind the boilerplate needed to run a loop. Splitting your data with split can often be clearer, and faster.
R
for-loop
split
purrr
development
Author
Robert M Flight
Published
February 13, 2019
TL;DR
Sometimes for loops are useful, and sometimes they are better avoided: they don't really help you understand your data, and even if you try to optimize them, they might still be slower than other ways of doing things.
Comparing Groups
I have some code where I am trying to determine duplicates of a group of things. This data looks something like this:
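A minimal, made-up data set with that shape might be built as follows. The column names v1 and v2 come from the text; the item names, the values, and the object name example_data are assumptions for illustration only.

```r
# Hypothetical example data (not the author's): three distinct combinations
# of five values each, reused so that some items share a combination.
combos <- list(
  c("a", "b", "c", "d", "e"),
  c("a", "b", "c", "d", "f"),
  c("b", "c", "d", "e", "f")
)
items <- paste0("item", 1:6)

# Each item in v1 appears on 5 rows, one row per value in v2.
# Items 1 and 4, 2 and 5, and 3 and 6 get the same combination.
example_data <- data.frame(
  v1 = rep(items, each = 5),
  v2 = unlist(combos[rep(1:3, times = 2)]),
  stringsAsFactors = FALSE
)

head(example_data)
```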
In this case, every item in v1 has 5 things in v2. What I really want is to group together the items in v1 that have the same combination of things in v2. My initial function splits everything in v2 by v1 and then compares all the splits to each other, saving the matches and removing anything that has already been compared and found to be the same as we go. This required two loops: a while loop that runs as long as there is data left to check, and a for loop inside it that checks everything else left in the list against the current entry. The list of things that are identical to each other is pre-initialized so we don't take a hit on allocation, and entries that have been checked or noted as identical are deleted. Although the variable names are changed, the code for that function is below.
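As a rough sketch of the two-loop approach described above (not the author's exact function), something like the following would work. The function name group_by_loops, the example data, and the choice to sort each split so that order within a combination does not matter are all assumptions.

```r
# Hypothetical example data: items 1 and 4, 2 and 5, 3 and 6 share a combination.
combos <- list(
  c("a", "b", "c", "d", "e"),
  c("a", "b", "c", "d", "f"),
  c("b", "c", "d", "e", "f")
)
example_data <- data.frame(
  v1 = rep(paste0("item", 1:6), each = 5),
  v2 = unlist(combos[rep(1:3, times = 2)]),
  stringsAsFactors = FALSE
)

# Sketch only: group_by_loops is an illustrative name.
group_by_loops <- function(data) {
  # Split v2 by v1; sorting is an assumption so that order is ignored.
  split_v2 <- lapply(split(data$v2, data$v1), sort)

  # Pre-allocate for the worst case, where every item is its own group.
  groups <- vector("list", length(split_v2))
  n_group <- 0

  # Outer loop: keep going while there is anything left to check.
  while (length(split_v2) > 0) {
    reference <- split_v2[[1]]
    same <- names(split_v2)[1]

    # Inner loop: compare every remaining entry against the reference.
    for (i_name in names(split_v2)[-1]) {
      if (identical(split_v2[[i_name]], reference)) {
        same <- c(same, i_name)
      }
    }

    n_group <- n_group + 1
    groups[[n_group]] <- same

    # Delete everything that has now been assigned to a group.
    split_v2[same] <- NULL
  }

  # Drop the unused pre-allocated slots.
  groups[seq_len(n_group)]
}

loop_groups <- group_by_loops(example_data)
loop_groups
```

Note how much of the function is bookkeeping (counters, deletion, pre-allocation) rather than the comparison itself.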
The code works, but it doesn't really make me think about what it's doing: the two loops hide the fact that what is really going on is comparing things to one another. Miles McBain recently posted on this point: loops can be necessary, but one should really ask whether they are needed in a given case, whether they hide something about the data, and whether there is a different way to do the same thing.
This made me realize that what I really wanted to do was split the items in v1 by the unique combinations of things in v2, because split will group things together nicely for you, without any extra work. But I don't have those combinations in a form that split can use. So my solution is to iterate over the splits using purrr, create a representation of each group as a character value, and then call split again at the very end based on that character representation.
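A minimal sketch of that idea, assuming the same made-up data as before, could look like this. The function name group_by_split, the "|" separator (which assumes no value in v2 contains that character), and the sorting step are assumptions rather than the author's exact code.

```r
library(purrr)

# Hypothetical example data: items 1 and 4, 2 and 5, 3 and 6 share a combination.
combos <- list(
  c("a", "b", "c", "d", "e"),
  c("a", "b", "c", "d", "f"),
  c("b", "c", "d", "e", "f")
)
example_data <- data.frame(
  v1 = rep(paste0("item", 1:6), each = 5),
  v2 = unlist(combos[rep(1:3, times = 2)]),
  stringsAsFactors = FALSE
)

# Sketch only: group_by_split is an illustrative name.
group_by_split <- function(data) {
  # First split: the v2 values belonging to each v1 item.
  split_v2 <- split(data$v2, data$v1)

  # Collapse each item's values into a single character key.
  # Sorting first means the same set in a different order gives the same key.
  combo_key <- map_chr(split_v2, function(x) paste(sort(x), collapse = "|"))

  # Second split: group the item names by their key.
  split(names(combo_key), combo_key)
}

split_groups <- group_by_split(example_data)
split_groups
```

Here the grouping is done entirely by split, so there are no counters or deletions to keep track of. In base R, vapply with a character(1) template could stand in for map_chr.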
@online{mflight2019,
author = {Robert M Flight},
title = {Comparisons Using for Loops Vs Split},
date = {2019-02-13},
url = {https://rmflight.github.io/posts/2019-02-13-for-loops-vs-split},
langid = {en}
}