Edit 2022-12-03: Don’t use {drake}
, use {targets}
now. Everything else still applies.
TL;DR
I was wrong about using an R package to bundle an analysis. That happens a lot I suppose. But I needed to own up here.
Using R Packages for Analysis
A couple of years ago, I wrote at least three different blog posts describing how using R packages to bundle up an analysis was the best way to do things. I went into the benefits of this approach in general (Flight 2014a), how to do actually go through one in practice (Flight 2014b), and bashed on an alternative, ProjectTemplate (Flight 2014c). I was wrong. So, so very wrong.
What Happened in Practice?
Although I was espousing this philosophy for doing the analysis as part of an R package, in practice, it was a pain in the butt to actually carry out. Write functions, update installed package, update vignette, hope it all runs. I honestly would end up with a ton of functions written in the report vignette, and hope that {knitr} caching would save me (or have to delete the cache because something went wrong).
Eventually, I started having project directories with a report, and throwing all my functions into the top of the rmarkdown, and if I was lucky I might write objects (tables, figures, etc) out into directories that could then be sent to collaborators.
A Dim Bulb
However, {drake} was becoming popular as a proper workflow manager for R, with some very nice capabilities, and I started using it to help out with some large caching things I wanted for analyses for a manuscript. I still didn’t use it for anything else.
The documentation for {drake} very much encourages a functional workflow, where one writes functions that take inputs from other functions and generate outputs for other functions (Will Landau 2020). And though Will encourages this in the documentation, for some reason I was not grokking how this worked in practice, and how it could work easily for all of my projects with some simple conventions around folder structure borrowed from other projects (or even from R packages).
A Little Brighter
Miles McBain was talking about some of these things (around data workflows) intermittently on Twitter, and then posted about a functional approach to analyses using {drake} (McBain 2020a). When I read it, I was really, really impressed with it. Unfortunately, I still wasn’t really clear on how to implement and use his method with {drake}. This is not the fault of Miles’ post or {drake}’s documentation (see above). It is my fault for not reading it slowly, and taking the time to try and stop through it with the {dflow} implementation and {drake} documentation together.
Lightbulb Goes Off
Miles’ post was at the end of April, 2020. In August 2020, he was a guest at the New York Open Statistical Programming Meetup virtual event where he presented his {dflow} workflow in That Feeling of Workflowing (Lander 2020) (McBain 2020b). Seeing and hearing Miles explain everything, and then actually work through an example that everything about his approach finally clicked for me. And really, any project based layout that combines a directory for R scripts, and inputs and outputs too (e.g. {ProjectTemplate} and others).
Now, I know, that still sounds like an R package, with functions in /R
, reports in /vignettes
, and built in documentation. But it’s lighter weight than a package, and the functions are specific to just this one analysis.
{dflow} includes some nice boilerplate code for keeping your list of packages in one place, and getting all of your functions into the workspace. I was so impressed that I installed {dflow} the next day and working out how to convert my current analysis project to use it. Shortly after experimenting with that one analysis, I was converted. Every new project I start is being done using {dflow}, and any old projects I’ve gone back to are being converted over to use it.
I would now recommend using {drake} / {targets} / {dflow} or {ProjectTemplate} (or any other directory based system with caching) for R analysis projects. And if you have the discipline and find it works for you, then using R packages might still work (there is even a {drake} / {targets} way of working within packages that I hadn’t noticed before). However, I think the general project based ideas are just fine for many analyses, and packages are overkill and take too much work for the analysis authors to keep it up.
{drake} (and now {targets}) is a modern make like engine specifically for R, with an R based caching system (saving things so they don’t have to be repeated again) for your outputs that really, really helps you keep track of things, organize your inputs and outputs, and work in a functional way, which are all good things. {dflow} is a very pretty wrapper for setting up {drake} analysis projects.
Edited on 2021-03-03 to give more credence to Will Landau and the {drake} package itself, as Miles McBain felt I gave too much credence to him instead of {drake}.
References
Reuse
Citation
@online{mflight2021,
author = {Robert M Flight},
title = {Packages {Don’t} {Work} {Well} for {Analyses} in {Practice}},
date = {2021-03-02},
url = {https://rmflight.github.io/posts/2021-03-02-packages-dont-work-well-for-analyses-in-practice},
langid = {en}
}