Following from my last post, I am going to go step by step through the process I use to generate an analysis as a package vignette. This will be an analysis of the tweets from the 2012 and 2014 ISMB conferences (thanks to Neil and Stephen for compiling the data).
I will link to individual commits so that you can see how things change as we go along.
To start, we will initialize the package. Working in rstudio, devtools makes this rather easy:
    Creating package ismbTweetAnalysis in ~/Documents/projects/personal
    No DESCRIPTION found. Creating with values:

    Package: ismbTweetAnalysis
    Title: What the package does (short line)
    Version: 0.1
    Authors@R: "First Last <firstname.lastname@example.org> [aut, cre]"
    Description: What the package does (paragraph)
    Depends: R (>= 3.0.3)
    License: What license is it under?
    LazyData: true

    Adding Rstudio project file to ismbTweetAnalysis
Alternatively, you can use File > New Project > New Directory > R Package in rstudio. Don’t forget to check Create a git repository (or run git init in the directory). Note that the devtools created package will pass CRAN tests, whereas the rstudio created one will not.
Next, edit the DESCRIPTION file: you will need to change the License, and add VignetteBuilder: knitr at the end. Here is what my initial setup looks like.
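The same DESCRIPTION edits can also be scripted. The following is only a minimal sketch in base R, not what the linked commit does: `set_description_fields()` is a hypothetical helper, and the license string is a placeholder assumption rather than the license actually chosen for this package.

```r
# Hypothetical helper (not part of devtools): set fields in a DESCRIPTION file.
set_description_fields <- function(path, fields) {
  desc <- read.dcf(path)  # one-row character matrix, one column per field
  for (f in names(fields)) {
    if (f %in% colnames(desc)) {
      desc[1, f] <- fields[[f]]
    } else {
      # Append a new column for a field that is not present yet
      desc <- cbind(desc, matrix(fields[[f]], nrow = 1, dimnames = list(NULL, f)))
    }
  }
  write.dcf(desc, file = path)
  invisible(desc)
}

# Demonstration on a throwaway file, so no real package is touched
tmp <- tempfile("DESCRIPTION")
writeLines(c("Package: ismbTweetAnalysis",
             "Version: 0.1",
             "License: What license is it under?"), tmp)
set_description_fields(tmp, c(License = "MIT + file LICENSE",  # assumed license
                              VignetteBuilder = "knitr"))
updated <- read.dcf(tmp)
```

Editing the file by hand is just as good; the point is only that the License placeholder gets replaced and the VignetteBuilder field gets added.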
RStudio Project Options
In addition, to make our life easier, we will change some project options. Under Tools > Project Options > Build Tools, check Generate documentation with Roxygen, and turn on all of its options. In particular, we want to roxygenize whenever we Build & Reload, and have roxygen control the NAMESPACE file so we don’t have to worry about it.
Alternatively, you can use devtools to update documentation and reload the functions.
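As a rough sketch of the devtools route, the two calls below regenerate the documentation and reload the code. `refresh_package()` is a hypothetical wrapper introduced here for illustration; `devtools::document()` and `devtools::load_all()` are the real functions it calls. Note that `load_all()` loads the code without installing the package, so it is similar to, but not the same as, Build & Reload.

```r
# Hypothetical convenience wrapper around two devtools calls
refresh_package <- function(pkg = ".") {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    stop("devtools is required for this sketch")
  }
  devtools::document(pkg)  # regenerate man/ and NAMESPACE from roxygen tags
  devtools::load_all(pkg)  # make the package functions available in the session
  invisible(pkg)
}

# Usage, from the package directory:
# refresh_package()
```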
Having this option of reloading the package every time I write a new function is what makes this easy. I write the new function, reload, and I can keep chugging along with my analysis document. And if I have to restart, I just run all chunks to get back to where I need to be.
Now we need some data. Neil’s data from 2012 is available in CSV format; however, the tweets themselves contain commas, so we will download the rdata file and use that instead. Stephen’s data from 2014 comes as three separate files, so we will download all three and combine them. Both initial data sets will go in the /inst/extdata folder, and we will clean them up.
Here we have added our 4 data files.
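Combining several files with the same layout can be sketched as below. This is an illustration under assumptions, not the function from the package: `read_and_combine()` is a hypothetical name, and the files are assumed to be tab-delimited with a header row and identical columns. The demonstration writes three small files to a temporary directory instead of using the real archive.

```r
# Hypothetical helper: read several delimited files and stack them row-wise.
# Assumes every file has a header row and the same columns.
read_and_combine <- function(files, sep = "\t") {
  # quote = "" keeps quotation marks inside tweets from being parsed as quoting
  pieces <- lapply(files, read.delim, sep = sep, quote = "",
                   stringsAsFactors = FALSE)
  do.call(rbind, pieces)
}

# Demonstration data: three small files in a temporary directory
demo_dir <- tempfile("extdata")
dir.create(demo_dir)
for (i in 1:3) {
  part <- data.frame(user = paste0("user", i),
                     text = paste("tweet, with a comma", i),
                     stringsAsFactors = FALSE)
  write.table(part, file.path(demo_dir, paste0("part", i, ".txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

files <- list.files(demo_dir, full.names = TRUE)
tweets <- read_and_combine(files)
```

Inside an installed package, `system.file("extdata", package = "ismbTweetAnalysis")` would return the folder that corresponds to inst/extdata.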
We are going to write this analysis as the vignette of the package, using R markdown as the language. To do that we need to create the file and add some boilerplate at the top so that the vignette gets generated properly. Here is the initial vignette; it contains nothing but the boilerplate index definition, which is important for the vignette to be built.
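For orientation, one common form of knitr vignette boilerplate looks like the lines written out below. This is a sketch under assumptions: the file name and the index title are hypothetical, and the header in the linked commit may differ. The file is written to a temporary directory here; in the package it would live in the vignettes folder.

```r
# Hypothetical title; the header lines are one common knitr form
vignette_header <- c(
  "<!--",
  "%\\VignetteEngine{knitr::knitr}",
  "%\\VignetteIndexEntry{ISMB Tweet Analysis}",
  "-->",
  "",
  "# ISMB Tweet Analysis"
)

# Written to a temporary directory here; in a package it would go in vignettes/
vignette_dir <- file.path(tempdir(), "vignettes")
dir.create(vignette_dir, showWarnings = FALSE)
vignette_file <- file.path(vignette_dir, "ismbTweetAnalysis.Rmd")  # assumed name
writeLines(vignette_header, vignette_file)
```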
Start the Analysis!
At this point we can start the analysis. The actual analysis will be done in the Rmd vignette file. The basic process is to write prose describing the analysis in the Rmd, with the code that generates results and figures embedded alongside it, and to add functions and their documentation (as roxygen tags) in the .R file, doing iterations of Build/Reload along the way. Each Build/Reload after writing new functions in the .R file makes them available to us in our workspace, with tab-completion.
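To make the pattern concrete, here is what a small function with roxygen tags in the .R file might look like. It is a hypothetical example: `count_by()` and the `screen_name` column are assumptions for illustration, not the functions or column names from the actual package.

```r
#' Count occurrences of each value in one column
#'
#' Hypothetical example of a documented, exported helper.
#'
#' @param tweets a data.frame of tweets
#' @param column name of the column to count by
#' @return a data.frame with columns value and n, sorted by decreasing count
#' @export
count_by <- function(tweets, column) {
  counts <- table(tweets[[column]])
  out <- data.frame(value = names(counts), n = as.integer(counts),
                    stringsAsFactors = FALSE)
  out <- out[order(-out$n, out$value), ]  # most frequent first, ties by name
  rownames(out) <- NULL
  out
}

# Made-up data to show the call as it would appear in an Rmd chunk
example_tweets <- data.frame(
  screen_name = c("a", "b", "a", "c", "a", "b"),
  stringsAsFactors = FALSE
)
top <- count_by(example_tweets, "screen_name")
```

After a Build/Reload, roxygen turns the comment block into a help page and an export entry in NAMESPACE, and the function can be called from the vignette.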
The following are bullet point summaries of points when I committed or built/reloaded, with links to the commit so you can see what has changed in the package.
- Adding description of data sources to analysis
- Munging 2012 data a little, saving, and documenting
- Now we can load up this data with
- Function written and exported for reading ST’s data files
- Read in, combine, and re-save ST’s archive
- Note that this and the previous chunk have eval=FALSE, so that they are not run in the analysis, but they were run interactively while I was doing the work.
- Simple histogram of 2012 tweets by day
- Making a counting function by
- Examining the top tweeters using previous function
- Fixing data files, because of issues with having the same named object in different RData files
- Examining top tweeters in 2014
- Density of tweets with respect to starting time of conference
- Counting how often a specific tweet was retweeted
- Getting raw tweet ranks for each individual
- Examining the ranks by total retweeted tweets per user
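As an illustration of one of these steps, counting how often a tweet was retweeted could be sketched as follows. This is not the function from the package: `count_retweets()` is a hypothetical name, and it assumes that retweets are marked by a leading "RT @user:" prefix in the tweet text, which may not match how the real data encodes them.

```r
# Hypothetical sketch: count retweets by stripping a leading "RT @user:" prefix.
count_retweets <- function(text) {
  rt_pattern <- "^RT @[A-Za-z0-9_]+:? "
  is_rt <- grepl(rt_pattern, text)
  if (!any(is_rt)) {
    return(data.frame(text = character(0), retweets = integer(0),
                      stringsAsFactors = FALSE))
  }
  original <- sub(rt_pattern, "", text[is_rt])  # text of the tweet being retweeted
  counts <- sort(table(original), decreasing = TRUE)
  data.frame(text = as.character(names(counts)),
             retweets = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Made-up tweets for demonstration
example_text <- c(
  "Great keynote at #ISMB",
  "RT @alice: Great keynote at #ISMB",
  "RT @alice: Great keynote at #ISMB",
  "RT @bob: Slides are online",
  "Coffee break"
)
rt <- count_retweets(example_text)
```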
At this point I’m going to stop. Now we have an analysis (that we will make into a nice output shortly), and we have munged 2 data sets and written 6 functions that may be useful in other contexts.
To preview the report, you can use the Knit HTML button in rstudio, or use knitr directly. This will give you an HTML preview of the final report.
Once happy with the report, you can use devtools::build_vignettes() to generate the vignette files that will be copied to the relevant locations.
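From the console, the preview and build steps might look like the sketch below. The two wrapper functions are hypothetical names introduced for illustration; `knitr::knit2html()` and `devtools::build_vignettes()` are the real calls. One caveat: for documents written for R Markdown v2, `rmarkdown::render()` is the usual call instead of `knit2html()`.

```r
# Hypothetical wrapper: HTML preview, similar in spirit to the Knit HTML button
preview_vignette <- function(rmd) {
  stopifnot(file.exists(rmd))
  knitr::knit2html(rmd, quiet = TRUE)  # writes the HTML file, returns its path
}

# Hypothetical wrapper: build all vignettes of the package in `pkg`
build_package_vignettes <- function(pkg = ".") {
  devtools::build_vignettes(pkg)
}

# Usage, from the package directory (file name is an assumption):
# preview_vignette("vignettes/ismbTweetAnalysis.Rmd")
# build_package_vignettes()
```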
Commit and Push it ALL!
At this point, if you are happy with the package and analysis as a whole, you should commit all the package files to version control and make it available. In this case this means:
- inst/doc: the output vignette
- man: the function documentation
- DESCRIPTION: our description file
- NAMESPACE: the file documenting our namespace
You can see this commit here.
Now your package can be installed by others using devtools::install_github(). You could also submit your package to Bioconductor if so desired.
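For someone installing the package, the steps could look roughly like this. `install_analysis()` is a hypothetical wrapper, and the repository string in the usage comment is a placeholder: substitute the actual GitHub user and repository.

```r
# Hypothetical wrapper around devtools::install_github()
install_analysis <- function(repo) {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    stop("devtools is required for this sketch")
  }
  devtools::install_github(repo)
}

# Usage (placeholder repository, not a real path):
# install_analysis("<github-user>/ismbTweetAnalysis")
# browseVignettes("ismbTweetAnalysis")  # open the list of installed vignettes
```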
I hope that you find this example useful, and will consider using packages more often even for simple analyses.
- Reproducibility: One issue that may come up is how to make sure that you or someone else can directly reproduce the work in your package. Again, Hadley Wickham and the rstudio team have been thinking about this, and there is now the packrat package to make a project completely self-contained with all of its dependencies.
Edit 2014-07-28 - added note on reproducibility at the end.