Following from my last post, I am going to go step by step through the process I use to generate an analysis as a package vignette. This will be an analysis of the tweets from the 2012 and 2014 ISMB conferences (thanks to Neil and Stephen for compiling the data).
I will link to individual commits so that you can see how things change as we go along.
Setup
Initialization
To start, we will initialize the package. devtools or RStudio makes this rather easy:
library(devtools)
create("~/Documents/projects/personal/ismbTweetAnalysis")
Creating package ismbTweetAnalysis in ~/Documents/projects/personal
No DESCRIPTION found. Creating with values:
Package: ismbTweetAnalysis
Title: What the package does (short line)
Version: 0.1
Authors@R: "First Last <first.last@example.com> [aut, cre]"
Description: What the package does (paragraph)
Depends: R (>= 3.0.3)
License: What license is it under?
LazyData: true
Adding Rstudio project file to ismbTweetAnalysis
Alternatively, you can use File > New Project > New Directory > R Package in RStudio. Don’t forget to check Create a git repository (or run git init in the directory). Note that the package created by devtools will pass CRAN tests, whereas the one created by RStudio will not.
Open the DESCRIPTION file. You will need to change the Title, Authors or Authors@R, Description, and License fields, and add VignetteBuilder: knitr at the end. Here is what my initial setup looks like.
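As a rough sketch of those DESCRIPTION changes, the fields can also be edited from R with read.dcf() and write.dcf(). The title, description, and license values below are placeholders I made up for illustration, not the ones used in the actual package, and the file is written to a temporary directory rather than a real package.

```r
# Work on a throwaway copy so the sketch is safe to run anywhere.
desc_file <- file.path(tempdir(), "DESCRIPTION")

# Start from the skeleton that create() generated.
writeLines(c(
  "Package: ismbTweetAnalysis",
  "Title: What the package does (short line)",
  "Version: 0.1",
  "Authors@R: \"First Last <first.last@example.com> [aut, cre]\"",
  "Description: What the package does (paragraph)",
  "Depends: R (>= 3.0.3)",
  "License: What license is it under?",
  "LazyData: true"
), desc_file)

# read.dcf() returns a one-row character matrix, one column per field.
desc <- read.dcf(desc_file)

# Hypothetical values: replace with your own.
desc[1, "Title"] <- "Analysis of ISMB Conference Tweets"
desc[1, "Description"] <- "Analysis of tweets from the 2012 and 2014 ISMB conferences."
desc[1, "License"] <- "MIT + file LICENSE"

# Add the vignette builder field at the end.
desc <- cbind(desc, VignetteBuilder = "knitr")

write.dcf(desc, file = desc_file)
cat(readLines(desc_file), sep = "\n")
```

Editing the file by hand works just as well; the point is only which fields change and that VignetteBuilder is added.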
RStudio Project Options
In addition, to make our life easier, we will change some options in the RStudio project. Under Tools > Project Options > Build Tools, check Generate documentation with Roxygen, and turn on all the options. We especially want to roxygenize when we Build & Reload, and have roxygen control the NAMESPACE file so we don’t have to worry about it.
Alternatively, you can use document with reload=TRUE in devtools to update documentation and reload the functions.
Having this particular option of documenting and reloading the package every time I write a new function is what makes this easy. I write the new function, document/reload, and I can keep chugging along with my analysis document. And if I have to restart, I just run all chunks to get back to where I need to be.
Data
Now we need some data. Neil’s data from 2012 is available as CSV; however, the tweets themselves contain commas, so we will download the rdata file and use that instead. We will also use Stephen’s data from 2014, which comes as three separate files, so we will download all three and combine them. Both initial data sets will go in the /inst/extdata folder, and we will clean them up.
Here we have added our 4 data files.
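To give an idea of the "combine three files" step, here is a minimal sketch. It does not use the real archive: the three small CSV files, their names, and the id, screenName, and text columns are all stand-ins I invented, and the de-duplication by id is an assumption about how overlapping archives might be handled.

```r
# Stand-in directory for inst/extdata, kept in tempdir() for safety.
archive_dir <- file.path(tempdir(), "extdata")
dir.create(archive_dir, showWarnings = FALSE)

# Three toy archive pieces (hypothetical columns); tweet "2" appears twice.
parts <- list(
  data.frame(id = c("1", "2"), screenName = c("alice", "bob"),
             text = c("Hello ISMB", "Keynote starting"),
             stringsAsFactors = FALSE),
  data.frame(id = c("2", "3"), screenName = c("bob", "carol"),
             text = c("Keynote starting", "Poster session"),
             stringsAsFactors = FALSE),
  data.frame(id = "4", screenName = "alice",
             text = "See you next year",
             stringsAsFactors = FALSE)
)
for (i in seq_along(parts)) {
  write.csv(parts[[i]],
            file.path(archive_dir, sprintf("ismb2014_part%d.csv", i)),
            row.names = FALSE)
}

# Read every piece, stack them, and drop tweets seen more than once.
files <- list.files(archive_dir, pattern = "^ismb2014_part[0-9]+\\.csv$",
                    full.names = TRUE)
pieces <- lapply(files, read.csv, stringsAsFactors = FALSE,
                 colClasses = "character")
ismb2014 <- do.call(rbind, pieces)
ismb2014 <- ismb2014[!duplicated(ismb2014$id), ]

# Save a single combined object for later use.
save(ismb2014, file = file.path(archive_dir, "ismb2014.RData"))
```

Reading the ids as character avoids losing precision on long tweet ids, which is a common reason to set colClasses explicitly.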
Vignette
We are going to write this analysis as the vignette of the package, using R markdown as the language. To do that we need to create the file and add some boilerplate at the top so that the vignette gets generated properly. Here is the initial vignette; it is nothing but the engine and index definitions, which are important.
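For reference, a vignette skeleton along these lines could be created from R as below. This is a sketch only: the title and file name are hypothetical, the file is written under tempdir() instead of the package's vignettes directory, and the header uses the knitr::rmarkdown engine with a YAML block, which may differ from the exact form in the linked commit.

```r
# The two \Vignette... lines are what R uses to find the engine and the
# index entry; everything else here is a hypothetical placeholder.
header <- c(
  "---",
  "title: \"ISMB Tweet Analysis\"",
  "output: rmarkdown::html_vignette",
  "vignette: >",
  "  %\\VignetteIndexEntry{ISMB Tweet Analysis}",
  "  %\\VignetteEngine{knitr::rmarkdown}",
  "---",
  "",
  "Analysis prose and code chunks go here."
)

# In a real package this would be the vignettes/ directory.
vignette_dir <- file.path(tempdir(), "vignettes")
dir.create(vignette_dir, showWarnings = FALSE)
vignette_file <- file.path(vignette_dir, "ismbTweetAnalysis.Rmd")
writeLines(header, vignette_file)
```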
Start the Analysis!
At this point we can start the analysis. The actual analysis will be done in the Rmd vignette file. The basic process is to add prose describing the analysis, with actual code to generate results and figures embedded in the Rmd, and to add functions and documentation (as roxygen tags) in the .R file, while doing iterations of document or Build/Reload along the way. Running document or Build/Reload after writing new functions in the .R file makes them available in our workspace, with tab-completion in RStudio.
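To make that loop concrete, here is a minimal sketch of what one such function in the .R file might look like, with its roxygen tags. The name count_by_user and the toy data are my own illustrations; the only thing taken from the data is the screenName column, and the real functions in the package may be written differently.

```r
#' Count tweets per user
#'
#' @param tweets A data frame of tweets with a \code{screenName} column.
#' @return A data frame with columns \code{screenName} and \code{n},
#'   sorted from most to fewest tweets.
#' @export
count_by_user <- function(tweets) {
  counts <- table(tweets$screenName)
  out <- data.frame(
    screenName = names(counts),
    n = as.integer(counts),
    stringsAsFactors = FALSE
  )
  out[order(out$n, decreasing = TRUE), , drop = FALSE]
}

# Toy data standing in for a real tweet archive.
toy <- data.frame(
  screenName = c("alice", "bob", "alice", "carol", "alice", "bob"),
  stringsAsFactors = FALSE
)
top <- count_by_user(toy)
top
```

After a document or Build/Reload step, a function like this would be exported through the generated NAMESPACE and callable from a chunk in the Rmd.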
The following are bullet point summaries of points when I committed or built/reloaded, with links to the commit so you can see what has changed in the package.
- Adding description of data sources to analysis
- Munging 2012 data a little, saving, and documenting
  - Now we can load up this data with data(ismb2012)
- Function written and exported for reading ST’s data files
- Read in, combine, and re-save ST’s archive
  - Note that this and the previous chunk have eval=FALSE, so they are not run in the analysis, but they were run interactively while I was doing the work.
- Simple histogram of 2012 tweets by day
- Making a counting function by screenName
- Examining the top tweeters using previous function
- Fixing data files, because of issues with having the same named object in different RData files
- Examining top tweeters in 2014
- Density of tweets with respect to starting time of conference
- Counting how often a specific tweet was retweeted
- Getting raw tweet ranks for each individual
- Examining the ranks by total retweeted tweets per user
At this point I’m going to stop. Now we have an analysis (that we will make into a nice output shortly), and we have munged 2 data sets and written 6 functions that may be useful in other contexts.
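As one illustration of the kind of small helper logic listed above, here is a rough sketch of counting retweets per original author. It assumes retweets are marked with a leading "RT @user:" prefix in the tweet text, which is a common convention but not necessarily how the package's functions detect them; the data is made up.

```r
# Toy tweets; in the real analysis these would come from the saved data sets.
tweets <- data.frame(
  screenName = c("alice", "bob", "carol", "bob", "dave"),
  text = c(
    "Great keynote at #ISMB",
    "RT @alice: Great keynote at #ISMB",
    "RT @alice: Great keynote at #ISMB",
    "Poster session now",
    "RT @bob: Poster session now"
  ),
  stringsAsFactors = FALSE
)

# Assumption: a retweet starts with "RT @<user>:".
rt_pattern <- "^RT @([A-Za-z0-9_]+):.*$"
is_rt <- grepl(rt_pattern, tweets$text)

# Pull out the user being retweeted and tabulate.
original_author <- sub(rt_pattern, "\\1", tweets$text[is_rt])
rt_counts <- sort(table(original_author), decreasing = TRUE)
rt_counts
```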
Preview Report
To preview the report, you can use the Knit HTML button in RStudio, or use knitr directly. This will give you an HTML preview of the final report.
Generate Vignette
Once happy with the report, you can use devtools::build_vignettes() to generate the vignette files that will be copied to the relevant locations.
Commit and Push it ALL!
At this point, if you are happy with the package and analysis as a whole, you should commit all the package files to version control and make it available. In this case this means:
- inst/doc: the output vignette
- man: the function documentation
- DESCRIPTION: our description file
- NAMESPACE: the file documenting our namespace
You can see this commit here.
Now your package can be installed by others using devtools::install_github(). You could also submit your package to CRAN or Bioconductor if so desired.
Not Covered
Now this was a simple example. Ideally I should have included tests for my functions; you can read up (1, 2) on how to do that. In addition, none of my functions use methods (see why they are useful).
I hope that you find this example useful, and will consider using packages more often even for simple analyses.
- Reproducibility: One issue that may come up is how to make sure that you or someone else can directly reproduce the work in your package. Again, Hadley Wickham and the RStudio team have been thinking about this, and there is now the packrat package to make a project completely self-contained with all of its dependencies.
Edit 2014-07-28 - added note on reproducibility at the end.