R

Don't do PCA After Statistical Testing!

TL;DR If you do a statistical test before a dimensional reduction method like PCA, the highest source of variance is likely to be whatever you tested statistically. Wait, Why?? Let me describe the situation. You’ve done an -omics level analysis on your system of interest. You run a t-test (or ANOVA, etc) on each of the features in your data (gene, protein, metabolite, etc). Filter down to those things that were statistically significant, and then finally, you decide to look at the data using a dimensionality reduction method such as principal components analysis (PCA) so you can see what is going on.

Finding Modes Using Kernel Density Estimates

TL; DR If you have a unimodal distribution of values, you can use R’s density or Scipy’s gaussian_kde to create density estimates of the data, and then take the maxima of the density estimate to get the mode. See below for actual examples in R and Python. Mode in R First, lets do this in R. Need some values to work with. library(ggplot2) set.seed(1234) n_point <- 1000 data_df <- data.

Split - Unsplit Anti-Pattern

TL;DR If you notice yourself using split -> unsplit / rbind on two object to match items up, maybe you should be using dplyr::join_ instead. Read below for concrete examples. Motivation I have had a lot of calculations lately that involve some sort of normalization or scaling a group of related values, each group by a different factor. Lets setup an example where we will have 1e5 values in 10 groups, each group of values being normalized by their own value.

Using IRanges for Non-Integer Overlaps

TL;DR The IRanges package implements interval algebra, and is very fast for finding overlaps of two ranges. If you have non-integer data, multiply values by a large constant factor and round them. The constant depends on how much accuracy you need. IRanges?? IRanges is a bioconductor package for interval algebra of integer ranges. It is used extensively in the GenomicRanges package for finding overlaps between various genomic features. For genomic features, integers make sense, because one cannot have fractional base locations.

knitrProgressBar Package

TL;DR If you like dplyr progress bars, and wished you could use them everywhere, including from within Rmd documents, non-interactive shells, etc, then you should check out knitrProgressBar (cran github). Why Yet Another Progress Bar?? I didn’t set out to create another progress bar package. But I really liked dplyrs style of progress bar, and how they worked under the hood (thanks to the examples from Bob Rudis). As I used them, I noticed that no progress was displayed if you did rmarkdown::render() or knitr::knit().

Licensing R Packages that Include Others Code

TL;DR If you include others code in your own R package, list them as contributors with comments about what they contributed, and add a license statement in the file that includes their code. Motivation I recently created the knitrProgressBar package. It is a really simple package, that takes the dplyr progress bars and makes it possible for them to write progress to a supplied file connection. The dplyr package itself is licensed under MIT, so I felt fine taking the code directly from dplyr itself.

docopt & Numeric Options

TL;DR If you use the docopt package to create command line R executables that take options, there is something to know about numeric command line options: they should have as.double before using them in your script. Setup Lets set up a new docopt string, that includes both string and numeric arguments. " Usage: test_numeric.R [–string=<string_value>] [–numeric=<numeric_value>] test_numeric.R (-h | –help) test_numeric.R Description: Testing how values are passed using docopt. Options: –string=<string_value> A string value [default: Hi!

Custom Deployment Script

TL;DR Use a short bash script to do deployment from your own computer directly to your *.github.io domain. Why? So Yihui recommends using Netlify, or even Travis-CI in the Blogdown book. I wasn’t willing to setup a custom domain yet, and some of my posts involve a lot of personally created packages, etc, that I don’t want to debug installation on Travis. So, I wanted a simple script I could call on my laptop that would copy the /public directory to the repo for my github.

Linking to Manually Inserted Images in Blogdown / Hugo

Manual Linking? Using blogdown for generating websites and blog-posts from Rmarkdown files with lots of inserted code and figures seems pretty awesome, but sometimes you want to include a figure manually, either because you want to generate something manually and convert it (say for going from SVG of lots of points to hi-res PNG), or because it is a figure from something else (like this figure from wikipedia). Where to?

Travis-CI to GitHub Pages

I don’t remember how I got on this, but I believe I had a recent twitter exchange with some persons (or saw it fly by) about pushing R package vignettes to the web after building and checking on travis-ci. Hadley Wickham pointed to using such a scheme to push the web version of his book after each update and the S3 deploy hooks on travis-ci. Deploying your html content to S3 is great, but given the availability of the gh-pages branch on GitHub, I thought it would be neat to work out how to deploy the html output from an R package vignette to the gh-pages branch on GitHub.