Posts | Deciphering Life: One Bit at a Time

Using group_by Instead of Splits

2020-02-25 2 min read

TL;DR: It is relatively easy to use dplyr::group_by and summarise to find items that you might want to keep or remove based on a part of the item or group in question. I used to use split and iterate, but group_by is much easier. Motivation: I have some relatively large sets of data that fall naturally into groups of items. Often, I find that I want to remove a group that contains either any or all of a particular set of items.
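As a rough illustration of the idea in this excerpt (summarise each group once, then filter on the summary), here is a minimal Python sketch using only the standard library. It is an analogue of the dplyr approach, not the post's own R code; the records, the group names, and the set of unwanted items are all made up for the example.

```python
from collections import defaultdict

# Hypothetical example data: (group, item) pairs; names are illustrative only.
records = [
    ("g1", "a"), ("g1", "b"),
    ("g2", "b"), ("g2", "c"),
    ("g3", "c"), ("g3", "d"),
]
unwanted = {"a", "b"}

# Group once (the analogue of group_by) instead of splitting and iterating.
items_by_group = defaultdict(set)
for group, item in records:
    items_by_group[group].add(item)

# Summarise each group: does it contain any, or all, of the unwanted items?
summary = {
    group: {
        "has_any": bool(items & unwanted),
        "has_all": unwanted <= items,
    }
    for group, items in items_by_group.items()
}

# Keep only the records whose group contains none of the unwanted items.
kept = [(g, i) for g, i in records if not summary[g]["has_any"]]
```

Swapping `has_any` for `has_all` in the last line would instead drop only the groups that contain every unwanted item.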

Narrower PDF Kable Tables

2019-11-06 3 min read

TL;DR: Don’t bother trying to roll your own function to make narrower kable tables in a PDF document; just use kableExtra. Motivation: I’ve been creating tables in a report where I really needed the table to fit, and because I am using PDF output, that means the tables can’t be any wider than the page. As I’m sure many readers are aware, kable tables will gladly overrun the side of the page if they are too wide.

Introducing Scientific Programming

2019-10-23 5 min read

TL;DR: We should get science undergraduate students programming by introducing R and Python in all first-year science labs, and continuing throughout the undergraduate classes. Why? I’ve previously encountered ideas around getting graduate students programming, because to do the analyses that modern science requires, you need to be able to do at least some basic scripting, either in a language like Python or R or on the command line.

Comments enabled via utterances

2019-10-16 1 min read

TL;DR: Utterances is a lightweight commenting platform built on GitHub issues. You need a GitHub account to comment, but I expect most people who comment on this blog already have one. Why Utterances: When I switched to blogdown, I lost my Disqus comments. I had considered migrating them over, but never got around to it. I also thought that there had to be a way to link GitHub issues to blog posts, but didn’t investigate it much.

Comparisons using for loops vs split

2019-02-13 4 min read

TL;DR: Sometimes for loops are useful, and sometimes they shouldn’t really be used, because they don’t really help you understand your data, and even if you try, they might still be slow(er) than other ways of doing things. Comparing Groups: I have some code where I am trying to determine duplicates of a group of things. This data looks something like this: create_random_sets = function(n_sets = 1000){ set.seed(1234) sets = purrr::map(seq(5, n_sets), ~ sample(seq(1, .

Nicer PNG Graphics

2018-12-06 4 min read

TL;DR: If you are getting crappy-looking PNG images from rmarkdown HTML or Word documents, try using type='cairo' or dev='CairoPNG' in your chunk options. PNG Graphics?? So, I write a lot of reports using rmarkdown and knitr, and have been using knitr for quite a while. My job involves doing analyses for collaborators and communicating results. Most of the time, I will generate a PDF report, and I get beautiful graphics, thanks to the eps graphics device.

Don't do PCA After Statistical Testing!

2018-09-14 3 min read

TL;DR: If you do a statistical test before a dimensionality reduction method like PCA, the highest source of variance is likely to be whatever you tested statistically. Wait, Why?? Let me describe the situation. You’ve done an -omics level analysis on your system of interest. You run a t-test (or ANOVA, etc.) on each of the features in your data (gene, protein, metabolite, etc.). You filter down to the features that were statistically significant, and then finally, you decide to look at the data using a dimensionality reduction method such as principal components analysis (PCA) so you can see what is going on.

Finding Modes Using Kernel Density Estimates

2018-07-19 2 min read

TL;DR: If you have a unimodal distribution of values, you can use R’s density or SciPy’s gaussian_kde to create density estimates of the data, and then take the maximum of the density estimate to get the mode. See below for actual examples in R and Python. Mode in R: First, let’s do this in R. We need some values to work with. library(ggplot2) set.seed(1234) n_point <- 1000 data_df <- data.
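To make the idea concrete, here is a small Python sketch that takes the maximum of a Gaussian kernel density estimate as the mode. It deliberately uses a hand-rolled kernel sum from the standard library rather than SciPy’s gaussian_kde, so it is not the post’s own code. The function name, the grid size, the rule-of-thumb bandwidth, and the simulated sample are all assumptions made for illustration.

```python
import math
import random


def gaussian_kde_mode(values, grid_size=512):
    """Estimate the mode as the location of the maximum of a Gaussian KDE."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    # Scott's rule-of-thumb bandwidth (an assumption; other rules exist).
    bandwidth = sd * n ** (-1 / 5)

    # Evaluate the density on an evenly spaced grid spanning the data.
    lo, hi = min(values), max(values)
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]

    def density(x):
        # Unnormalised kernel sum; the constant factor does not move the maximum.
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return max(grid, key=density)


# Hypothetical unimodal sample centred on 5.
random.seed(1234)
sample = [random.gauss(5.0, 1.0) for _ in range(1000)]
mode_estimate = gaussian_kde_mode(sample)
```

The estimate is only as fine as the grid, and it is only meaningful for unimodal data, as the excerpt notes.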

Split - Unsplit Anti-Pattern

2018-07-17 3 min read

TL;DR: If you notice yourself using split -> unsplit / rbind on two objects to match items up, maybe you should be using a dplyr join (such as dplyr::left_join) instead. Read below for concrete examples. Motivation: I have had a lot of calculations lately that involve some sort of normalization or scaling of a group of related values, each group by a different factor. Let’s set up an example where we will have 1e5 values in 10 groups, each group of values being normalized by its own value.
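The join idea can be sketched in a few lines of Python: rather than splitting the values by group, scaling each piece, and stitching the pieces back together, each row looks up its group’s factor by key. This is a standard-library analogue of a join, not the post’s dplyr code, and the tiny data set below is hypothetical.

```python
# Hypothetical data: values tagged with a group, plus one factor per group.
values = [("a", 10.0), ("a", 20.0), ("b", 3.0), ("b", 9.0)]
factors = {"a": 10.0, "b": 3.0}

# Join-style lookup: match every row to its group's factor by key,
# so the original row order is preserved and nothing needs to be unsplit.
normalized = [(group, value / factors[group]) for group, value in values]
```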

Using IRanges for Non-Integer Overlaps

2018-06-23 4 min read

TL;DR: The IRanges package implements interval algebra, and is very fast for finding overlaps of two ranges. If you have non-integer data, multiply values by a large constant factor and round them. The constant depends on how much accuracy you need. IRanges?? IRanges is a Bioconductor package for interval algebra of integer ranges. It is used extensively in the GenomicRanges package for finding overlaps between various genomic features. For genomic features, integers make sense, because one cannot have fractional base locations.
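The scale-and-round trick can be sketched in plain Python. This is not IRanges and makes no attempt at its speed; it only shows how non-integer endpoints become integers before an overlap test. The helper names, the scale of 10,000 (four decimal places kept), and the example ranges are all assumptions made for illustration.

```python
def to_int_range(start, end, scale=10_000):
    """Scale a floating-point interval and round its endpoints to integers."""
    return round(start * scale), round(end * scale)


def overlaps(a, b):
    """Closed integer intervals overlap when each starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical non-integer ranges; the values are made up.
query = [(100.0012, 100.0020), (250.5000, 250.5004)]
subject = [(100.0018, 100.0030), (250.5005, 250.5010)]

query_int = [to_int_range(*r) for r in query]
subject_int = [to_int_range(*r) for r in subject]

# Naive all-pairs comparison, reported as (query index, subject index).
hits = [
    (qi, si)
    for qi, q in enumerate(query_int)
    for si, s in enumerate(subject_int)
    if overlaps(q, s)
]
```

A larger scale keeps more decimal places, at the cost of larger integers, which is the accuracy trade-off the excerpt mentions.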