Using IRanges for Non-Integer Overlaps

I wanted to make use of IRanges awesome interval logic, but for non-integer data.

Robert M Flight
2018-06-23

TL;DR

The IRanges package implements interval algebra, and is very fast for finding overlaps of two ranges. If you have non-integer data, multiply values by a large constant factor and round them. The constant depends on how much accuracy you need.

IRanges??

IRanges is a bioconductor package for interval algebra of integer ranges. It is used extensively in the GenomicRanges package for finding overlaps between various genomic features. For genomic features, integers make sense, because one cannot have fractional base locations.

However, IRanges uses red-black trees as its data structure, which provide very fast searches of overlaps. This makes it very attractive for any problem that involves overlapping ranges.

Motivation

My motivation comes from mass-spectrometry data, where I want to count the number of raw data points and / or the number of peaks in a large number of M/Z windows. Large here means on the order of 1,000,000 M/Z windows.

Generating the windows is not hard, but searching the list of points / peaks for which ones are within the bounds of a window takes a really long time. Long enough that I needed some other method.

IRanges to the Rescue!

So my idea was to use IRanges. But there is a problem, IRanges is for integer ranges. How do we use this for non-integer data? Simple, multiply and round the fractional numbers to generate integers.

It turns out that multiplying our mass-spec data by 20,000 gives us differences down to the 0.00005 place, which is more than enough accuracy for the size of the windows we are interested in. If needed, IRanges can handle 1600 * 1e6, but currently will crash at 1600 * 1e7.

How Fast Is It?

Lets actually test differences in speed by counting how many overlapping points there are.

library(IRanges)
library(ggplot2)

IRanges object with 6 ranges and 1 metadata column:
start       end     width |        mz
<integer> <integer> <integer> | <numeric>
   2970182   2970182         1 |   148.509
   2970183   2970183         1 |   148.509
   2970184   2970184         1 |   148.509
   2970186   2970186         1 |   148.509
   3000526   3000526         1 |   150.026
   3000527   3000527         1 |   150.026
IRanges object with 6 ranges and 2 metadata columns:
start       end     width |  mz_start    mz_end
<integer> <integer> <integer> | <numeric> <numeric>
   2960000   2960011        12 |       148   148.001
   2960001   2960012        12 |       148   148.001
   2960002   2960013        12 |       148   148.001
   2960003   2960014        12 |       148   148.001
   2960004   2960016        13 |       148   148.001
   2960006   2960017        12 |       148   148.001

I have some example data with 3447542 windows, and 991816 points. We will count how many point there are in each window using the below functions, with differing number of windows.

Functions

count_overlaps_naive <- function(mz_start, mz_end, points){
sum((points >= mz_start) & (points <= mz_end))
}

iterate_windows <- function(windows, points){
purrr::pmap_int(windows, count_overlaps_naive, points)
}

run_times_iterating <- function(windows, points){
t <- Sys.time()
window_counts <- iterate_windows(windows, points)
t2 <- Sys.time()
run_time <- difftime(t2, t, units = "secs")
run_time
}

run_times_countoverlaps <- function(windows, points){
t <- Sys.time()
window_counts <- countOverlaps(points, windows)
t2 <- Sys.time()
run_time <- difftime(t2, t, units = "secs")
run_time
}

Define Samples of Different Sizes

set.seed(1234)

sample_sizes <- c(10, 100, 1000, 10000, 50000, 100000)

window_samples <- purrr::map(sample_sizes, function(x){sample(length(mz_windows), size = x)})

Run It

iranges_times <- purrr::map_dbl(window_samples, function(x){
run_times_countoverlaps(mz_windows[x], mz_points)
})

window_frame <- as.data.frame(mcols(mz_windows))

naive_times <- purrr::map_dbl(window_samples, function(x){
run_times_iterating(window_frame[x, ], mz_points)
})

Plot Them

all_times <- data.frame(size = rep(sample_sizes, 2),
time = c(iranges_times, naive_times),
method = rep(c("iranges", "naive"), each = 6))

p <- ggplot(all_times, aes(x = log10(size), y = time, color = method)) + geom_point() + geom_line() + labs(y = "time (s)", x = "log10(# of windows)", title = "Naive & IRanges Timings") + theme(legend.position = c(0.2, 0.8))
p p + ylim(c(0, 1)) As the two figures show, the naive solution, while a little faster under 1000 regions, is quickly outperformed by IRanges, whose time increases much more slowly.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/rmflight/researchBlog_distill, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".