Using IRanges for Non-Integer Overlaps

TL;DR

The IRanges package implements interval algebra, and is very fast for finding overlaps of two ranges. If you have non-integer data, multiply values by a large constant factor and round them. The constant depends on how much accuracy you need.

IRanges??

IRanges is a bioconductor package for interval algebra of integer ranges. It is used extensively in the GenomicRanges package for finding overlaps between various genomic features. For genomic features, integers make sense, because one cannot have fractional base locations.

However, IRanges uses red-black trees as its data structure, which provide very fast searches of overlaps. This makes it very attractive for any problem that involves overlapping ranges.

Motivation

My motivation comes from mass-spectrometry data, where I want to count the number of raw data points and / or the number of peaks in a large number of M/Z windows. Large here means on the order of 1,000,000 M/Z windows.

Generating the windows is not hard, but searching the list of points / peaks for which ones are within the bounds of a window takes a really long time. Long enough that I needed some other method.

IRanges to the Rescue!

So my idea was to use IRanges. But there is a problem, IRanges is for integer ranges. How do we use this for non-integer data? Simple, multiply and round the fractional numbers to generate integers.

It turns out that multiplying our mass-spec data by 20,000 gives us differences down to the 0.00005 place, which is more than enough accuracy for the size of the windows we are interested in. If needed, IRanges can handle 1600 * 1e6, but currently will crash at 1600 * 1e7.

How Fast Is It?

Lets actually test differences in speed by counting how many overlapping points there are.

library(IRanges)
library(ggplot2)
load("../../extdata/iranges_example_data.rda")

head(mz_points)
## IRanges object with 6 ranges and 1 metadata column:
##           start       end     width |               mz
##       <integer> <integer> <integer> |        <numeric>
##   [1]   2970182   2970182         1 | 148.509100524992
##   [2]   2970183   2970183         1 | 148.509161249802
##   [3]   2970184   2970184         1 |  148.50922197465
##   [4]   2970186   2970186         1 | 148.509282699535
##   [5]   3000526   3000526         1 | 150.026298638453
##   [6]   3000527   3000527         1 | 150.026360296201
head(mz_windows)
## IRanges object with 6 ranges and 2 metadata columns:
##           start       end     width |         mz_start           mz_end
##       <integer> <integer> <integer> |        <numeric>        <numeric>
##   [1]   2960000   2960011        12 |              148 148.000554529159
##   [2]   2960001   2960012        12 | 148.000055452916 148.000609982075
##   [3]   2960002   2960013        12 | 148.000110905832 148.000665434991
##   [4]   2960003   2960014        12 | 148.000166358748 148.000720887907
##   [5]   2960004   2960016        13 | 148.000221811664 148.000776340823
##   [6]   2960006   2960017        12 |  148.00027726458 148.000831793739

I have some example data with 3447542 windows, and 991816 points. We will count how many point there are in each window using the below functions, with differing number of windows.

Functions

count_overlaps_naive <- function(mz_start, mz_end, points){
  sum((points >= mz_start) & (points <= mz_end))
}

iterate_windows <- function(windows, points){
  purrr::pmap_int(windows, count_overlaps_naive, points)
}

run_times_iterating <- function(windows, points){
  t <- Sys.time()
  window_counts <- iterate_windows(windows, points)
  t2 <- Sys.time()
  run_time <- difftime(t2, t, units = "secs")
  run_time
}

run_times_countoverlaps <- function(windows, points){
  t <- Sys.time()
  window_counts <- countOverlaps(points, windows)
  t2 <- Sys.time()
  run_time <- difftime(t2, t, units = "secs")
  run_time
}

Define Samples of Different Sizes

set.seed(1234)

sample_sizes <- c(10, 100, 1000, 10000, 50000, 100000)

window_samples <- purrr::map(sample_sizes, function(x){sample(length(mz_windows), size = x)})

Run It

iranges_times <- purrr::map_dbl(window_samples, function(x){
  run_times_countoverlaps(mz_windows[x], mz_points)
})

window_frame <- as.data.frame(mcols(mz_windows))

naive_times <- purrr::map_dbl(window_samples, function(x){
  run_times_iterating(window_frame[x, ], mz_points)
})

Plot Them

all_times <- data.frame(size = rep(sample_sizes, 2),
                        time = c(iranges_times, naive_times),
                        method = rep(c("iranges", "naive"), each = 6))

p <- ggplot(all_times, aes(x = log10(size), y = time, color = method)) + geom_point() + geom_line() + labs(y = "time (s)", x = "log10(# of windows)", title = "Naive & IRanges Timings") + theme(legend.position = c(0.2, 0.8))
p

p + ylim(c(0, 1))

As the two figures show, the naive solution, while a little faster under 1000 regions, is quickly outperformed by IRanges, whose time increases much more slowly.

Related