TL;DR
The IRanges
package implements interval algebra, and is very fast for finding overlaps of two ranges. If you have non-integer data, multiply values by a large constant factor and round them. The constant depends on how much accuracy you need.
IRanges??
IRanges
is a bioconductor package for interval algebra of integer ranges. It is used extensively in the GenomicRanges
package for finding overlaps between various genomic features. For genomic features, integers make sense, because one cannot have fractional base locations.
However, IRanges
uses red-black trees as its data structure, which provide very fast searches of overlaps. This makes it very attractive for any problem that involves overlapping ranges.
Motivation
My motivation comes from mass-spectrometry data, where I want to count the number of raw data points and / or the number of peaks in a large number of M/Z windows. Large here means on the order of 1,000,000 M/Z windows.
Generating the windows is not hard, but searching the list of points / peaks for which ones are within the bounds of a window takes a really long time. Long enough that I needed some other method.
IRanges to the Rescue!
So my idea was to use IRanges
. But there is a problem, IRanges
is for integer ranges. How do we use this for non-integer data? Simple, multiply and round the fractional numbers to generate integers.
It turns out that multiplying our mass-spec data by 20,000
gives us differences down to the 0.00005
place, which is more than enough accuracy for the size of the windows we are interested in. If needed, IRanges
can handle 1600 * 1e6
, but currently will crash at 1600 * 1e7
.
How Fast Is It?
Lets actually test differences in speed by counting how many overlapping points there are.
library(IRanges)
library(ggplot2)
load("../../extdata/iranges_example_data.rda")
head(mz_points)
## IRanges object with 6 ranges and 1 metadata column:
## start end width | mz
## <integer> <integer> <integer> | <numeric>
## [1] 2970182 2970182 1 | 148.509100524992
## [2] 2970183 2970183 1 | 148.509161249802
## [3] 2970184 2970184 1 | 148.50922197465
## [4] 2970186 2970186 1 | 148.509282699535
## [5] 3000526 3000526 1 | 150.026298638453
## [6] 3000527 3000527 1 | 150.026360296201
head(mz_windows)
## IRanges object with 6 ranges and 2 metadata columns:
## start end width | mz_start mz_end
## <integer> <integer> <integer> | <numeric> <numeric>
## [1] 2960000 2960011 12 | 148 148.000554529159
## [2] 2960001 2960012 12 | 148.000055452916 148.000609982075
## [3] 2960002 2960013 12 | 148.000110905832 148.000665434991
## [4] 2960003 2960014 12 | 148.000166358748 148.000720887907
## [5] 2960004 2960016 13 | 148.000221811664 148.000776340823
## [6] 2960006 2960017 12 | 148.00027726458 148.000831793739
I have some example data with 3447542 windows, and 991816 points. We will count how many point there are in each window using the below functions, with differing number of windows.
Functions
count_overlaps_naive <- function(mz_start, mz_end, points){
sum((points >= mz_start) & (points <= mz_end))
}
iterate_windows <- function(windows, points){
purrr::pmap_int(windows, count_overlaps_naive, points)
}
run_times_iterating <- function(windows, points){
t <- Sys.time()
window_counts <- iterate_windows(windows, points)
t2 <- Sys.time()
run_time <- difftime(t2, t, units = "secs")
run_time
}
run_times_countoverlaps <- function(windows, points){
t <- Sys.time()
window_counts <- countOverlaps(points, windows)
t2 <- Sys.time()
run_time <- difftime(t2, t, units = "secs")
run_time
}
Define Samples of Different Sizes
set.seed(1234)
sample_sizes <- c(10, 100, 1000, 10000, 50000, 100000)
window_samples <- purrr::map(sample_sizes, function(x){sample(length(mz_windows), size = x)})
Run It
iranges_times <- purrr::map_dbl(window_samples, function(x){
run_times_countoverlaps(mz_windows[x], mz_points)
})
window_frame <- as.data.frame(mcols(mz_windows))
naive_times <- purrr::map_dbl(window_samples, function(x){
run_times_iterating(window_frame[x, ], mz_points)
})
Plot Them
all_times <- data.frame(size = rep(sample_sizes, 2),
time = c(iranges_times, naive_times),
method = rep(c("iranges", "naive"), each = 6))
p <- ggplot(all_times, aes(x = log10(size), y = time, color = method)) + geom_point() + geom_line() + labs(y = "time (s)", x = "log10(# of windows)", title = "Naive & IRanges Timings") + theme(legend.position = c(0.2, 0.8))
p
p + ylim(c(0, 1))
As the two figures show, the naive solution, while a little faster under 1000 regions, is quickly outperformed by IRanges
, whose time increases much more slowly.