I wanted to make use of IRanges awesome interval logic, but for non-integer data.
The IRanges
package implements interval algebra, and is very fast for finding overlaps of two ranges. If you have non-integer data, multiply values by a large constant factor and round them. The constant depends on how much accuracy you need.
IRanges
is a bioconductor package for interval algebra of integer ranges. It is used extensively in the GenomicRanges
package for finding overlaps between various genomic features. For genomic features, integers make sense, because one cannot have fractional base locations.
However, IRanges
uses red-black trees as its data structure, which provide very fast searches of overlaps. This makes it very attractive for any problem that involves overlapping ranges.
My motivation comes from mass-spectrometry data, where I want to count the number of raw data points and / or the number of peaks in a large number of M/Z windows. Large here means on the order of 1,000,000 M/Z windows.
Generating the windows is not hard, but searching the list of points / peaks for which ones are within the bounds of a window takes a really long time. Long enough that I needed some other method.
So my idea was to use IRanges
. But there is a problem, IRanges
is for integer ranges. How do we use this for non-integer data? Simple, multiply and round the fractional numbers to generate integers.
It turns out that multiplying our mass-spec data by 20,000
gives us differences down to the 0.00005
place, which is more than enough accuracy for the size of the windows we are interested in. If needed, IRanges
can handle 1600 * 1e6
, but currently will crash at 1600 * 1e7
.
Lets actually test differences in speed by counting how many overlapping points there are.
IRanges object with 6 ranges and 1 metadata column:
start end width | mz
<integer> <integer> <integer> | <numeric>
[1] 2970182 2970182 1 | 148.509
[2] 2970183 2970183 1 | 148.509
[3] 2970184 2970184 1 | 148.509
[4] 2970186 2970186 1 | 148.509
[5] 3000526 3000526 1 | 150.026
[6] 3000527 3000527 1 | 150.026
head(mz_windows)
IRanges object with 6 ranges and 2 metadata columns:
start end width | mz_start mz_end
<integer> <integer> <integer> | <numeric> <numeric>
[1] 2960000 2960011 12 | 148 148.001
[2] 2960001 2960012 12 | 148 148.001
[3] 2960002 2960013 12 | 148 148.001
[4] 2960003 2960014 12 | 148 148.001
[5] 2960004 2960016 13 | 148 148.001
[6] 2960006 2960017 12 | 148 148.001
I have some example data with 3447542 windows, and 991816 points. We will count how many point there are in each window using the below functions, with differing number of windows.
count_overlaps_naive <- function(mz_start, mz_end, points){
sum((points >= mz_start) & (points <= mz_end))
}
iterate_windows <- function(windows, points){
purrr::pmap_int(windows, count_overlaps_naive, points)
}
run_times_iterating <- function(windows, points){
t <- Sys.time()
window_counts <- iterate_windows(windows, points)
t2 <- Sys.time()
run_time <- difftime(t2, t, units = "secs")
run_time
}
run_times_countoverlaps <- function(windows, points){
t <- Sys.time()
window_counts <- countOverlaps(points, windows)
t2 <- Sys.time()
run_time <- difftime(t2, t, units = "secs")
run_time
}
iranges_times <- purrr::map_dbl(window_samples, function(x){
run_times_countoverlaps(mz_windows[x], mz_points)
})
window_frame <- as.data.frame(mcols(mz_windows))
naive_times <- purrr::map_dbl(window_samples, function(x){
run_times_iterating(window_frame[x, ], mz_points)
})
all_times <- data.frame(size = rep(sample_sizes, 2),
time = c(iranges_times, naive_times),
method = rep(c("iranges", "naive"), each = 6))
p <- ggplot(all_times, aes(x = log10(size), y = time, color = method)) + geom_point() + geom_line() + labs(y = "time (s)", x = "log10(# of windows)", title = "Naive & IRanges Timings") + theme(legend.position = c(0.2, 0.8))
p
As the two figures show, the naive solution, while a little faster under 1000 regions, is quickly outperformed by IRanges
, whose time increases much more slowly.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/rmflight/researchBlog_distill, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Flight (2018, June 23). Deciphering Life: One Bit at a Time: Using IRanges for Non-Integer Overlaps. Retrieved from https://rmflight.github.io/posts/2018-06-23-iranges-for-non-integer-overlaps/
BibTeX citation
@misc{flight2018using, author = {Flight, Robert M}, title = {Deciphering Life: One Bit at a Time: Using IRanges for Non-Integer Overlaps}, url = {https://rmflight.github.io/posts/2018-06-23-iranges-for-non-integer-overlaps/}, year = {2018} }