# Using IRanges for Non-Integer Overlaps

## TL;DR

The `IRanges` package implements interval algebra, and is very fast for finding overlaps of two ranges. If you have non-integer data, multiply values by a large constant factor and round them. The constant depends on how much accuracy you need.

## IRanges??

`IRanges` is a bioconductor package for interval algebra of integer ranges. It is used extensively in the `GenomicRanges` package for finding overlaps between various genomic features. For genomic features, integers make sense, because one cannot have fractional base locations.

However, `IRanges` uses red-black trees as its data structure, which provide very fast searches of overlaps. This makes it very attractive for any problem that involves overlapping ranges.

## Motivation

My motivation comes from mass-spectrometry data, where I want to count the number of raw data points and / or the number of peaks in a large number of M/Z windows. Large here means on the order of 1,000,000 M/Z windows.

Generating the windows is not hard, but searching the list of points / peaks for which ones are within the bounds of a window takes a really long time. Long enough that I needed some other method.

## IRanges to the Rescue!

So my idea was to use `IRanges`. But there is a problem, `IRanges` is for integer ranges. How do we use this for non-integer data? Simple, multiply and round the fractional numbers to generate integers.

It turns out that multiplying our mass-spec data by `20,000` gives us differences down to the `0.00005` place, which is more than enough accuracy for the size of the windows we are interested in. If needed, `IRanges` can handle `1600 * 1e6`, but currently will crash at `1600 * 1e7`.

## How Fast Is It?

Lets actually test differences in speed by counting how many overlapping points there are.

``````library(IRanges)
library(ggplot2)
load("../../extdata/iranges_example_data.rda")

head(mz_points)``````
``````## IRanges object with 6 ranges and 1 metadata column:
##           start       end     width |               mz
##       <integer> <integer> <integer> |        <numeric>
##      2970182   2970182         1 | 148.509100524992
##      2970183   2970183         1 | 148.509161249802
##      2970184   2970184         1 |  148.50922197465
##      2970186   2970186         1 | 148.509282699535
##      3000526   3000526         1 | 150.026298638453
##      3000527   3000527         1 | 150.026360296201``````
``head(mz_windows)``
``````## IRanges object with 6 ranges and 2 metadata columns:
##           start       end     width |         mz_start           mz_end
##       <integer> <integer> <integer> |        <numeric>        <numeric>
##      2960000   2960011        12 |              148 148.000554529159
##      2960001   2960012        12 | 148.000055452916 148.000609982075
##      2960002   2960013        12 | 148.000110905832 148.000665434991
##      2960003   2960014        12 | 148.000166358748 148.000720887907
##      2960004   2960016        13 | 148.000221811664 148.000776340823
##      2960006   2960017        12 |  148.00027726458 148.000831793739``````

I have some example data with 3447542 windows, and 991816 points. We will count how many point there are in each window using the below functions, with differing number of windows.

### Functions

``````count_overlaps_naive <- function(mz_start, mz_end, points){
sum((points >= mz_start) & (points <= mz_end))
}

iterate_windows <- function(windows, points){
purrr::pmap_int(windows, count_overlaps_naive, points)
}

run_times_iterating <- function(windows, points){
t <- Sys.time()
window_counts <- iterate_windows(windows, points)
t2 <- Sys.time()
run_time <- difftime(t2, t, units = "secs")
run_time
}

run_times_countoverlaps <- function(windows, points){
t <- Sys.time()
window_counts <- countOverlaps(points, windows)
t2 <- Sys.time()
run_time <- difftime(t2, t, units = "secs")
run_time
}``````

### Define Samples of Different Sizes

``````set.seed(1234)

sample_sizes <- c(10, 100, 1000, 10000, 50000, 100000)

window_samples <- purrr::map(sample_sizes, function(x){sample(length(mz_windows), size = x)})``````

### Run It

``````iranges_times <- purrr::map_dbl(window_samples, function(x){
run_times_countoverlaps(mz_windows[x], mz_points)
})

window_frame <- as.data.frame(mcols(mz_windows))

naive_times <- purrr::map_dbl(window_samples, function(x){
run_times_iterating(window_frame[x, ], mz_points)
})``````

### Plot Them

``````all_times <- data.frame(size = rep(sample_sizes, 2),
time = c(iranges_times, naive_times),
method = rep(c("iranges", "naive"), each = 6))

p <- ggplot(all_times, aes(x = log10(size), y = time, color = method)) + geom_point() + geom_line() + labs(y = "time (s)", x = "log10(# of windows)", title = "Naive & IRanges Timings") + theme(legend.position = c(0.2, 0.8))
p`````` ``p + ylim(c(0, 1))`` As the two figures show, the naive solution, while a little faster under 1000 regions, is quickly outperformed by `IRanges`, whose time increases much more slowly.