## TL;DR

The `IRanges`

package implements interval algebra, and is very fast for finding overlaps of two ranges. If you have non-integer data, multiply values by a **large** constant factor and round them. The constant depends on how much accuracy you need.

## IRanges??

`IRanges`

is a bioconductor package for interval algebra of **i**nteger **ranges**. It is used extensively in the `GenomicRanges`

package for finding overlaps between various genomic features. For genomic features, **integers** make sense, because one cannot have fractional base locations.

However, `IRanges`

uses red-black trees as its data structure, which provide very fast searches of overlaps. This makes it very attractive for **any** problem that involves overlapping ranges.

## Motivation

My motivation comes from mass-spectrometry data, where I want to count the number of raw data points and / or the number of peaks in a **large** number of M/Z windows. Large here means on the order of 1,000,000 M/Z windows.

Generating the windows is not hard, but searching the list of points / peaks for which ones are within the bounds of a window takes **a really long time**. Long enough that I needed some other method.

## IRanges to the Rescue!

So my idea was to use `IRanges`

. But there is a problem, `IRanges`

is for integer ranges. How do we use this for non-integer data? Simple, multiply and round the fractional numbers to generate integers.

It turns out that multiplying our mass-spec data by `20,000`

gives us differences down to the `0.00005`

place, which is more than enough accuracy for the size of the windows we are interested in. If needed, `IRanges`

can handle `1600 * 1e6`

, but currently will crash at `1600 * 1e7`

.

## How Fast Is It?

Lets actually test differences in speed by counting how many overlapping points there are.

```
library(IRanges)
library(ggplot2)
load("../../data/iranges_example_data.rda")
head(mz_points)
```

```
## IRanges object with 6 ranges and 1 metadata column:
## start end width | mz
## <integer> <integer> <integer> | <numeric>
## [1] 2970182 2970182 1 | 148.509100524992
## [2] 2970183 2970183 1 | 148.509161249802
## [3] 2970184 2970184 1 | 148.50922197465
## [4] 2970186 2970186 1 | 148.509282699535
## [5] 3000526 3000526 1 | 150.026298638453
## [6] 3000527 3000527 1 | 150.026360296201
```

`head(mz_windows)`

```
## IRanges object with 6 ranges and 2 metadata columns:
## start end width | mz_start mz_end
## <integer> <integer> <integer> | <numeric> <numeric>
## [1] 2960000 2960011 12 | 148 148.000554529159
## [2] 2960001 2960012 12 | 148.000055452916 148.000609982075
## [3] 2960002 2960013 12 | 148.000110905832 148.000665434991
## [4] 2960003 2960014 12 | 148.000166358748 148.000720887907
## [5] 2960004 2960016 13 | 148.000221811664 148.000776340823
## [6] 2960006 2960017 12 | 148.00027726458 148.000831793739
```

I have some example data with 3447542 windows, and 991816 points. We will count how many point there are in each window using the below functions, with differing number of windows.

### Functions

```
count_overlaps_naive <- function(mz_start, mz_end, points){
sum((points >= mz_start) & (points <= mz_end))
}
iterate_windows <- function(windows, points){
purrr::pmap_int(windows, count_overlaps_naive, points)
}
run_times_iterating <- function(windows, points){
t <- Sys.time()
window_counts <- iterate_windows(windows, points)
t2 <- Sys.time()
run_time <- difftime(t2, t, units = "secs")
run_time
}
run_times_countoverlaps <- function(windows, points){
t <- Sys.time()
window_counts <- countOverlaps(points, windows)
t2 <- Sys.time()
run_time <- difftime(t2, t, units = "secs")
run_time
}
```

### Define Samples of Different Sizes

```
set.seed(1234)
sample_sizes <- c(10, 100, 1000, 10000, 50000, 100000)
window_samples <- purrr::map(sample_sizes, function(x){sample(length(mz_windows), size = x)})
```

### Run It

```
iranges_times <- purrr::map_dbl(window_samples, function(x){
run_times_countoverlaps(mz_windows[x], mz_points)
})
window_frame <- as.data.frame(mcols(mz_windows))
naive_times <- purrr::map_dbl(window_samples, function(x){
run_times_iterating(window_frame[x, ], mz_points)
})
```

### Plot Them

```
all_times <- data.frame(size = rep(sample_sizes, 2),
time = c(iranges_times, naive_times),
method = rep(c("iranges", "naive"), each = 6))
p <- ggplot(all_times, aes(x = log10(size), y = time, color = method)) + geom_point() + geom_line() + labs(y = "time (s)", x = "log10(# of windows)", title = "Naive & IRanges Timings") + theme(legend.position = c(0.2, 0.8))
p
```

`p + ylim(c(0, 1))`

As the two figures show, the naive solution, while a little faster under 1000 regions, is quickly outperformed by `IRanges`

, whose time increases much more slowly.