# Using IRanges for Non-Integer Overlaps

I wanted to make use of IRanges awesome interval logic, but for non-integer data.

R
iranges
development
programming
Author

Robert M Flight

Published

June 23, 2018

## TL;DR

The `IRanges` package implements interval algebra, and is very fast for finding overlaps of two ranges. If you have non-integer data, multiply values by a large constant factor and round them. The constant depends on how much accuracy you need.

## IRanges??

`IRanges` is a bioconductor package for interval algebra of integer ranges. It is used extensively in the `GenomicRanges` package for finding overlaps between various genomic features. For genomic features, integers make sense, because one cannot have fractional base locations.

However, `IRanges` uses red-black trees as its data structure, which provide very fast searches of overlaps. This makes it very attractive for any problem that involves overlapping ranges.

## Motivation

My motivation comes from mass-spectrometry data, where I want to count the number of raw data points and / or the number of peaks in a large number of M/Z windows. Large here means on the order of 1,000,000 M/Z windows.

Generating the windows is not hard, but searching the list of points / peaks for which ones are within the bounds of a window takes a really long time. Long enough that I needed some other method.

## IRanges to the Rescue!

So my idea was to use `IRanges`. But there is a problem, `IRanges` is for integer ranges. How do we use this for non-integer data? Simple, multiply and round the fractional numbers to generate integers.

It turns out that multiplying our mass-spec data by `20,000` gives us differences down to the `0.00005` place, which is more than enough accuracy for the size of the windows we are interested in. If needed, `IRanges` can handle `1600 * 1e6`, but currently will crash at `1600 * 1e7`.

## How Fast Is It?

Lets actually test differences in speed by counting how many overlapping points there are.

``````library(IRanges)
library(ggplot2)

``````IRanges object with 6 ranges and 1 metadata column:
start       end     width |        mz
<integer> <integer> <integer> | <numeric>
[1]   2970182   2970182         1 |   148.509
[2]   2970183   2970183         1 |   148.509
[3]   2970184   2970184         1 |   148.509
[4]   2970186   2970186         1 |   148.509
[5]   3000526   3000526         1 |   150.026
[6]   3000527   3000527         1 |   150.026``````
``head(mz_windows)``
``````IRanges object with 6 ranges and 2 metadata columns:
start       end     width |  mz_start    mz_end
<integer> <integer> <integer> | <numeric> <numeric>
[1]   2960000   2960011        12 |       148   148.001
[2]   2960001   2960012        12 |       148   148.001
[3]   2960002   2960013        12 |       148   148.001
[4]   2960003   2960014        12 |       148   148.001
[5]   2960004   2960016        13 |       148   148.001
[6]   2960006   2960017        12 |       148   148.001``````

I have some example data with 3447542 windows, and 991816 points. We will count how many point there are in each window using the below functions, with differing number of windows.

### Functions

``````count_overlaps_naive <- function(mz_start, mz_end, points){
sum((points >= mz_start) & (points <= mz_end))
}

iterate_windows <- function(windows, points){
purrr::pmap_int(windows, count_overlaps_naive, points)
}

run_times_iterating <- function(windows, points){
t <- Sys.time()
window_counts <- iterate_windows(windows, points)
t2 <- Sys.time()
run_time <- difftime(t2, t, units = "secs")
run_time
}

run_times_countoverlaps <- function(windows, points){
t <- Sys.time()
window_counts <- countOverlaps(points, windows)
t2 <- Sys.time()
run_time <- difftime(t2, t, units = "secs")
run_time
}``````

### Define Samples of Different Sizes

``````set.seed(1234)

sample_sizes <- c(10, 100, 1000, 10000, 50000, 100000)

window_samples <- purrr::map(sample_sizes, function(x){sample(length(mz_windows), size = x)})``````

### Run It

``````iranges_times <- purrr::map_dbl(window_samples, function(x){
run_times_countoverlaps(mz_windows[x], mz_points)
})

window_frame <- as.data.frame(mcols(mz_windows))

naive_times <- purrr::map_dbl(window_samples, function(x){
run_times_iterating(window_frame[x, ], mz_points)
})``````

### Plot Them

``````all_times <- data.frame(size = rep(sample_sizes, 2),
time = c(iranges_times, naive_times),
method = rep(c("iranges", "naive"), each = 6))

p <- ggplot(all_times, aes(x = log10(size), y = time, color = method)) + geom_point() + geom_line() + labs(y = "time (s)", x = "log10(# of windows)", title = "Naive & IRanges Timings") + theme(legend.position = c(0.2, 0.8))
p``````

``p + coord_cartesian(ylim = c(0, 1))``

As the two figures show, the naive solution, while a little faster under 1000 regions, is quickly outperformed by `IRanges`, whose time increases much more slowly.

## Citation

BibTeX citation:
``````@online{mflight2018,
author = {Robert M Flight},
title = {Using {IRanges} for {Non-Integer} {Overlaps}},
date = {2018-06-23},
url = {https://rmflight.github.io/posts/2018-06-23-iranges-for-non-integer-overlaps},
langid = {en}
}
``````