```r
# remotes::install_github("moseleybioinformaticslab/icikendalltau")
# install.packages("microbenchmark")
library(ICIKendallTau)
```
## TL;DR

Check out our {ICIKendallTau} R package if you want access to a fast version of Kendall-tau correlation in R (Robert M. Flight and Moseley 2022b), with only a few dependencies.
## Kendall Tau??

Yes, Kendall-tau correlation! It is a rank-based correlation that uses the number of concordant and discordant pairs of points. This graphic from Wikipedia explains it really well (Wikipedia, n.d.).
```r
knitr::include_graphics("Concordant_Points_Kendall_Correlation.svg")
```
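Formally, for \(n\) points with \(n_c\) concordant pairs and \(n_d\) discordant pairs (and no ties), the correlation is:

$$
\tau = \frac{n_c - n_d}{n(n - 1)/2}
$$

The denominator is the total number of pairs, so \(\tau\) runs from -1 (all pairs discordant) to +1 (all pairs concordant); the tie corrections that show up in the code below adjust this denominator, giving the tau-b variant that `stats::cor` also computes.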
This is a really useful correlation to use if you don't want to worry about how linearly related things are, just whether the points from two samples go in the same direction or not. In addition, we think there is a neat variant we can make to incorporate the presence of missing values when they are missing not at random, and we have a preprint on that which I am working on revising (Robert M. Flight, Bhatt, and Moseley 2022).
## Need for Speed!

However, there is an issue with the basic Kendall-tau algorithm: it is slower than molasses going uphill in January (as my parents used to say), especially as we increase to correlations calculated using tens of thousands of features in both x and y.
```r
x_large = rnorm(10000)
y_large = rnorm(10000)
microbenchmark::microbenchmark(cor(x_large, y_large, method = "kendall"),
                               times = 20)
```
```
Unit: seconds
                                       expr      min       lq     mean   median       uq      max neval
 cor(x_large, y_large, method = "kendall") 1.086599 1.104935 1.146724 1.127946 1.137544 1.417044    20
```
That took a full second! And the sizes of things we frequently want to calculate for eukaryotic transcriptomics are 2X - 4X that, across many, many samples. So if we can speed it up, that would be awesome.

I will note that through all of this work, I've already been through this once in the development of our {ICIKendallTau} package, so I already know the answer. However, I felt it would be useful to work through all of this again to help others who might be looking at similar types of problems. And yes, some of the code below seems silly, but these are first stabs at an implementation.
## Differences of Signs

The simplest way to find the concordant and discordant pairs is to generate all the possible pair indices of the points, and then compare their directions; the same direction of a pair in both X and Y means they are concordant, and different directions means they are discordant.
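To make the sign logic concrete, here is a tiny illustration (the variable names here are just for this example):

```r
# One pair of points in x, and two possible y's
x_pair = c(1, 3)
y_same = c(2, 5)   # y increases along with x
y_flip = c(5, 2)   # y decreases while x increases
sign(x_pair[1] - x_pair[2]) * sign(y_same[1] - y_same[2])  # 1: concordant
sign(x_pair[1] - x_pair[2]) * sign(y_flip[1] - y_flip[2])  # -1: discordant
```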
And in fact, that is exactly what is happening in the C code for R's covariance / correlation functions, iterating over each pair of points (`k` and `n1`; snippet shown here from lines 108 - 118 of the file src/library/stats/src/cov.c in R-4.3.0).
```c
else { /* Kendall's tau */ \
    for(n1=0 ; n1 < k ; n1++) \
        if(!(ISNAN(xx[n1]) || ISNAN(yy[n1]))) { \
            xm = sign(xx[k] - xx[n1]); \
            ym = sign(yy[k] - yy[n1]); \
            \
            COV_SUM_UPDATE \
        } \
} \
```
## R Based, Copy All Pairs

What can we do in R that is similar? We can generate all the pairwise indices, create actual vectors of the pairs, and then get the signs of the differences, maybe?
```r
reference_ici = function(x, y)
{
  n_x = length(x)
  n_y = length(y)
  pairs = combn(n_x, 2)
  x_pairs = rbind(x[pairs[1, ]],
                  x[pairs[2, ]])
  y_pairs = rbind(y[pairs[1, ]],
                  y[pairs[2, ]])
  x_sign = sign(x_pairs[1, ] - x_pairs[2, ])
  y_sign = sign(y_pairs[1, ] - y_pairs[2, ])
  x_y_sign = x_sign * y_sign
  sum_concordant = sum(x_y_sign > 0)
  sum_discordant = sum(x_y_sign < 0)

  x_is_dup = duplicated(x)
  x_dup = x[x_is_dup]
  x_tied_values_t1 = table(x_dup) + 1
  y_is_dup = duplicated(y)
  y_dup = y[y_is_dup]
  y_tied_values_t2 = table(y_dup) + 1

  x_tied_sum_t1 = sum(x_tied_values_t1 * (x_tied_values_t1 - 1)) / 2
  y_tied_sum_t2 = sum(y_tied_values_t2 * (y_tied_values_t2 - 1)) / 2
  t_0 = n_x * (n_x - 1) / 2

  k_denominator = sqrt((t_0 - x_tied_sum_t1) * (t_0 - y_tied_sum_t2))
  k_numerator = sum_concordant - sum_discordant

  k_tau = k_numerator / k_denominator

  k_tau
}
```
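Before benchmarking, a minimal sanity check (my addition, with hypothetical variable names) that this agrees with `stats::cor`:

```r
# reference_ici should match stats::cor's Kendall tau on untied data
set.seed(42)
x_check = rnorm(100)
y_check = rnorm(100)
all.equal(reference_ici(x_check, y_check),
          cor(x_check, y_check, method = "kendall"))
# should be TRUE
```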
Let's see how long this takes as a baseline, and how it compares to the `stats::cor` function.
```r
set.seed(1234)
x = rnorm(1000)
y = rnorm(1000)
microbenchmark::microbenchmark(cor(x, y, method = "kendall"),
                               reference_ici(x, y),
                               times = 20)
```
```
Unit: milliseconds
                          expr       min        lq      mean    median        uq       max neval
 cor(x, y, method = "kendall")  11.44744  12.47315  12.84324  13.07735  13.28316  13.63687    20
           reference_ici(x, y) 351.74901 368.20867 392.19704 387.35215 417.69942 432.55782    20
```
Not so great, though not that surprising, given the main correlation algorithm in R is written in C. Let's see if we can speed it up. Although our code seems vectorized here, significant time goes into creating the large matrices that hold the pair indices, and then creating the pairs themselves. If we can avoid creating those large matrices, we can probably improve the speed.
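To give a sense of the scale of those intermediate objects: with 1000 values there are already nearly half a million pairs, so each of the pair matrices built above has that many columns.

```r
# Number of pairs for n = 1000 values; each pair matrix above
# has this many columns
choose(1000, 2)
#> [1] 499500
```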
## R Based, Increment Count

```r
iterators_ici = function(x, y)
{
  n_entry = length(x)
  sum_concordant = 0
  sum_discordant = 0
  for (i in seq(1, n_entry - 1)) {
    for (j in seq(i + 1, n_entry)) {
      sum_concordant = sum_concordant + ((sign(x[i] - x[j]) * sign(y[i] - y[j])) > 0)
      sum_discordant = sum_discordant + ((sign(x[i] - x[j]) * sign(y[i] - y[j])) < 0)
    }
  }
  k_numerator = sum_concordant - sum_discordant

  x_is_dup = duplicated(x)
  x_dup = x[x_is_dup]
  x_tied_values_t1 = table(x_dup) + 1
  y_is_dup = duplicated(y)
  y_dup = y[y_is_dup]
  y_tied_values_t2 = table(y_dup) + 1

  x_tied_sum_t1 = sum(x_tied_values_t1 * (x_tied_values_t1 - 1)) / 2
  y_tied_sum_t2 = sum(y_tied_values_t2 * (y_tied_values_t2 - 1)) / 2
  t_0 = n_entry * (n_entry - 1) / 2

  k_denominator = sqrt((t_0 - x_tied_sum_t1) * (t_0 - y_tied_sum_t2))
  k_numerator = sum_concordant - sum_discordant

  k_tau = k_numerator / k_denominator

  k_tau
}
```
```r
microbenchmark::microbenchmark(cor(x, y, method = "kendall"),
                               reference_ici(x, y),
                               iterators_ici(x, y),
                               times = 20)
```
```
Unit: milliseconds
                          expr       min        lq      mean    median        uq       max neval
 cor(x, y, method = "kendall")  11.55812  12.49837  12.83441  12.69495  13.19846  14.45556    20
           reference_ici(x, y) 357.29120 371.24522 387.07048 378.65192 403.50524 438.96896    20
           iterators_ici(x, y) 279.28479 283.04712 289.51380 284.79317 290.43204 335.47143    20
```
All right! We've got some decent improvement over our base attempt. We also know, thanks to the SciPy project, that there is a better way to get the number of concordant and discordant pairs (which is what we use in {ICIKendallTau}). Let's see if we can manage that in pure R, and see if it helps us out any.
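The core idea of the sort-based approach (sketched here with hypothetical variable names, and assuming no ties): after sorting the points by x, the number of discordant pairs is exactly the number of inversions in y, which the real algorithm counts in \(\mathcal{O}(n\log{}n)\) time rather than by brute force.

```r
# After sorting by x, discordant pairs = inversions in y.
# Brute-force count, for illustration only.
x_demo = c(3, 1, 2)
y_demo = c(2, 1, 3)
y_sorted = y_demo[order(x_demo)]
n_demo = length(y_sorted)
sum(outer(1:n_demo, 1:n_demo,
          function(i, j) (i < j) & (y_sorted[i] > y_sorted[j])))
# 1, matching the single discordant pair among these three points
```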
## R Based, Sort

```r
sort_ici = function(x, y)
{
  get_sort = function(values)
  {
    value_order = order(values, method = "radix")
  }

  compare_self = function(x){
    n_entry = length(x)
    match_self = vector("integer", n_entry)
    match_self[1] = 1L

    idx = 2

    for (i in seq(2, n_entry)) {
      if (x[i] != x[(i - 1)]) {
        match_self[idx] = 1L
      } else {
        match_self[idx] = 0L
      }
      idx = idx + 1
    }
    return(match_self)
  }

  compare_both = function(x, y)
  {
    n_entry = length(x)
    match_self = vector("integer", n_entry + 1)
    match_self[1] = 1L

    idx = 2

    for (i in seq(2, n_entry)) {
      if ((x[i] != x[(i - 1)]) || (y[i] != y[(i - 1)])) {
        match_self[idx] = 1L
      } else {
        match_self[idx] = 0L
      }
      idx = idx + 1
    }
    match_self[n_entry + 1] = 1
    return(match_self)
  }

  count_rank_tie = function(ranks)
  {
    dup_ranks = duplicated(ranks)
    ranks2 = ranks[dup_ranks]
    number_tied = table(ranks2) + 1

    return(list(ntie = sum(number_tied * (number_tied - 1)) / 2,
                t0 = sum(number_tied * (number_tied - 1) * (number_tied - 2)) / 2,
                t1 = sum(number_tied * (number_tied - 1) * (2 * number_tied + 5))))
  }

  which_notzero = function(x){
    notzero = vector("integer", length(x))
    idx = 1L

    for (i in seq(1, length(x))) {
      if (x[i] != 0) {
        notzero[idx] = i - 1
        idx = idx + 1L
      }
    }
    keep_loc = seq(1, idx - 1)
    notzero = notzero[keep_loc]
    return(notzero)
  }

  kendall_discordant = function(x, y){
    #x = x4
    #y = y4
    sup = 1 + max(y)

    arr = vector("integer", sup)
    i = 0
    k = 0
    n = length(x)
    idx = 1L
    dis = 0

    while (i < n) {
      while ((k < n) && (x[i + 1] == x[k + 1])) {
        dis = dis + i
        idx = y[k + 1]
        while (idx != 0) {
          dis = dis - arr[idx + 1]
          idx = bitwAnd(idx, idx - 1)
        }
        k = k + 1
      }
      while (i < k) {
        idx = y[i + 1]
        while (idx < sup) {
          arr[idx + 1] = arr[idx + 1] + 1
          idx = idx + bitwAnd(idx, (-1*idx))
        }
        i = i + 1
      }
    }
    dis
  }

  n_entry = length(x)
  perm_y = get_sort(y)
  x = x[perm_y]
  y = y[perm_y]
  y3 = compare_self(y)
  y4 = cumsum(y3)

  perm_x = get_sort(x)
  x = x[perm_x]
  y4 = y4[perm_x]
  x3 = compare_self(x)
  x4 = cumsum(x3)

  obs = compare_both(x4, y4)
  sum_obs = sum(obs)
  cnt = diff(which_notzero(obs))
  dis = kendall_discordant(x4, y4)

  ntie = sum((cnt * (cnt - 1)) / 2)
  x_counts = count_rank_tie(x4)
  xtie = x_counts[[1]]
  x0 = x_counts[[2]]
  x1 = x_counts[[3]]

  y_counts = count_rank_tie(y4)
  ytie = y_counts[[1]]
  y0 = y_counts[[2]]
  y1 = y_counts[[3]]

  tot = (n_entry * (n_entry - 1)) / 2
  con_minus_dis = tot - xtie - ytie + ntie - 2 * dis
  tau = con_minus_dis / sqrt((tot - xtie) * (tot - ytie))
  return(tau)
}
```
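As a quick check before benchmarking (my addition, reusing the x and y from above), the translation should agree with base R:

```r
# sort_ici should agree with stats::cor
all.equal(sort_ici(x, y), cor(x, y, method = "kendall"))
# should be TRUE
```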
```r
microbenchmark::microbenchmark(cor(x, y, method = "kendall"),
                               reference_ici(x, y),
                               iterators_ici(x, y),
                               sort_ici(x, y),
                               times = 20)
```
```
Unit: milliseconds
                          expr        min        lq      mean     median        uq       max neval
 cor(x, y, method = "kendall")  11.709530  12.33871  12.63373  12.488732  12.87524  14.06242    20
           reference_ici(x, y) 355.224637 367.46692 387.70370 374.809739 421.60254 441.29727    20
           iterators_ici(x, y) 273.673038 276.02535 280.80801 281.919287 283.32639 289.62743    20
                sort_ici(x, y)   8.760096   8.98937  13.48845   9.395173  10.02100  87.38422    20
```
So, I thought I could implement the sorting method in R, and it would help. It helps, in that it gets back to roughly C speed, in R. Which isn't bad, but still isn't great. This is more than likely because I don't actually understand the sort-based Kendall-tau algorithm at a theoretical level. I was able to copy it from the Python into {Rcpp}, and use a lot of debugging and test cases to make sure I had it implemented correctly, but there are things it does that I don't understand algorithmically, which means I don't know the best way to create an R-centric method for them. For this example, I literally just translated my previous {Rcpp} code into R, which of course doesn't necessarily make for fast code.
## C++ Based, Differences

As a reference, I also implemented the iterating differences algorithm in {Rcpp}. Note that this is not meant to be called by normal users of {ICIKendallTau}; we have it there as a reference to make sure that everything else is working properly.
```r
microbenchmark::microbenchmark(cor(x, y, method = "kendall"),
                               reference_ici(x, y),
                               iterators_ici(x, y),
                               sort_ici(x, y),
                               ICIKendallTau:::ici_kt_pairs(x, y),
                               times = 20)
```
```
Unit: milliseconds
                               expr        min         lq       mean     median         uq        max neval
      cor(x, y, method = "kendall")  11.829116  12.051790  12.872894  12.815848  13.380157  14.977161    20
                reference_ici(x, y) 347.066153 382.616886 400.650147 395.033647 418.945940 465.485468    20
                iterators_ici(x, y) 271.020774 273.111584 284.668516 274.988955 292.175271 342.778833    20
                     sort_ici(x, y)   8.978789   9.393873  10.498460  10.161402  11.554007  13.605138    20
 ICIKendallTau:::ici_kt_pairs(x, y)   4.336266   4.493440   4.806514   4.771994   4.891025   6.397218    20
```
We can see that it is faster than the C-based `cor` by maybe 2X for vectors 1000 entries long. Given that `ici_kt_pairs` has only the single iterator logic for the one correlation method, that makes sense.
## C++ Based, Sort

Let's compare the actual sort-based function that is implemented in the {ICIKendallTau} package (which you can see in (Robert M. Flight and Moseley 2022a)).
```r
microbenchmark::microbenchmark(cor(x, y, method = "kendall"),
                               reference_ici(x, y),
                               iterators_ici(x, y),
                               sort_ici(x, y),
                               ici_kt(x, y),
                               times = 20)
```
```
Unit: microseconds
                          expr        min          lq        mean     median         uq        max neval
 cor(x, y, method = "kendall")  11922.472  12154.2850  12992.5719  12916.488  13522.521  16085.135    20
           reference_ici(x, y) 352612.664 380621.5915 399187.5410 393977.078 424152.088 444856.163    20
           iterators_ici(x, y) 266384.200 270873.1345 286840.9181 273214.937 289494.598 338773.242    20
                sort_ici(x, y)   8862.455   9607.5720  10356.4879  10301.072  11140.300  12458.339    20
                  ici_kt(x, y)    216.761    245.6465    273.3186    265.784    291.693    367.443    20
```
That's fast! Notice that the times moved from milliseconds to microseconds; the {ICIKendallTau} version is ~50X faster than the base R version. Let's increase the size of our vectors by 10X and compare the run times again.
```r
x_large = rnorm(10000)
y_large = rnorm(10000)
microbenchmark::microbenchmark(cor(x, y, method = "kendall"),
                               cor(x_large, y_large, method = "kendall"),
                               ici_kt(x, y),
                               ici_kt(x_large, y_large),
                               times = 20)
```
```
Unit: microseconds
                                      expr         min           lq         mean       median          uq         max neval
             cor(x, y, method = "kendall")   11654.425   11757.9275   12088.3777   11816.745   12018.914   15825.794    20
 cor(x_large, y_large, method = "kendall") 1134471.513 1149212.6425 1201616.7482 1168728.874 1243587.177 1456530.477    20
                              ici_kt(x, y)     211.174     228.8565     259.2706     247.703     265.988     531.182    20
                  ici_kt(x_large, y_large)    2341.995    2370.5770    2643.4572    2392.539    2453.216    4024.673    20
```
The base R method takes 100X more time, but the sort-based method of {ICIKendallTau} takes only 10X more. This is the advantage of the sort method: its complexity is \(\mathcal{O}(n\log{}n)\), vs \(\mathcal{O}(n^2)\) for the pairwise methods.
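Those observed slowdowns are roughly what the two complexities predict when moving from n = 1000 to n = 10000; a quick back-of-the-envelope check:

```r
# O(n^2): 10X more points -> 100X slower
(10000^2) / (1000^2)
#> [1] 100
# O(n log n): 10X more points -> ~13X slower, close to the ~10X observed
(10000 * log(10000)) / (1000 * log(1000))
#> [1] 13.33333
```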
## References
## Citation
```bibtex
@online{mflight2023,
  author = {Robert M Flight},
  title = {Fast {Kendall-tau} {Correlation} with {Rcpp}},
  date = {2023-05-29},
  url = {https://rmflight.github.io/posts/2023-05-29-fast-kendalltau-correlation-with-rcpp},
  langid = {en}
}
```