[Rd] dist function in R is very slow
Moshe Olshansky
m_olshansky at yahoo.com
Sat Jun 17 08:47:56 CEST 2017
Dear R developers,
I am visualising high-dimensional genomic data, and for this purpose I need to compute pairwise distances between many points in a high-dimensional space (say I have a matrix of 5,000 rows and 20,000 columns, so the result is a 5,000 x 5,000 matrix or its upper triangle).

Computing this in R takes many hours (I am doing this on a Linux server with more than 100 GB of RAM, so memory is not the problem). When I write the matrix to disk, read it and compute the distances in C, write them to disk and read them back into R, it takes 10 - 15 minutes (and I did not spend much time on optimising my C code).

The question is: why is the R function so slow? I understand that it calls C (or C++) to compute the distances. My suspicion is that the transposed matrix is passed to C, so that each distance is computed between two columns of a matrix. Since C stores matrices by rows, this is very inefficient and causes many cache misses (my first C implementation was like this, and I had to stop the run after an hour when it had still not completed).

If my suspicion is correct, is it possible to rewrite the dist function so that it works faster on large matrices?
Best regards,
Moshe Olshansky
Monash University