[Rd] dist function in R is very slow

Stefan Evert stefanML at collocations.de
Sun Jun 18 18:23:20 CEST 2017


> By the way, since the tcrossprod function in the Matrix package is so fast, the Euclidean distance can be computed very fast:

Indeed.

> euc_dist <- function(m) {mtm <- Matrix::tcrossprod(m); sq <- rowSums(m*m);  sqrt(outer(sq,sq,"+") - 2*mtm)}

There are two reasons why I didn't use this optimization in "wordspace":

1) It can be inaccurate for small distances between vectors of large Euclidean length because of loss of significance in the subtraction step.  This is not just a theoretical concern – I've seen data sets were this became a real problem.

2) It incurs substantial memory overhead for a large distance matrix. Your code allocates at least five matrices of this size: outer(…), mtm, 2 * mtm, outer(…) - 2*mtm, and the final result obtained by taking the square root.  [Actually, there is additional overhead for m*m (an even larger matrix) when computing the Euclidean norms, but this could be avoided with sq <- rowNorms(m, method="euclidean").]

I am usually more concerned about RAM than raw processing speed, so the package was designed to keep memory overhead as low as possible and allow users to work with realistic data sets on ordinary laptop computers.


> It takes less than 50 seconds for my (dense) matrix of 5,054 rows and 12,803 columns, while dist.matrix with method="euclidean" takes almost 10 minutes (which is still orders of magnitude faster than dist).

It's a little disappointing that dist.matrix() is still relatively slow despite all simplifications and better cache consistency (the function automatically transposes the input matrix and computes distances by columns rather than rows).  I'm a little surprised about your timing, though.  Testing with a random 5000 x 20000 matrix, my MacBook computers the full Euclidean distance matrix in about 5 minutes.  

If your machine (and version of R) supports OpenMP, you can improve performance by allowing multithreading with wordspace.openmp(threads=<n>).  In my test case, I get a 2.2x speed-up with 4 threads (2m 15s instead of 5m).


Best wishes,
Stefan


More information about the R-devel mailing list