[R] speeding up a pairwise correlation calculation

Fri Nov 21 06:54:35 CET 2003

You probably want to use runif() instead of rnorm() to get an equal
probability of selecting i or j. With abs(rnorm(1)) < 0.5 the first
branch is taken only about 38% of the time.
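
A small sketch of the difference, in case it helps. The values of i and
j here are placeholders for the example, not taken from your data:

```r
# Probability that abs(rnorm(1)) < 0.5, i.e. that column i is the one dropped
p_norm <- 2 * pnorm(0.5) - 1   # noticeably below 0.5

# A fair choice between the two columns (i and j are placeholder numbers)
i <- 2
j <- 3
pick <- if (runif(1) < 0.5) i else j
```

runif(1) is uniform on (0, 1), so comparing it with 0.5 gives each
column the same chance of being dropped.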

Your algorithm is of order n^2 [ 294 choose 2, 293 choose 2, ... ], so
it should not be too slow. But two for() loops are inefficient in R.
Something like this should be fairly fast in C.
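
Short of going to C, one option that may already help a lot is to call
cor() once on the whole matrix and then loop over the precomputed
values. The following is only a sketch under some assumptions: the
function name find_drops is made up for this example, the columns are
assumed to be numeric with no missing values and no constant columns,
and there are assumed to be at least two columns besides column 1.

```r
# Hypothetical helper: same greedy rule as the posted loops, but cor() is
# called once on the whole matrix instead of once per pair of columns.
find_drops <- function(data, r2cut) {
    x <- as.matrix(data[, -1])       # skip column 1, as in the original code
    r2 <- cor(x)^2                   # all pairwise R^2 values at once
    n <- ncol(x)
    dropped <- logical(n)            # TRUE marks a column to delete later
    for (i in 1:(n - 1)) {
        if (dropped[i]) next
        for (j in (i + 1):n) {
            if (dropped[i]) break    # i was just dropped, stop comparing it
            if (dropped[j]) next
            if (r2[i, j] >= r2cut) {
                if (runif(1) < 0.5) dropped[i] <- TRUE else dropped[j] <- TRUE
            }
        }
    }
    which(dropped) + 1               # back to column numbers in `data`
}
```

The loops are still there, but each iteration is now only a matrix
lookup, so the 40,000 or so pairs should not be a problem.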

What is your aim in trying to do this? Your algorithm is similar to
hclust() - which has nice graphical support - but hclust() merges the
two nearest neighbours to find another centroid instead of removing one
of the neighbours. By removing columns at an early stage you are losing
information.

The alternative would be to use hclust(), select a
similarity/dissimilarity cutoff to create groups. Then from each group
you can either choose the average profile or randomly select one column
to represent the group.
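
A rough sketch of that alternative, with several assumptions of mine:
the function name pick_representatives is made up, 1 - R^2 is used as
the dissimilarity, complete linkage is chosen so that every pair inside
a group has R^2 at or above the cutoff, and the data are assumed to be
numeric with no missing values or constant columns.

```r
# Hypothetical helper: cluster the columns by correlation and keep one
# randomly chosen column per cluster.
pick_representatives <- function(data, r2cut) {
    x <- as.matrix(data[, -1])             # skip column 1, as in the original
    d <- as.dist(1 - cor(x)^2)             # dissimilarity = 1 - R^2
    hc <- hclust(d, method = "complete")   # linkage choice is an assumption
    groups <- cutree(hc, h = 1 - r2cut)    # cut the tree at the R^2 cutoff
    # pick one random member from each group of column indices
    keep <- sapply(split(seq_along(groups), groups),
                   function(idx) idx[sample(length(idx), 1)])
    sort(unname(keep)) + 1                 # column numbers in `data`
}
```

To use the average profile instead, the sample() step could be replaced
by something like rowMeans() over the columns in each group. Calling
plot() on the hclust object gives the dendrogram, which may help in
choosing the cutoff.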

--
Adaikalavan Ramasamy 

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Rajarshi Guha
Sent: Friday, November 21, 2003 11:23 AM
To: R
Subject: [R] speeding up a pairwise correlation calculation

Hi,
  I have a data.frame with 294 columns and 211 rows. I am calculating
correlations between all pairs of columns (excluding column 1) and based
on these correlation values I delete one column from any pair that shows
an R^2 greater than a cutoff value. (Rather than directly delete the
column, all I do is store the column number and do the deletion later.)

The code I am using is:

    ndesc <- length(names(data));
    for (i in 2:(ndesc-1)) {
        for (j in (i+1):ndesc) {

            if (i %in% drop || j %in% drop) next;

            r2 <- cor(data[,i],data[,j]);
            r2 <- r2*r2;

            if (r2 >= r2cut) {
                rnd <- abs(rnorm(1));
                if (rnd < 0.5) { drop <- c(drop,i); }
                else { drop <- c(drop,j); }
            }
        }
    }

drop is a vector that contains the column numbers that can be skipped;
data is the data.frame.

For the data.frame mentioned above (279 columns, 211 rows) the
calculation takes more than 7 minutes (after which I Ctrl-C'ed the
calculation). The machine is a 1GHz Duron with 1GB RAM

The output of version is:

platform i686-pc-linux-gnu
arch     i686
os       linux-gnu
system   i686, linux-gnu
status
major    1
minor    7.1
year     2003
month    06
day      16
language R

I'm not too sure why it takes *so* long (I had done a similar
calculation in Python using list operations and it took forever), but is
there any trick that could be used to make this run faster or is this
type of runtime to be expected?

Thanks,
-------------------------------------------------------------------
Rajarshi Guha <rxg218 at psu.edu> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
A red sign on the door of a physics professor: 
'If this sign is blue, you're going too fast.'
