[R] speeding up a pairwise correlation calculation

Rajarshi Guha rxg218 at psu.edu
Fri Nov 21 04:23:27 CET 2003


Hi,
  I have a data.frame with 294 columns and 211 rows. I am calculating
correlations between all pairs of columns (excluding column 1), and based
on these correlation values I delete one column from any pair whose
R^2 exceeds a cutoff value. (Rather than deleting the column directly,
I just store the column number and do the deletion later.)

The code I am using is:

    ndesc <- ncol(data)
    for (i in 2:(ndesc - 1)) {
        for (j in (i + 1):ndesc) {

            # skip columns already marked for deletion
            if (i %in% drop || j %in% drop) next

            # squared correlation between columns i and j
            r2 <- cor(data[, i], data[, j])
            r2 <- r2 * r2

            if (r2 >= r2cut) {
                # randomly choose which column of the pair to drop
                rnd <- abs(rnorm(1))
                if (rnd < 0.5) {
                    drop <- c(drop, i)
                } else {
                    drop <- c(drop, j)
                }
            }
        }
    }

drop is a vector that holds the column numbers to be skipped (and deleted
later), data is the data.frame, and r2cut is the R^2 cutoff.
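
For reference, one alternative I have been playing with (not timed yet) is
to compute the full correlation matrix in a single cor() call and then scan
its upper triangle. This is only a rough sketch: the 0.9 cutoff and the
runif() coin flip are placeholders rather than my actual settings, and it
assumes columns 2 onwards are all numeric.

    cmat <- cor(data[, -1])^2    # squared correlations for columns 2..ndesc
    r2cut <- 0.9                 # placeholder cutoff
    drop <- c()                  # column numbers of 'data' to delete later
    nc <- ncol(cmat)
    for (i in 1:(nc - 1)) {
        for (j in (i + 1):nc) {
            # row/column k of cmat corresponds to column k + 1 of 'data'
            if ((i + 1) %in% drop || (j + 1) %in% drop) next
            if (cmat[i, j] >= r2cut) {
                # keep one column of the pair at random
                if (runif(1) < 0.5) drop <- c(drop, i + 1) else drop <- c(drop, j + 1)
            }
        }
    }
    # the flagged columns can then be removed with data[, -drop]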

For the data.frame described above, the calculation had been running for
more than 7 minutes when I Ctrl-C'ed it. The machine is a 1 GHz Duron with
1 GB of RAM.

The output of version is:

platform i686-pc-linux-gnu
arch     i686
os       linux-gnu
system   i686, linux-gnu
status
major    1
minor    7.1
year     2003
month    06
day      16
language R

I'm not too sure why it takes *so* long (a similar calculation I did in
Python using list operations also took forever). Is there any trick that
could be used to make this run faster, or is this kind of runtime to be
expected?
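
In case it matters, I have been timing individual pieces with system.time();
for example, a single call that builds the whole squared-correlation matrix
(just a measurement sketch, using the same data.frame as above):

    # time one cor() call over all columns except the first;
    # the elapsed time is reported in the result of system.time()
    system.time(cmat <- cor(data[, -1])^2)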

Thanks,
-------------------------------------------------------------------
Rajarshi Guha <rxg218 at psu.edu> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
A red sign on the door of a physics professor: 
'If this sign is blue, you're going too fast.'



