[R] Very slow: using double apply and cor.test to compute correlation p.values for 2 matrices

hadley wickham h.wickham at gmail.com
Wed Nov 26 16:33:59 CET 2008


On Wed, Nov 26, 2008 at 8:14 AM, jim holtman <jholtman at gmail.com> wrote:
> Your time is being taken up in cor.test because you are calling it
> 100,000 times.  So grin and bear it with the amount of work you are
> asking it to do.
>
> Here I am only calling it 100 times:
>
>> m1 <- matrix(rnorm(10000), ncol=100)
>> m2 <- matrix(rnorm(10000), ncol=100)
>> Rprof('/tempxx.txt')
>> system.time(cor.pvalues <- apply(m1, 1, function(x) { apply(m2, 1, function(y) { cor.test(x,y)$p.value }) }))
>   user  system elapsed
>   8.86    0.00    8.89
>>
>
> so my guess is that calling it 100,000 times will take:  100,000 *
> 0.0886 seconds or about 3 hours.

You can make it ~3 times faster by vectorising the testing:

m1 <- matrix(rnorm(10000), ncol=100)
m2 <- matrix(rnorm(10000), ncol=100)

system.time(cor.pvalues <- apply(m1, 1, function(x) {
  apply(m2, 1, function(y) { cor.test(x,y)$p.value })
}))


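system.time({
# Correlation for every pair of rows
r <- apply(m1, 1, function(x) { apply(m2, 1, function(y) { cor(x,y) })})

# Each correlation uses one row, so the sample size is the number of columns
df <- ncol(m1) - 2
# t statistic for Pearson's r, then the two-sided p-value
tstat <- sqrt(df) * r / sqrt(1 - r ^ 2)
p <- pt(tstat, df)
p <- 2 * pmin(p, 1 - p)
})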


all.equal(cor.pvalues, p)
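
To see why this matches cor.test, it may help to check a single pair. The sketch below assumes Pearson correlation with no missing values; cor_pvalue is a hypothetical helper name, not something from the code above. It uses 2 * pt(-abs(t), df), which is equivalent by symmetry to the pmin() form.

```r
# Hypothetical helper (illustration only): two-sided p-value for a
# Pearson correlation r computed from n paired observations.
cor_pvalue <- function(r, n) {
  df <- n - 2
  tstat <- sqrt(df) * r / sqrt(1 - r^2)
  # The t distribution is symmetric, so double the lower tail of -|t|
  2 * pt(-abs(tstat), df)
}

set.seed(1)
x <- rnorm(100)
y <- rnorm(100)

p_manual <- cor_pvalue(cor(x, y), length(x))
p_test <- cor.test(x, y)$p.value

# Should agree up to floating point error
all.equal(p_manual, p_test)
```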


You can make cor much faster by stripping away all the error-checking
code and calling the internal C function directly (suggested by the
Rprof output):


system.time({
r <- apply(m1, 1, function(x) { apply(m2, 1, function(y) { cor(x,y) })})
})

system.time({
r2 <- apply(m1, 1, function(x) { apply(m2, 1, function(y) {
.Internal(cor(x, y, 4L, FALSE)) })})
})

1.5 s vs 0.2 s on my computer.  Combining both changes gives me a ~25-fold
speedup. I suspect you can do even better if you think about
which calculations are being duplicated in the computation of the
correlations.
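
One way to avoid the duplicated work, as a rough sketch rather than a tuned solution: cor() accepts two matrices and correlates their columns, so transposing lets a single call produce every row-by-row correlation. This assumes Pearson correlation and no missing values. The names tstat and p_fast are just for illustration, and it does not rely on .Internal, which depends on R internals and may not behave the same in every R version.

```r
set.seed(1)
m1 <- matrix(rnorm(10000), ncol = 100)
m2 <- matrix(rnorm(10000), ncol = 100)

# cor() works on columns, so transpose to correlate rows.
# Entry [j, i] pairs row j of m2 with row i of m1, which is the
# same layout the nested apply() produces.
r <- cor(t(m2), t(m1))

# Sample size is the length of a row
df <- ncol(m1) - 2
tstat <- sqrt(df) * r / sqrt(1 - r^2)
p_fast <- 2 * pt(-abs(tstat), df)

# Spot-check one entry against cor.test
p_ref <- cor.test(m1[3, ], m2[7, ])$p.value
all.equal(p_fast[7, 3], p_ref)
```

Here the per-row means and standard deviations are computed once inside cor() instead of once per pair, which is where most of the repeated effort in the nested apply() goes.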

Hadley

-- 
http://had.co.nz/


