[R] [ExternalEmail] Pearson Correlation Speed

Charles C. Berry cberry at tajo.ucsd.edu
Mon Dec 15 19:00:26 CET 2008


On Mon, 15 Dec 2008, Nathan S. Watson-Haigh wrote:

> Nathan S. Watson-Haigh wrote:
>> I'm trying to calculate Pearson correlation coefficients for a large
>> matrix of size 18563 x 18563. The following function takes about XX
>> minutes to complete, and I'd like to do this calculation about 15 times
>> and so speed is somewhat of an issue.

I think you are on the wrong track, Nathan.

The matrix you are starting with is 18563 x 18563, and the result of 
finding the correlations amongst the columns of that matrix is also 18563 
x 18563. Stored as doubles, each takes about 2.76 Gigabytes (18563^2 * 8 
bytes), so it will require more than 5 Gigabytes of memory to hold the 
result and the original matrix.
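
A quick way to check that figure, as a rough sketch that assumes both 
matrices are stored as 8-byte doubles and ignores any working copies that 
cor() makes along the way (the variable names are only illustrative):

```r
# Back-of-envelope memory estimate; assumes 8-byte doubles throughout.
n <- 18563
bytes_per_matrix <- n^2 * 8            # one n x n matrix of doubles
gb_per_matrix <- bytes_per_matrix / 1e9
total_gb <- 2 * gb_per_matrix          # input matrix plus correlation matrix
cat(sprintf("per matrix: %.2f GB, both: %.2f GB\n", gb_per_matrix, total_gb))
```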

The time needed to do the calculation is likely inflated by caching 
issues and, if your machine has too little memory to hold the result and 
all the intermediate pieces, by swapping as well.

You can finesse these problems by breaking the job into smaller pieces, 
say computing the correlations between each pair of 19 blocks of 977 
columns (columns 1:977, 977+1:977, ..., 18*977+1:977; 19 * 977 = 18563), 
then assembling the results.
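
A minimal sketch of that blockwise approach, using a small random matrix 
in place of your data. The toy dimensions, the block size of 4 and the 
names (blocks, corr, cij) are assumptions for illustration only; for the 
real problem the block size would be 977.

```r
# Toy data standing in for the real 18563-column matrix (an assumption).
set.seed(1)
dat <- matrix(rnorm(50 * 12), nrow = 50, ncol = 12)

block_size <- 4                        # would be 977 for the real problem
p <- ncol(dat)
starts <- seq(1, p, by = block_size)
blocks <- lapply(starts, function(s) s:min(s + block_size - 1, p))

corr <- matrix(NA_real_, p, p)
for (i in seq_along(blocks)) {
  for (j in i:length(blocks)) {        # upper triangle of block pairs only
    bi <- blocks[[i]]
    bj <- blocks[[j]]
    # correlations between the columns of block i and those of block j
    cij <- abs(cor(dat[, bi, drop = FALSE], dat[, bj, drop = FALSE],
                   use = "pairwise.complete.obs"))
    corr[bi, bj] <- cij
    corr[bj, bi] <- t(cij)             # fill the mirror block by symmetry
  }
}
```

Note that this sketch still assembles the full matrix in memory. If that 
is the bottleneck, each cij could instead be written to disk or reduced 
to whatever summary you actually need before moving to the next pair.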

---

BTW, R already has the necessary machinery to calculate the crossproduct 
matrix (etc) needed to find the correlations. You can access the low level 
linear algebra that R uses. You can marry R to an optimized BLAS if you 
like.
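
As a rough illustration of that machinery, here is a sketch that gets the 
correlation matrix from crossprod() on centred, unit-length columns. It 
assumes there are no missing values, so it is not a drop-in replacement 
for use="p"; the toy matrix and the names xc, xs and r are made up for 
the example.

```r
# Toy data with no missing values (an assumption of this sketch).
set.seed(2)
x <- matrix(rnorm(40 * 6), nrow = 40, ncol = 6)

xc <- scale(x, center = TRUE, scale = FALSE)   # subtract column means
xs <- sweep(xc, 2, sqrt(colSums(xc^2)), "/")   # scale columns to unit length
r <- crossprod(xs)                             # t(xs) %*% xs, done by the BLAS
```

With an optimized BLAS, the crossprod() call is where any speed-up would 
show.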

So pulling in some other code to do this will not save you anything. If 
you ever do decide to import C[++] code there is excellent documentation 
in the Writing R Extensions manual, which you should review before 
attempting to import C++ code into R.

HTH,

Chuck


>
> Sorry, meant to fill in the blanks!
>
> the following takes about 15 mins to complete:
> corr <- abs(cor(dat, use="p"))
>
>>
>> Does anyone have any suggestions on ways to speed this up? I'd wondered
>> if using C++ code to do the calculations might speed things up, but I've
>> never written any C/C++ code or attempted to use any within R.
>>
>> I've seen some C++ code here:
>> http://www.alglib.net/statistics/correlation.php
>>
>> I wondered if anyone might be able to help me get this so it can run in
>> R? I've tried the following:
>> 1) downloaded and unzipped
>> http://www.alglib.net/translator/dl/statistics.correlation.cpp.zip
>> 2) moved the contents of the libs dir into the parent dir alongside
>> correlation.cpp (didn't know how to tell R where to look for C libraries)
>> 3) Tried: "R CMD SHLIB correlation.cpp" and got the following as output:
>> -- start output --
>> icpc -I/tools/R/2.7.1/lib/R/include  -I/usr/local/include  -mp  -fpic
>> -g -O2 -c correlation.cpp -o correlation.o
>> ap.h(163): warning #858: type qualifier on return type is meaningless
>>   const bool operator==(const complex& lhs, const complex& rhs);
>>              ^
>>
>> ap.h(164): warning #858: type qualifier on return type is meaningless
>>   const bool operator!=(const complex& lhs, const complex& rhs);
>>              ^
>>
>> ap.h(179): warning #858: type qualifier on return type is meaningless
>>   const double abscomplex(const complex &z);
>>                ^
>>
>> icpc -shared -L/usr/local/lib -o correlation.so correlation.o
>> -- end output --
>> 4) Now this doesn't look brilliant! Any thoughts? Also, I'm assuming I
>> need to do some other work with the C++ code in order to allow me to use
>> it from within my R scripts - any pointers on that?
>>
>> Thanks for any input - I hope I just need a hand over the initial
>> hurdles and then I can get onto that up-hill learning curve!!
>>
>> Nathan
>>
>>
>
>

[snip]


Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901


