[BioC] Computing large correlations in R

Paul [guest] guest at bioconductor.org
Mon Sep 30 08:57:07 CEST 2013


I have two lists, A and B. Each contains 100 data frames, and each data frame has dimensions 15000 x 15000. I would like to compute one correlation coefficient per pair of corresponding data frames, taken over all of their entries: take the first data frame from each list and compute cor(A, B) to get a single value for that pair, then do the same for the second data frame from each list, and so on for all 100 pairs.



I tried the following:

      A  # list of 100 data frames
      B  # list of 100 data frames

      C <- A[1]  # one-element list holding the first data frame of A
      D <- B[1]  # one-element list holding the first data frame of B

      C <- unlist(C)  # flatten into a single vector
      D <- unlist(D)  # flatten into a single vector

Then I computed

      Correlation <- cor(C, D)  # a single correlation coefficient between the two vectors


But I end up with an error saying

      R cannot allocate a vector of size 3.9 GB


Is there a better, faster way to do this that could be applied to the entire list? I work on a server that lets me run large computations, but I still get this error, and the unlisting takes ages because of the size of the data frames.
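
To make the question concrete, here is a minimal sketch of one possible direction, not a tested solution for data of this size. It computes the same single Pearson coefficient as cor(unlist(a), unlist(b)) but walks through the data frames one column at a time, so the whole table is never flattened into one long vector. The function name cor_all_entries is hypothetical, and the sketch assumes every column is numeric, there are no missing values, and both data frames have identical dimensions. The small A and B below are toy stand-ins for the real lists.

```r
# Hypothetical helper (not from the original post): Pearson correlation over
# all entries of two equally sized data frames, computed column by column.
# Assumes numeric columns and no NA values.
cor_all_entries <- function(x, y) {
  stopifnot(identical(dim(x), dim(y)))
  n <- as.numeric(nrow(x)) * ncol(x)

  # Pass 1: overall mean of each data frame, summing one column at a time.
  mx <- sum(vapply(x, function(col) sum(as.numeric(col)), numeric(1))) / n
  my <- sum(vapply(y, function(col) sum(as.numeric(col)), numeric(1))) / n

  # Pass 2: accumulate centred sums of squares and cross-products.
  sxx <- 0
  syy <- 0
  sxy <- 0
  for (j in seq_len(ncol(x))) {
    dx <- as.numeric(x[[j]]) - mx
    dy <- as.numeric(y[[j]]) - my
    sxx <- sxx + sum(dx^2)
    syy <- syy + sum(dy^2)
    sxy <- sxy + sum(dx * dy)
  }
  sxy / sqrt(sxx * syy)
}

# Toy stand-ins for the real lists: 3 pairs of 5 x 4 data frames.
set.seed(1)
A <- replicate(3, as.data.frame(matrix(rnorm(20), 5, 4)), simplify = FALSE)
B <- replicate(3, as.data.frame(matrix(rnorm(20), 5, 4)), simplify = FALSE)

# One coefficient per pair of corresponding data frames.
correlations <- mapply(cor_all_entries, A, B)
print(correlations)
```

Only a few columns are held as temporary vectors at any time, which is the point of the two-pass layout. Whether this is fast enough for 100 pairs of 15000 x 15000 data frames is an open question; storing the data as matrices rather than data frames may also help.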

   

 -- output of sessionInfo(): 

R version 3.0.1 (2013-05-16)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.0.1

--
Sent via the guest posting facility at bioconductor.org.


