[R] cor() alternative for huge data set

Thu Sep 30 18:53:42 CEST 2010

Hi Jyotasana,

if I understand your aim correctly, you want to find correlated sets
(clusters) of genes, and then find those clusters that are
differentially expressed? You can do that with WGCNA, or you can just
use projectiveKMeans (from WGCNA) to split your probes into blocks and
then feed each block into coXpress. The correlations between probes in
different blocks should be very small and can be treated as zero.
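
To make the block idea concrete, here is a minimal sketch in base R only. The data are simulated and the block labels are assigned arbitrarily; in a real analysis the labels would come from the pre-clustering step (with WGCNA, from the result of projectiveKMeans; check its help page for the exact return value). All object names below are made up for illustration.

```r
# Simulated expression data: samples in rows, probes in columns
# (the layout cor() needs in order to correlate probes).
set.seed(1)
n_samples <- 12
n_probes  <- 300
expr <- matrix(rnorm(n_samples * n_probes), nrow = n_samples,
               dimnames = list(NULL, paste0("probe", seq_len(n_probes))))

# Hypothetical block labels, one per probe. Here they are arbitrary;
# in practice they would come from the pre-clustering step.
block <- rep(1:3, each = n_probes / 3)

# One correlation matrix per block. Cross-block correlations are never
# computed or stored, which is where the memory saving comes from.
block_cor <- lapply(split(seq_len(n_probes), block), function(idx) {
  cor(expr[, idx, drop = FALSE])
})

sapply(block_cor, dim)
```

With k blocks of roughly equal size, the total number of stored correlations drops from p^2 to about p^2/k, and each element of block_cor can be passed on to the downstream analysis separately.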

Peter

On Thu, Sep 30, 2010 at 5:05 AM, Jyotasana Gulati <jgulati at ice.mpg.de> wrote:
> Peter, many thanks for suggesting this package. I very much believe that it will help me. But I was trying to correlate all probes (correlation between entities, not variables) to calculate differentially coexpressed gene sets using the package coXpress in R. I could not reduce the number of probes on the basis of intensity, since most of the genes are down-regulated or up-regulated in the treated conditions; they are of interest to me and cannot be removed from the control samples (since I have to compare both).
>
> Can you also suggest an alternative for differential coexpression analysis? What I need to know most is which sets behave differently across conditions.
>
> Has anyone ever used this package, coXpress?
>
> Regards
> ..
> Jyotasana
> ----- Original Message -----
> From: "Peter Langfelder" <peter.langfelder at gmail.com>
> To: "Jyotasana Gulati" <jgulati at ice.mpg.de>
> Cc: r-help at r-project.org
> Sent: Thursday, September 30, 2010 4:05:44 AM
> Subject: Re: [R] cor() alternative for huge data set
>
> On Wed, Sep 29, 2010 at 1:27 PM, Jyotasana Gulati <jgulati at ice.mpg.de> wrote:
>> Hi,
>>
>> I have a data set of around 43000 probes (rows) and have to calculate a correlation matrix. When I run the cor function in R, it throws an error about running out of RAM, which is to be expected for such a huge number of rows. I cannot find a sensible way to cut down this huge number of entities. Is there an alternative to Pearson correlation, or another dist() method (e.g. Euclidean), that can be run on such a huge data set?
>> Any help will be appreciated.
>
> Hmm... Are you calculating a correlation of 43000 probes, or of some
> number of samples across 43000 probes? If the former, read below. If
> the latter, I'm surprised you are running out of memory. Issuing
> garbage collection (gc()) before the calculation, closing all other
> programs, removing all other large objects from the R workspace etc.
> may help.
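
A quick way to tell the two cases apart: cor() correlates the columns of its argument. The sketch below uses a small simulated matrix (sizes and the name x are made up) with probes in rows, as in the original question.

```r
# Simulated data: 500 probes in rows, 6 samples in columns.
set.seed(3)
x <- matrix(rnorm(500 * 6), nrow = 500)

# cor() works column-wise, so this is the small samples-by-samples matrix.
dim(cor(x))

# Transposing first gives the probes-by-probes matrix, which is the one
# that becomes very large when there are tens of thousands of probes.
dim(cor(t(x)))
```
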
>
> If you really need the 43k times 43k correlation matrix of your 43k
> probes, read on.
> [Disclosure: this is a shameless plug for the package WGCNA (Weighted
> Gene Co-expression Network Analysis, also known as Weighted
> Correlation Network Analysis), from the package author, namely me.]
>
> First, since the distance matrix will be huge, you will not gain
> anything by using other distance methods either.
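
As a rough back-of-the-envelope check of "huge", assuming the result is stored as double-precision numbers (8 bytes each, which is what cor() returns) and counting only the result itself, not any working copies:

```r
n_probes         <- 43000
bytes_per_double <- 8

# Full square matrix, as returned by cor().
full_gib <- n_probes^2 * bytes_per_double / 2^30

# Lower triangle only, which is how a "dist" object stores its values.
dist_gib <- n_probes * (n_probes - 1) / 2 * bytes_per_double / 2^30

round(c(full = full_gib, dist = dist_gib), 1)
# full: about 13.8 GiB, dist: about 6.9 GiB
```

So switching to a dist()-style measure at best halves the storage and does not change the order of magnitude.
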
>
> Second, depending on what you want to do with the 43k probes, the
> package WGCNA may help you. It has methods for creating correlation
> networks among a large number of probes. The idea is to pre-cluster
> the probes using what I call projective K-means, function
> projectiveKMeans. The pre-clustering will return what we call blocks
> of probes (or genes). We assume (this is a big assumption) that
> correlations among probes belonging to different blocks can be
> neglected. Then we treat each block separately for network
> construction (or, in your case, possibly simple calculation of
> correlation).
>
> Although this isn't strictly an R topic but rather a microarray analysis
> issue, in my experience it is often useful to filter out probes before
> actually calculating and interpreting large correlation matrices. In
> conjunction with filtering, it can be advantageous to only keep one
> probe per gene (presumably there is more than one probe per gene in
> your data set). The filtering criterion varies from analysis to
> analysis, but if your data represent intensities, it is often a good
> idea to throw away probes whose intensity is always low, because such
> signals are mostly noise.
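
One possible reading of "always low" is sketched below, on simulated data: drop a probe only if its maximum intensity over all samples stays under a cutoff. The cutoff value and all object names are assumptions made for illustration; a real threshold should be chosen from the data's own intensity distribution. Because the maximum is taken over every sample (treated and control together), a probe that is high in only one condition is kept.

```r
# Simulated log-scale intensities: samples in rows, probes in columns.
set.seed(2)
n_samples <- 10
n_probes  <- 1000
expr <- matrix(rnorm(n_samples * n_probes, mean = 8, sd = 2),
               nrow = n_samples)

# Make the first 200 probes uniformly low, to mimic noise-level signals.
expr[, 1:200] <- rnorm(n_samples * 200, mean = 3, sd = 0.3)

threshold <- 5  # hypothetical cutoff, not a recommended value

# Keep a probe if it exceeds the cutoff in at least one sample.
max_intensity <- apply(expr, 2, max)
keep          <- max_intensity >= threshold
expr_filtered <- expr[, keep, drop = FALSE]

dim(expr_filtered)
```
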
>
> If you decide to check out WGCNA, look at
> http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/.
>
> Peter
>