[BioC] Clustering of 30,000+ genes

Thorsten Forster thorsten.forster at ed.ac.uk
Mon Sep 12 12:19:51 CEST 2011


Hi January

As previous posters have pointed out, it is normal practice to remove 
noise genes prior to measuring co-expression.

However, if you are doing research, standard practice often does not 
apply. We have been working on this computational issue for a while now, 
and if you have a good reason to compute all pairwise similarities, you 
may be able to make use of our SPRINT package. It essentially applies 
brute computing power to run as many correlations as you wish.

Within SPRINT, we have implemented a parallelised version of the basic 
cor() function in addition to a few machine learning algorithms, boot(), 
apply(), and some statistical tests.

Once installed, all you need to do is load the SPRINT library and in 
your script replace cor() with pcor().

Computation time is much reduced through the parallelisation scheme, and 
we get around RAM limits by using the "ff" package, which keeps large 
objects on disk rather than in memory.
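
To give a feel for the numbers, here is a rough sketch in plain base R. 
It is not SPRINT code and not how pcor() is implemented; the matrix 
sizes, the block size and all variable names are assumptions chosen for 
illustration. It first estimates the size of a dense 30,000 x 30,000 
correlation matrix (about 6.7 GiB in doubles), then computes a small 
correlation matrix one block of rows at a time, which is the kind of 
chunking that lets each piece be handed to a separate process or 
written to disk.

```r
# Illustration only: base R, not SPRINT. All sizes and names are assumptions.

# 1) Memory needed for a dense gene-by-gene correlation matrix.
n_genes <- 30000
full_gib <- n_genes^2 * 8 / 2^30          # 8 bytes per double
cat(sprintf("Full %d x %d matrix: %.1f GiB\n", n_genes, n_genes, full_gib))

# 2) Block-wise correlation on a small simulated data set (genes in rows).
set.seed(1)
expr <- matrix(rnorm(200 * 12), nrow = 200)   # 200 genes, 12 arrays
block_size <- 50
starts <- seq(1, nrow(expr), by = block_size)

# Standardise each gene once; a block of correlations is then a matrix product.
z <- t(scale(t(expr))) / sqrt(ncol(expr) - 1)

blocks <- lapply(starts, function(s) {
  rows <- s:min(s + block_size - 1, nrow(expr))
  z[rows, , drop = FALSE] %*% t(z)            # one slice of the full matrix
})

# In a real run each block would be written to disk rather than kept in RAM.
full <- do.call(rbind, blocks)

# Check against the ordinary in-memory result.
stopifnot(isTRUE(all.equal(full, cor(t(expr)), check.attributes = FALSE)))
```

The blocks are independent of each other, so they can be computed in any 
order or in parallel; only the standardised input needs to be shared.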


Caveats:
a) You'll probably need to run the correlation as a separate step; I 
don't know how this fits with your CoXpress or GSCA packages.

b) Someone (with system administrator skills) needs to configure the 
SPRINT package for your high performance computing platform of choice, 
be that a multi-processor desktop or something like the UK supercomputer 
HECToR.

c) There is no Windows version at the moment; you are limited to a 
Unix-based OS (we are just about done with a Mac version).


You can find SPRINT on CRAN, and further information here:

http://www.r-sprint.org/

  Thorsten



On 10/09/2011 11:00, bioconductor-request at r-project.org wrote:
> Subject: Re: [BioC] Clustering of 30,000+ genes
> From: Sean Davis <sdavis2 at mail.nih.gov>
> Date: 09/09/2011 11:08
> To: January Weiner <january.weiner at gmail.com>
> CC: bioconductor at r-project.org
>
>
> Hi, January.
>
> One common way of reducing the number of features is to choose the top
> X% by variance or coefficient of variation.  A large percentage of
> genes are not even expressed in a given tissue type and another large
> percentage do not vary across a sample set.  You can use the
> genefilter package to perform such filtering.
>
> Sean
>
> On Wed, Sep 7, 2011 at 5:29 PM, January Weiner <january.weiner at gmail.com> wrote:
>> Hello,
>>
>> I'm struggling with co-expression analysis, and for that I would like
>> to try to cluster all the genes I have in my microarray set, including
>> those which are not differentially expressed between the study groups.
>> I am using CoXpress at the moment and will try my luck with GSCA as
>> well, but both packages seem to have been laid out for 3000 rather
>> than 30000 genes.
>>
>> How do you do that in R? I get errors about R not being able to
>> allocate enough memory. Clearly, the amount of memory required to
>> calculate all correlations the simple way might be a bit on the large
>> side, but I can think of one or two tricks to get this done; I wonder
>> whether it has been implemented already.
>>
>> Other than that -- how should I reasonably limit the number of genes
>> to study? I don't want to bias the outcome of the analysis by
>> selecting only genes that are DE, actually -- I would be very
>> interested in genes that show differential co-expression, but no
>> differences in expression.
>>
>> Kind regards,
>>
>> j.
>
>
> Subject: Re: [BioC] Clustering of 30,000+ genes
> From: "Tim Triche, Jr." <tim.triche at gmail.com>
> Date: 09/09/2011 12:48
> To: Sean Davis <sdavis2 at mail.nih.gov>
> CC: bioconductor at r-project.org, January Weiner <january.weiner at gmail.com>
>
>
> That said, there are a number of differential coexpression papers out there
> noting that, among the remaining transcripts, calculating (shrunken or
> unshrunken) estimates of the covariance matrices can be... interesting.
>
> 'corpcor', 'glasso', 'huge', and 'WGCNA' may come in handy for the latter
> task, with WGCNA explicitly designed for finding differential coexpression.
>   The authors of one such (throwaway -- no implementation released) paper
> note that they crammed 128GB of physical RAM into the machine used for the
> analyses in the paper, but it's quite possible the authors did not realize
> that filtering could have saved them a lot of time and memory.
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


