[BioC] Efficiently running DEXSeq for Large Cohorts

Tue Jan 15 23:25:13 CET 2013

Dear Fong Chun Chan,

If you want to source the code, you need to attach the libraries 
required directly to your environment. Before calling the TRT functions, 
just do:

library(statmod)

I think that should work,

Best regards,
Alejandro

> Hi Alejandro,
>
> Thanks for the reply. I am trying to use the TRT functions from the 
> DEXSeq svn repository in Bioconductor. When I tried to run:
>
> source('/bioconductor.svn/DEXSeq/R/TRT.R')
> estimateDispersionsTRT( exonCounSet, nCores = 48 )
>
> I got a series of errors complaining about "glmnb.fit":
>
> In addition: Warning message:
> In FUN(c(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L, 90L, 100L, 110L,  :
>   Unable to estimate dispersions for ENSG00000187634:ENSE00001631320
> Error : could not find function "glmnb.fit"
> In addition: Warning message:
> In FUN(c(7L, 17L, 27L, 37L, 47L, 57L, 67L, 77L, 87L, 97L, 107L,  :
>   Unable to estimate dispersions for ENSG00000187634:ENSE00001884257
> Error : could not find function "glmnb.fit"
> ....
>
> According to this post, 
> http://seqanswers.com/forums/showthread.php?t=16040, glmnb.fit is 
> supposed to come with statmod. According to my sessionInfo(), I should 
> have the latest versions:
>
> > sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8  LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8  LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils datasets  methods
> [8] base
>
> other attached packages:
> [1] optparse_1.0.0     DEXSeq_1.4.0 Biobase_2.18.0     BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.14.0 getopt_1.19.0  hwriter_1.3  RCurl_1.95-3   
> statmod_1.4.16
> [6] stringr_0.6.2  tools_2.15.2   XML_3.95-0.1
> >
>
> Am I doing something wrong here? Thanks for your help,
>
> Fong
>
>
> On Mon, Jan 14, 2013 at 1:44 AM, Alejandro Reyes 
> <alejandro.reyes at embl.de <mailto:alejandro.reyes at embl.de>> wrote:
>
>     Sorry, there was an error in part of my code:
>
>     The "TRT" would be like this,
>
>     data(pasillaExons, package="pasilla")
>
>     pasillaExonsTRT <- estimateSizeFactors( pasillaExons )
>     pasillaExonsTRT <- estimateDispersionsTRT( pasillaExonsTRT )
>     pasillaExonsTRT <- fitDispersionFunction( pasillaExonsTRT )
>     pasillaExonsTRT <- testForDEUTRT( pasillaExonsTRT )
>
>     Bests,
>     Alejandro
>
>
>         Dear Fong Chun Chan,
>
>         Thank you for your interest in DEXSeq and sorry in advance for
>         the long e-mail. We have also noticed that the computing time
>         increases considerably when you have a large number of
>         samples, conditions or number of exons of a gene. For users in
>         these situations, we have implemented a variant of this
>         functions (estimateDispersionsTRT and testForDEUTRT) in the
>         most recent versions of DEXSeq in the svn.
>
>         The difference relies on how the model matrix is prepared, in
>         the "normal" functions, the model matrices used to fit the
>         glms are prepared for each exon, such that each exon bin is
>         treated individually, independently of which exon you are
>         testing. For example, if you have a gene with 5 exons, when
>         testing for exon E001, you would consider independently E002,
>         E003, ... , E005 in the model.
>
>         In the "TRT" implementation the same model matrix is used for
>         all the exons. In the same example as before, you would
>         consider E001 and the sum of all the rest exons of the same
>         gene. This reduces the model and allows to use DEXSeq with a
>         large number of samples. For more clarity, you could try to
>         compare the normal model frame of a gene with the TRT model frame:
>
>         data(pasillaExons, package="pasilla")
>         modelFrameForGene(pasillaExons, "FBgn0000256")
>         # vs
>         modelFrameForTRT( pasillaExons )
>
>         Using the same example, in the last model frame, "this" would
>         be the "E001" and "others" would be the sum of E002 + E003 +
>         ... + E005.
>
>         This would be the "normal" DEXSeq analysis:
>
>         pasillaExons <- estimateSizeFactors( pasillaExons )
>         pasillaExons <- estimateDispersions( pasillaExons )
>         pasillaExons <- fitDispersionFunction( pasillaExons )
>         pasillaExons <- testForDEU( pasillaExons )
>
>         This would be the "TRT",
>
>         pasillaExonsTRT <- estimateSizeFactors( pasillaExons )
>         pasillaExonsTRT <- estimateDispersionsTRT( pasillaExons )
>         pasillaExonsTRT <- fitDispersionFunction( pasillaExons )
>         pasillaExonsTRT <- testForDEUTRT( pasillaExons )
>
>         And you can see that you get the same results:
>
>         plot(fData(pasillaExons)$pvalue,
>         fData(pasillaExonsTRT)$pvalue, log="xy")
>
>         I have the "TRT" tried this for large cohorts with complex
>         models and it works nicely and in reasonable computing times.
>
>         Best regards,
>         Alejandro Reyes
>
>         ps. this changes need to be added to the vignette.
>
>
>             Hi all,
>
>             I've been trying to get DEXSeq to run on a fairly large
>             RNA-seq cohort that
>             I have. To be specific, I have 89 samples and I am attempt
>             to generate DE
>             exon usage results on > 500,000 exons.
>
>             I've followed the latest tutorial (1.5.6) on Bioconductor
>             and it so far
>             I've had relatively no problems. It just the two steps
>             that are mentioned,
>             estimateDispersions and testForDEU, are taking a fairly
>             long time. I've
>             already attempted to parallelize this on a 48-core 256GB
>             machine, but I get
>             very little progress on the run-time of these functions.
>
>             I was just wondering if anyone has a good way of running
>             DEXSeq on such a
>             large cohort. Tips on how to reduce run time? Are there
>             way to parallelize
>             these jobs across a cluster rather than rely on a single
>             machine with
>             multi-cores? Any help would be greatly appreciated.
>
>             Thanks,
>
>             Fong
>
>                 [[alternative HTML version deleted]]
>
>             _______________________________________________
>             Bioconductor mailing list
>             Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>             https://stat.ethz.ch/mailman/listinfo/bioconductor
>             Search the archives:
>             http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>         _______________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>         Search the archives:
>         http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>