[BioC] Select different linear models in voom

Gordon K Smyth smyth at wehi.EDU.AU
Fri Mar 7 00:29:19 CET 2014

Dear Francesco,

If you have 400 covariates and 1000 samples it would appear that you can 
feasibly use all covariates in a linear model at once.  Does voom() work 
on your computer with this full model or does R run out of memory?

It it works, then I would suggest running voom and limma on the full model 
as usual, then removing covariates one by one from the linear model 
(without re-running voom) if they result in no DE genes.  In model 
selection theory, this is called "backward selection".

Best wishes

> Date: Thu,  6 Mar 2014 00:45:55 -0800 (PST)
> From: "Francesco [guest]" <guest at bioconductor.org>
> To: bioconductor at r-project.org, gatto at chalmers.se
> Subject: [BioC] Select different linear models in voom
> I have recently implemented the approach used in voom to estimate the 
> mean and the variance of each log-cpm at the observational level. My 
> dataset contains ~1000 samples, that features a discrete amount of 
> metadata that may be used as covariates (~400). This allows, in 
> principle, for a better construction of the linear model on which both 
> the fitted mean and the fitted variance are estimated in voom, by simply 
> including more factors.
> So far, I have used the AIC weights to test the probability for various 
> linear models to be more likely to explain the data than the alternative 
> models. Of course, testing all possible combinations of linear models is 
> computationally infeasible (in principle, 2^400). However, even if I 
> detected most gene are well explained by a simple LM, a non negligible 
> fraction of them depend on additional factors.
> The point is the what makes the expression profile of a certain gene 
> interesting, is when the covariates play an important role in 
> determining its mean and variance. Therefore I am reluctant to use the 
> simple LM because this would eliminate all the covariates. On the other 
> hand, I am reluctant to use to more complicated LM because it clearly 
> unnecessarily fits a large amount of genes.
> What is the best way to proceed?
> Thanks!
> -- output of sessionInfo():
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
> locale:
> [1] sv_SE.UTF-8/sv_SE.UTF-8/sv_SE.UTF-8/C/sv_SE.UTF-8/sv_SE.UTF-8
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> other attached packages:
> [1] edgeR_3.4.2  limma_3.18.9
> loaded via a namespace (and not attached):
> [1] tools_3.0.2
> --

The information in this email is confidential and intend...{{dropped:4}}

More information about the Bioconductor mailing list