[BioC] Select different linear models in voom

Gordon K Smyth smyth at wehi.EDU.AU
Sat Mar 8 07:14:22 CET 2014

On Fri, 7 Mar 2014, Francesco Gatto wrote:

> Thanks for your reply. I can certainly run the regression on the 400 
> covariates.

I assume this means you can run voom(), lmFit() and eBayes().

> However, in my understanding, backward selection still requires ~2^400 
> steps

It takes at most 400 steps.

There are at most 400 covariates to remove.  At each step you remove the 
covariate that produces the least DE amongst those remaining.  At each step, 
one call to summary(decideTests()) will identify which one to remove.
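
A minimal sketch of that loop, under some assumptions that are not part of
the original message: `v` is the object returned by voom() on the full
model, `design` is the full design matrix, and every covariate occupies
exactly one column of `design` (a multi-level factor would need its columns
dropped together). The function name `backward_select` and the arguments
`keep` and `min_de` are hypothetical, chosen for illustration. The sketch
counts DE genes per coefficient directly from decideTests() rather than
reading them off the summary() table, and it reuses the voom weights at
every step instead of re-running voom().

```r
library(limma)

# Hypothetical helper (not a limma function): backward selection that reuses
# the voom output `v` at every step and only shrinks the design matrix.
#   keep   : column names that are never removed (e.g. the intercept)
#   min_de : a covariate is dropped while its number of DE genes is <= min_de
backward_select <- function(v, design, keep = "(Intercept)", min_de = 0) {
  repeat {
    candidates <- setdiff(colnames(design), keep)
    if (length(candidates) == 0) break

    # Refit the current model using the existing voom weights.
    fit <- eBayes(lmFit(v, design))

    # Number of DE genes (up or down) for each coefficient.
    dt <- decideTests(fit)
    n_de <- setNames(colSums(dt != 0, na.rm = TRUE), colnames(design))

    # Candidate covariate with the fewest DE genes.
    worst <- candidates[which.min(n_de[candidates])]

    # Stop once even the weakest remaining covariate has enough DE genes.
    if (n_de[worst] > min_de) break

    design <- design[, colnames(design) != worst, drop = FALSE]
  }
  design
}
```

Each pass removes at most one column, so the loop runs at most once per
covariate. Raising `min_de` above 0 gives a less stringent removal rule.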

> (I honestly doubt that covariates have an independent effect on the 
> response variable).

Independence is not being assumed.

> Is there a more efficient way to operate? And when you suggest to remove 
> a covariate if it results in no DE genes, do you mean that the 
> coefficient of the corresponding factor is not significantly different 
> from 0 for all the genes?

Yes, but you can obviously choose a less stringent criterion for removal if 
that suits your problem.  It's your data -- I've just made a suggestion.


> Best regards,
> /Francesco
> On 7 mar 2014, at 00:29, Gordon K Smyth wrote:
> Dear Francesco,
> If you have 400 covariates and 1000 samples it would appear that you can feasibly use all covariates in a linear model at once.  Does voom() work on your computer with this full model or does R run out of memory?
> If it works, then I would suggest running voom and limma on the full model as usual, then removing covariates one by one from the linear model (without re-running voom) if they result in no DE genes.  In model selection theory, this is called "backward selection".
> Best wishes
> Gordon
> Date: Thu,  6 Mar 2014 00:45:55 -0800 (PST)
> From: "Francesco [guest]"
> To: bioconductor at r-project.org, gatto at chalmers.se
> Subject: [BioC] Select different linear models in voom
> I have recently implemented the approach used in voom to estimate the mean and the variance of each log-cpm at the observational level. My dataset contains ~1000 samples, which come with a fair amount of metadata that may be used as covariates (~400). This allows, in principle, for a better construction of the linear model on which both the fitted mean and the fitted variance are estimated in voom, by simply including more factors.
> So far, I have used AIC weights to estimate how likely various linear models are to explain the data relative to the alternative models. Of course, testing all possible combinations of linear models is computationally infeasible (in principle, 2^400). However, although I found that most genes are well explained by a simple LM, a non-negligible fraction of them depend on additional factors.
> The point is that what makes the expression profile of a certain gene interesting is when the covariates play an important role in determining its mean and variance. Therefore I am reluctant to use the simple LM because this would eliminate all the covariates. On the other hand, I am reluctant to use the more complicated LM because it is clearly more complex than necessary for a large number of genes.
> What is the best way to proceed?
> Thanks!
> -- output of sessionInfo():
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
> locale:
> [1] sv_SE.UTF-8/sv_SE.UTF-8/sv_SE.UTF-8/C/sv_SE.UTF-8/sv_SE.UTF-8
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> other attached packages:
> [1] edgeR_3.4.2  limma_3.18.9
> loaded via a namespace (and not attached):
> [1] tools_3.0.2
> --
