[BioC] choosing explanatory variables for linear model in limma

Thu Sep 3 15:33:07 CEST 2009

Hi Andre,

If you want to do model selection, then limma is probably not the tool 
for the job.

Instead, what I would do would be to choose some (one, five, ten, 
whatever) genes and use lm() for the model selection process. That way 
you can do all the conventional model selection steps, and once you are 
satisfied with the model you have chosen you can go back to limma and 
fit the model on all the genes.

Best,

Jim

Andre J. Aberer wrote:
> Dear list members,
> 
> short version of my question: 
> How can I determine, whether it improves the model quality of a linear
> model (in limma), when I introduce additional explanatory variables? Is
> there an equivalent to feature selection (as in machine learning) for
> choosing the explanatory variables?
>  
> The complete story: 
> We analyse a dataset of about ninety single channel microarray chips and
> we want to search for differentially expressed genes and enriched gene
> sets. The chips are annotated with information (at least 20 factors,
> could be extended to 50) like the organ from which the RNA was
> extracted, the experimenter that did the lab work, the labelling kit she
> used and a huge amount of features describing e.g. the genotype of the
> individual or different aspects of the disease.
> 
> We would like to build one linear model (resp. one design matrix) with
> all of the factors of interest mentioned above as explanatory variables
> in order to test various contrasts. Of course, we have to include all
> the variables that we possibly want to test in the linear model. But
> what about the ``technical'' factors like the ``labelling kit'' that was
> used? One never might want to test a contrast using this explanatory
> variable, however the net chip intensity could be influenced by a
> technical factor like this. So how can I determine, if it makes sense to
> include this variable? 
> 
> I am using the standard procedure as described in the limma guide: 
>  designMatrix <- model.matrix(~0 + var1 + var2, data=someTable)
>  fitBoth <- lmFit(eset, designMatrix)
> where var1 and var2 are variables like ``diseaseOutcome'' and
> ``labellingKit''. 
> 
> We thought, that maybe an anova table could help us here, showing us the
> influences of var1 and var2. As far as I read
> (e.g. http://data.princeton.edu/R/linearModels.html) the anvoa function
> can be simply applied to a lm object or can be used to compare two lm
> instances. Of course, in that case it is only applied to one linear model
> and not one per gene as in the limma setting. 
> So, if I try anova for one or two limma fit objects (MArrayLM), R
> complains that there is no applicable method and other anova variants
> (like anova.lm) do not work neither. This holds as well, when I want to
> do an anova for just one extracted linear model for one gene
> (like anova(lmFit[1,])).
> 
> Our ultimo ratio so far is, to build a design matrix with and another
> one without a certain explanatory variable. Then we would determine the
> top DEGs and compare for each DEG their fitted linear models in an anova
> table. Finally we could check for how many of the top DEGs the
> additional variable would make a difference.  
> However, this does not seem to be the golden path...or are we completely
> on the wrong track?
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826