[BioC] choosing explanatory variables for linear model in limma

Thu Sep 3 02:35:23 CEST 2009

Dear list members,

short version of my question: 
How can I determine, whether it improves the model quality of a linear
model (in limma), when I introduce additional explanatory variables? Is
there an equivalent to feature selection (as in machine learning) for
choosing the explanatory variables?

The complete story: 
We analyse a dataset of about ninety single channel microarray chips and
we want to search for differentially expressed genes and enriched gene
sets. The chips are annotated with information (at least 20 factors,
could be extended to 50) like the organ from which the RNA was
extracted, the experimenter that did the lab work, the labelling kit she
used and a huge amount of features describing e.g. the genotype of the
individual or different aspects of the disease.

We would like to build one linear model (resp. one design matrix) with
all of the factors of interest mentioned above as explanatory variables
in order to test various contrasts. Of course, we have to include all
the variables that we possibly want to test in the linear model. But
what about the ``technical'' factors like the ``labelling kit'' that was
used? One never might want to test a contrast using this explanatory
variable, however the net chip intensity could be influenced by a
technical factor like this. So how can I determine, if it makes sense to
include this variable? 

I am using the standard procedure as described in the limma guide: 
 designMatrix <- model.matrix(~0 + var1 + var2, data=someTable)
 fitBoth <- lmFit(eset, designMatrix)
where var1 and var2 are variables like ``diseaseOutcome'' and
``labellingKit''. 

We thought, that maybe an anova table could help us here, showing us the
influences of var1 and var2. As far as I read
(e.g. http://data.princeton.edu/R/linearModels.html) the anvoa function
can be simply applied to a lm object or can be used to compare two lm
instances. Of course, in that case it is only applied to one linear model
and not one per gene as in the limma setting. 
So, if I try anova for one or two limma fit objects (MArrayLM), R
complains that there is no applicable method and other anova variants
(like anova.lm) do not work neither. This holds as well, when I want to
do an anova for just one extracted linear model for one gene
(like anova(lmFit[1,])).

Our ultimo ratio so far is, to build a design matrix with and another
one without a certain explanatory variable. Then we would determine the
top DEGs and compare for each DEG their fitted linear models in an anova
table. Finally we could check for how many of the top DEGs the
additional variable would make a difference.  
However, this does not seem to be the golden path...or are we completely
on the wrong track?

-- 
Best regards,
Andre J. Aberer