[BioC] How to determine if clinical variables are responsible for gene expression with limma

Mon Apr 27 14:52:48 CEST 2009

Hello,

I'm new to the Bioconductor list, and fairly new to Bioconductor itself, 
so excuse me if the following is a stupid question. I've been looking 
around the list and documentation for a while without finding my answer.

The short version of my question is "What is the most appropriate way to 
determine if microarray-derived gene expression is associated with any 
of a number of continuous and discrete clinical variables, independent 
of patient group/treatment type?"

The long version, with my attempt at this analysis, is as follows:

I'm currently analysing a single-channel Agilent microarray data set 
involving 29 patients in three clinical groups. I've been using limma, 
and think I've got the methods right for comparing those groups, like:

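library(limma)

# Clinical group (1, 2 or 3) for each of the 29 patients, in array order
clinical_group <- 
c(3,2,1,1,2,3,3,1,2,1,2,3,1,3,1,2,2,1,1,3,2,3,2,1,1,2,3,2,3)
# One column per group (no intercept), so each coefficient is a group mean
design <- model.matrix(~ 0+factor(clinical_group))
colnames(design) <- c("one", "two", "three")
fit <- lmFit(esetPROC, design)

# Pairwise comparisons between the three groups
comparisons <- c("one-three", "one-two", "three-two")
contrast.matrix <- makeContrasts(contrasts=comparisons, levels=design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)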

... where esetPROC is an expression set object containing normalised 
and corrected expression values.

However, I also have a number of continuous and discrete clinical 
variables associated with these patients. I'm interested in seeing if 
any of these variables are associated with high or low gene expression. 
Referring to this thread...

http://thread.gmane.org/gmane.science.biology.informatics.conductor/11402/focus=11409

... I attempted to do this with a design in limma in the following manner:

design <- model.matrix(~ 0+var1+var2+var3)
fit <- lmFit(esetPROC, design)
fit2 <- eBayes(fit)

... where var1 etc. are continuous clinical variables. When using all the 
variables, I get very few probes significantly associated with the 
variables. However, if I employ only one variable at a time, all 
variables (even nonsensical variables such as the day of the month a 
patient was born) seem to produce hundreds or thousands of probes with 
significant adjusted p-values. I assume this is because I'm 
fundamentally misunderstanding something that's going on here (I'm not 
a mathematician), and misapplying the method.
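
One possible explanation for the single-variable result, offered as a 
sketch rather than a diagnosis: a formula like ~ 0+var1 drops the 
intercept, so the lone coefficient has to account for each probe's 
average expression level as well as any trend with the variable. 
Assuming, as is typical, that the normalised values are log-scale 
intensities well above zero, a test of that coefficient against zero 
can look significant for almost any positive-valued covariate. The 
simulated example below uses base R's lm() on a single made-up probe; 
expr, birth_day and all values are hypothetical and do not come from 
the data set described above.

```r
# Simulated illustration only: none of these values come from the real data.
set.seed(1)
n_patients <- 29

# Hypothetical log2-scale expression for one probe, generated independently
# of any covariate
expr <- rnorm(n_patients, mean = 8, sd = 0.5)

# A nonsensical covariate: day of the month each patient was born
birth_day <- sample(1:28, n_patients, replace = TRUE)

# No intercept: the slope is forced to explain the mean expression level
fit_no_intercept <- lm(expr ~ 0 + birth_day)
p_no_intercept <- summary(fit_no_intercept)$coefficients["birth_day", "Pr(>|t|)"]

# With intercept: the slope only measures association with birth_day
fit_intercept <- lm(expr ~ birth_day)
p_intercept <- summary(fit_intercept)$coefficients["birth_day", "Pr(>|t|)"]

print(c(no_intercept = p_no_intercept, with_intercept = p_intercept))
```

If this is what is happening, the limma analogue would be to keep the 
intercept, e.g. model.matrix(~ var1+var2+var3), and to look at the slope 
coefficients by name, such as topTable(fit2, coef="var1"), rather than 
at the intercept.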

I'd appreciate any pointers as to where I'm going wrong here, and 
where my misconceptions may lie.

Regards,

Jon Manning
