[BioC] Limma User's Guide Example of design matrices

Thu Apr 27 17:51:49 CEST 2006

Mike White wrote:
> I am working my way through the Limma User's Guide and had a question  
> about the design matrices for the example in section 8.4 (2 groups,  
> same reference).
> I understand the difference between the two design matrices in terms  
> of what you can extract directly from the linear model and what has  
> to be obtained by contrasts and how you directly construct the  
> matrices using cbind as in the manual. I have two questions, one of  
> which may be trivial (i.e., stupid), and the other not. I will preface  
> this by admitting that my knowledge of statistics beyond the very  
> basics is relatively weak.
> 
> The non-trivial question:
> 
> I realize that more than one design matrix can be set up to analyze  
> the same set of data (as in the example), and that similar results  
> should be obtainable with each design. If you are eventually  
> obtaining the same information from each design (i.e., identifying  
> differentially expressed genes) what is the benefit of one design  
> over the other? Could one design produce a different level of  
> statistical confidence that a given set of genes is differentially  
> regulated? Is there any rule of thumb for choosing one design matrix  
> over another?

The results (fitted values, and the test statistic for any given 
comparison) will be the same for any reasonably specified design matrix. 
What differs is what each coefficient estimates, and therefore how you 
set up the comparisons. Really, the only rule of thumb that I know is to 
use whatever design matrix makes the most sense to you.
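
If you want to convince yourself of this, here is a minimal sketch in 
base R for a single gene. The expression values are made up, and I am 
using lm() rather than limma just to keep it self-contained. It fits 
both parameterisations and compares the MU - WT estimate and its 
t-statistic.

```r
# Same five-array layout as in the question; y holds made-up values for one gene
Group <- factor(c("WT", "WT", "MU", "MU", "MU"), levels = c("WT", "MU"))
y <- c(0.1, 0.3, 1.2, 0.9, 1.5)

fit1 <- lm(y ~ Group)      # factor effects model (intercept + MU-vs-WT)
fit2 <- lm(y ~ 0 + Group)  # cell means model (one mean per group)

# Model 1: MU - WT is the second coefficient
est1 <- unname(coef(fit1)["GroupMU"])
t1 <- summary(fit1)$coefficients["GroupMU", "t value"]

# Model 2: MU - WT has to be built as a contrast of the two means
cvec <- c(GroupWT = -1, GroupMU = 1)
est2 <- sum(cvec * coef(fit2))
se2 <- sqrt(drop(t(cvec) %*% vcov(fit2) %*% cvec))
t2 <- est2 / se2

all.equal(est1, est2)  # same estimate
all.equal(t1, t2)      # same t-statistic
```

Both comparisons should come back TRUE: the two designs span the same 
space, so only the bookkeeping differs.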

For instance, I almost always use a cell means model (design matrix 
without an intercept term). The downside is that you cannot make any 
comparison without specifying a contrast, whereas with a factor effects 
model (where there is an intercept) some comparisons come directly from 
the coefficients. The upside for me is that I don't have to figure out 
each time which level is being used as the baseline.
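
To illustrate the baseline issue, here is a small sketch (the variable 
names g_default and g_wt are just mine). If you do not give the levels 
explicitly, factor() sorts them, so MU rather than WT ends up as the 
baseline of the factor effects model:

```r
# Levels are sorted when not given explicitly, so MU comes first
g_default <- factor(c("WT", "WT", "MU", "MU", "MU"))
levels(g_default)                    # "MU" "WT"
colnames(model.matrix(~ g_default))  # "(Intercept)" "g_defaultWT"

# relevel() makes WT the baseline again
g_wt <- relevel(g_default, ref = "WT")
colnames(model.matrix(~ g_wt))       # "(Intercept)" "g_wtMU"
```

In the first case the second coefficient estimates WT - MU, in the 
second case MU - WT, which is exactly the sort of thing I would rather 
not have to keep track of.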

As an example, using the two design matrices below, the first model is a 
factor effects model where WT is used as the baseline, so the second 
coefficient gives the difference between MU and WT. For this you don't 
need a contrast, and for this simple comparison it is probably easier. 
If you had two factors and were interested in the interaction, then you 
would have to do the algebra to figure out the contrasts.

The second model simply computes the mean for each factor level (hence 
'cell means model'), so you have to explicitly compute the contrast of 
interest. However, in this case it would be easier (IMO) to figure out 
an interaction if you have two factors.
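
As a rough sketch of what I mean, suppose there were a second factor. 
The treatment factor (U/S), the 2x2 layout, and the numbers below are 
all hypothetical and not from the User's Guide example. With one mean 
per cell, the interaction is just a difference of differences that you 
can write down directly:

```r
# Hypothetical 2x2 layout, two arrays per cell, made-up values for one gene
strain <- rep(c("WT", "MU"), each = 4)
treat <- rep(rep(c("U", "S"), each = 2), times = 2)
cell <- factor(paste(strain, treat, sep = "."),
               levels = c("WT.U", "WT.S", "MU.U", "MU.S"))
y <- c(1.0, 1.2, 1.5, 1.7, 1.1, 1.3, 2.9, 3.1)

# Cell means model: one column, and so one coefficient, per cell
design <- model.matrix(~ 0 + cell)
colnames(design) <- levels(cell)
means <- lm.fit(design, y)$coefficients

# Interaction: is the S - U effect different in MU than in WT?
interaction_cm <- unname((means["MU.S"] - means["MU.U"]) -
                         (means["WT.S"] - means["WT.U"]))
interaction_cm
```

In limma, the same expression in terms of the column names is what you 
would typically hand to makeContrasts().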

> 
> The trivial (?) question
> 
> I set up the two types of design matrices using the factor Group and  
> the model.matrix function as in the manual:
> 
>  > Group <- factor(c("WT","WT","MU","MU","MU"),levels=c("WT","MU"))
>  > Group
> [1] WT WT MU MU MU
> Levels: WT MU
>  > design <- model.matrix(~Group)
>  > design
>    (Intercept) GroupMU
> 1           1       0
> 2           1       0
> 3           1       1
> 4           1       1
> 5           1       1
> attr(,"assign")
> [1] 0 1
> attr(,"contrasts")
> attr(,"contrasts")$Group
> [1] "contr.treatment"
> 
>  > design2 <- model.matrix(~0+Group)
>  > design2
>    GroupWT GroupMU
> 1       1       0
> 2       1       0
> 3       0       1
> 4       0       1
> 5       0       1
> attr(,"assign")
> [1] 1 1
> attr(,"contrasts")
> attr(,"contrasts")$Group
> [1] "contr.treatment"
> 
> 
> I have not been able to find a clear explanation of what the tilde  
> (~)  does in model.matrix to produce the design matrix, especially in  
> the context of "~0+Group." Any idea as to where  I can get an  
> explanation of how this works? (The 2445-page R manual wasn't any  
> help!).

The tilde is used to specify a model, separating the right hand side 
(explanatory variables) from the left hand side (dependent variable). So 
if you were fitting a model as above, but for just one gene, you would 
do something like

lm(gene_expression_values ~ Group)

However, when you are using model.matrix, you are only specifying the 
right hand side of that equation (i.e., the design matrix), so you just 
use the tilde followed by your explanatory variables.
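
A small sketch of that relationship, again with made-up expression 
values: the design matrix that lm() builds internally from the 
two-sided formula should be the same matrix you get from model.matrix() 
with only the right hand side.

```r
Group <- factor(c("WT", "WT", "MU", "MU", "MU"), levels = c("WT", "MU"))
gene_expression_values <- c(0.1, 0.3, 1.2, 0.9, 1.5)  # made-up values

# Two-sided formula: response ~ explanatory variables
fit <- lm(gene_expression_values ~ Group)
X_from_lm <- model.matrix(fit)     # the design matrix lm() used

# One-sided formula: just the explanatory variables
X_direct <- model.matrix(~ Group)

# Compare the numeric entries only, ignoring attributes
same_design <- isTRUE(all.equal(X_from_lm, X_direct, check.attributes = FALSE))
same_design
```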

As for '~ 0 + Group' versus '~ Group', the first instance means that you 
don't want an intercept term, whereas the second means you do (as that 
is the default).
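
For what it's worth, there are several equivalent spellings. This short 
sketch just lists the column names each one produces for the Group 
factor from the question:

```r
Group <- factor(c("WT", "WT", "MU", "MU", "MU"), levels = c("WT", "MU"))

# With an intercept (the default; the 1 can be written explicitly)
with_int <- colnames(model.matrix(~ Group))       # "(Intercept)" "GroupMU"
with_int_1 <- colnames(model.matrix(~ 1 + Group)) # same as above

# Without an intercept: 0 + and - 1 mean the same thing
no_int <- colnames(model.matrix(~ 0 + Group))     # "GroupWT" "GroupMU"
no_int_m1 <- colnames(model.matrix(~ Group - 1))  # same as above
```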

For a more complete explanation, see ?formula.

Best,

Jim

> 
> Thanks for your help!
> 
> Mike White
> 
> 
> 
> Michael M. White, Ph.D.
> Department of Pharmacology & Physiology
> MS #488
> Drexel University College of Medicine
> 245 N. 15th Street
> Philadelphia, PA 19102-1192

-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
