[BioC] limma design matrix

Gordon Smyth smyth at wehi.edu.au
Mon Mar 1 03:12:41 MET 2004


Dear Michael,

Thanks for your comments on limma and for your email. I definitely agree 
with you that constructing the design and contrasts matrices are the 
hardest part of using limma - the amount of time I spend advising people on 
this makes me very aware of it. You're not the only person to have asked 
about factorial models, so I'll try to address some general issues.

Firstly, let me say that the factorial example in the User's Guide is out 
of data. The analysis would be rather simpler now using newer features of 
the designMatrix() and contrast.fit() functions. I won't go through the 
newer analysis here because the Weaver experiment is a direct design while 
you have a common reference design - a bit simpler.

Secondly, why doesn't limma allow you to use factorial formula like 'data ~ 
c*b*t' to construct your design matrix? There are a number of reasons.
a) ANOVA is not as simple as it appears. In you were to use the aov() 
function in the data example that you give below, you would not get out the 
interactions and F-statistics that you expect. There just isn't enough data 
to use the factorial model approach. Even if there was more data, 
microarray data is usually unbalanced because of quality weights or missing 
data so there won't be a unique anova table.
b) The formula aov() approach doesn't apply without further development to 
direct comparison two-color designs or to designs including dye-swaps where 
the dye-swap is not a factor to be fitted.
c) The most important reason though relates to the interpretation of the 
factorial model. The division of sums of squares into mean effects and 
interactions is a generally useful technique but the interactions very 
often do not correspond to contrasts of interest in themselves. A factorial 
ANOVA analysis usually proceeds in several steps. One first looks at the 
interactions to get an idea of the complexity of the data. If there aren't 
any interactions, then one can interpret the main effects directly. If 
there are, then one has to look at the data in more detail to determine 
which specific contrasts are causing the interactions. I can't see how one 
can realistically go through this process for each of 20,000 ANOVA tables. 
I think that it is best to instead specify the contrasts of interest in 
advance. Unfortunately this means that you will need to do something 
different for each analysis - it is hard to automate it.

Thirdly, note that if you really do have a formula which corresponds to a 
model that you want to fit, you can always use it in limma. Just use

   design <- model.matrix( formula )

and then input that design matrix to lmFit().

In your three way factorial example, I really think that you may want to 
neglect the interactions between animal and infection & time, i.e., you 
want a model like

   data ~ animal + infection*time

However you need to know what all the fitted parameters mean, which you 
probably won't if you just let aov() or model.matrix() make a design matrix 
for you! Here is a suggestion. Suppose you have a targets file like this:

 > targets <- readTargets("temp.txt")
 > targets
         Cy3               Cy5
1 Reference Animal1UninfEarly
2 Reference  Animal1UninfLate
3 Reference   Animal1InfEarly
4 Reference    Animla1InfLate
5 Reference Animal2UninfEarly
6 Reference  Animal2UninfLate
7 Reference   Animal2InfEarly
8 Reference    Animal2InfLate

Now specify what you'd like the parameters in your linear model to mean:

 > parameters <- cbind(
+   UninfEarly=c(-1,0.5,0,0,0,0.5,0,0,0),
+   UninfLate=c(-1,0,0.5,0,0,0,0.5,0,0,),
+   InfEarly=c(-1,0,0,0.5,0,0,0,0.5,0),
+   InfLate=c(-1,0,0,0,0.5,0,0,0,0.5,),
+   AnimalDiff=c(0,-0.25,-0.25,-0.25,-0.25,0.25,0.25,0.25,0.25)
+ )
 > rownames(parameters) <- c("Reference",targets$Cy5)
 > parameters
                   UninfEarly UninfLate InfEarly InfLate AnimalDiff
Reference               -1.0      -1.0     -1.0    -1.0       0.00
Animal1UninfEarly        0.5       0.0      0.0     0.0      -0.25
Animal1UninfLate         0.0       0.5      0.0     0.0      -0.25
Animal1InfEarly          0.0       0.0      0.5     0.0      -0.25
Animla1InfLate           0.0       0.0      0.0     0.5      -0.25
Animal2UninfEarly        0.5       0.0      0.0     0.0       0.25
Animal2UninfLate         0.0       0.5      0.0     0.0       0.25
Animal2InfEarly          0.0       0.0      0.5     0.0       0.25
Animal2InfLate           0.0       0.0      0.0     0.5       0.25

This means you want five coefficients. The first corresponds to the average 
log-ratio of uninfected animals at the early time versus the common 
reference. The second is the average log-ratio of uninfected animals at the 
later time, etc. The fifth coefficient represents the difference between 
your two animals, a nuisance parameter probably. There are other ways to do 
this, the coefficients I've chosen are only one choice.

Now compute the design matrix. You don't need to try to understand the 
detailed entries in the design matrix to interpret the coefficients.

 > design <- designMatrix(targets,parameters)
Found unique target names:
  Reference Animal1UninfEarly Animal1UninfLate Animal1InfEarly 
Animla1InfLate Animal2UninfEarly Animal2UninfLate Animal2InfEarly 
Animal2InfLate
Warning message:
number of parameters should be one less than number of targets in: 
designMatrix(targets, parameters)
 > design
   UninfEarly UninfLate InfEarly InfLate AnimalDiff
1          1         0        0       0       -0.5
2          0         1        0       0       -0.5
3          0         0        1       0       -0.5
4          0         0        0       1       -0.5
5          1         0        0       0        0.5
6          0         1        0       0        0.5
7          0         0        1       0        0.5
8          0         0        0       1        0.5
 > fit <- lmFit(MA, design)

Now you can start asking quesions. For example, what is the effect of 
infection at the early time? What is the effect of infection at the late 
time? Does the effect of infection change between the late and early times? 
(This is usually called interaction.)

cont.matrix <- 
cbind(InfvsUninfEarly=c(-1,0,1,0,0),InfvsUninfLate=c(0,-1,0,1,0),EarlyvsLateEffect=c(1,-1,-1,1,0))
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)

Note that you need a recent version of limma for this to work.

Regards
Gordon

At 03:42 AM 28/02/2004, michael watson (IAH-C) wrote:
>Hi
>
>First off, let me say that i think limma is a quite brilliant package and 
>I use it a lot.  However, one of the biggest obstructions to using limma 
>for the lay biologist is an inability to understand the design 
>matrix.  Although there is a lot of documentation, showing the design 
>matrix for a number of example problems, there is no discussion as to how 
>that design matrix was constructed i.e. the logical thought processes that 
>went into it.
>
>For the two-sample experiment given in the UserGuide, I understand that 
>there must be one row per array in my design matrix and the columns 
>represent the coefficients I want to calculate.  I represent the 
>differences in the factors with 1's and 0's.  Great, this is pretty 
>similar to how I do it for aov().
>
>But then, all of a sudden, for the factorial experiment the design matrix 
>not only has -1's in there, but also a column for the interactions.  How 
>do I decide which array/factor combination gets a 1, a 0 or a -1?
>
>Let me put this in perspective.  I have a 3 factor experiment where the 
>factors are animal, infected/uninfected and time.  All samples were 
>hybridised against a common reference.  For analysis of variance, all I do 
>is set up a data.frame that looks like this for each gene:
>
>   data c b t
>1  2.9 1 1 1
>2  2.7 1 0 2
>3  2.8 1 1 1
>4  3.0 1 0 2
>5 -3.0 0 1 1
>6 -3.5 0 0 2
>7 -4.0 0 1 1
>8 -5.0 0 0 2
>
>where data is my data, and c, b and t are my factors, and then feed in 
>something like:
>
>(aov.aov <- aov(data ~ c*b*t, aov.data))
>
>and I get F-statistics for c, b and t and all possible interactions.
>
>Because of the limitations of analysis of variance for my microarray data, 
>I would like to use limma.  Is there any *more* documentation I can look 
>at that will tell me the steps to take to work out what my limma design 
>matrix will look like?
>
>Kind regards
>
>Michael



More information about the Bioconductor mailing list