[BioC] limma - Considering Batch Effect

Fri Aug 10 16:31:41 CEST 2007

Hi Nathan,

Nathan S. Haigh wrote:
> James MacDonald wrote:
>> Hi Nathan,
>>
>>
>> Nathan Haigh wrote:
>>> I know this topic has arisen a few times from searching the mail
>>> archive, but I'm still not clear on what/how to do this.
>>>
>>> Here is my experimental setup:
>>> Array    Param1   Param2   Rep   Batch
>>> 1        H        4        1     1
>>> 2        H        4        2     2
>>> 3        H        4        3     2
>>> 4        H        20       1     1
>>> 5        H        20       2     2
>>> 6        H        20       3     2
>>> 7        H        37       1     1
>>> 8        H        37       2     2
>>> 9        H        37       3     2
>>> 10       L        4        1     1
>>> 11       L        4        2     2
>>> 12       L        4        3     2
>>> 13       L        20       1     1
>>> 14       L        20       2     2
>>> 15       L        20       3     2
>>> 16       L        37       1     1
>>> 17       L        37       2     2
>>> 18       L        37       3     2
>>>
>>> All rep 1's were processed in batch 1 and reps 2 and 3 were processes in
>>> batch 2. From using GeneSpring, and conducting a cluster analysis on
>>> experimental parameters, there is a clear batch effect. In GeneSpring, I
>>> normalised the arrays, such that gene1 from array1 was divided by the
>>> median of gene1 over arrays from batch1. Likewise, gene1 from array2 was
>>> divided by the median of gene1 across arrays in batch2. These
>>> successfully removed the batch effect. I'd like to be able to remove the
>>> batch effect doing something similar, or accounting for it in limma.
>>>
>>> In essence, I want to remove the variability in the reps caused by the
>>> batch effect.
>>>
>>> Could anyone explain how this can/should be done?
>> Sure. Just add the Batch effect to your design matrix. For instance,
>> if you want to model the differences in Param1, you would do something
>> like:
>>
>>> design <- model.matrix(~0+factor(rep(1:2,
>> each=9))+factor(rep(rep(1:2,1:2),6)))
>>> colnames(design) <- c("H","L","Batch")
>>> design
>>    H L Batch
>> 1  1 0     0
>> 2  1 0     1
>> 3  1 0     1
>> 4  1 0     0
>> 5  1 0     1
>> 6  1 0     1
>> 7  1 0     0
>> 8  1 0     1
>> 9  1 0     1
>> 10 0 1     0
>> 11 0 1     1
>> 12 0 1     1
>> 13 0 1     0
>> 14 0 1     1
>> 15 0 1     1
>> 16 0 1     0
>> 17 0 1     1
>> 18 0 1     1
>> attr(,"assign")
>> [1] 1 1 2
>> attr(,"contrasts")
>> attr(,"contrasts")$`factor(rep(1:2, each = 9))`
>> [1] "contr.treatment"
>>
>> attr(,"contrasts")$`factor(rep(rep(1:2, 1:2), 6))`
>> [1] "contr.treatment"
>>
>> Then you can do the contrast between H and L.
>>
>> Best,
>>
>> Jim
>>
>>
>>
> 
> Hi Jim,
> 
> Thanks very much for your help! I'm still trying to get my head around
> the creation of the design matrix - it doesn't seem too intuitive to a
> newbie. Although I'm sure all will become clear as soon as the penny drops!

You really need to read the Limma User's Guide. In addition, you might 
look at a linear modeling textbook. This is a perennial subject on this 
listserv, as setting up a design matrix (especially for 2-color stuff) 
is non-trivial if you know nothing of linear modeling.

> 
> Just to clarify something about the experimental design and hopefully
> help me understand the construction of the design matrix. Param2 is a
> temperature treatment, while Param1 is a "sample type". So it makes
> sense to do contrasts between H.20 and H.4 to get cold-shock genes in H
> samples, H.20 and H.37 to get heat-shock genes in H samples. How would I
> set up the matrix for this? Also, in the context of this experimental
> design, what does a contrast between H and L actually show?

It gets pretty complicated if you set up a design matrix with both 
Param1 and Param2 and a batch effect. It would take more time and effort 
than I am willing to expend to figure out the contrasts matrix.

A simpler (if slightly less powerful) approach if you just want to 
compare the H samples is to set up your design matrix with just Param2 
and a batch effect, and then you can simply compare things directly. 
Something like

 > Param2 <- factor(rep(c(4,20,37), each=3))
 > batch <- factor(rep(rep(1:2,1:2),3))
 > design <- model.matrix(~0+Param2+batch)
 > design
   Param24 Param220 Param237 batch2
1       1        0        0      0
2       1        0        0      1
3       1        0        0      1
4       0        1        0      0
5       0        1        0      1
6       0        1        0      1
7       0        0        1      0
8       0        0        1      1
9       0        0        1      1
attr(,"assign")
[1] 1 1 1 2
attr(,"contrasts")
attr(,"contrasts")$Param2
[1] "contr.treatment"

attr(,"contrasts")$batch
[1] "contr.treatment"

Then your contrasts matrix would be

 > contrast <- makeContrasts(Param220-Param24, Param220-Param237, 
levels=design)
 > contrast
           Contrasts
Levels     Param220 - Param24 Param220 - Param237
   Param24                  -1                   0
   Param220                  1                   1
   Param237                  0                  -1
   batch2                    0                   0

The comparison of H and L that I mentioned in the previous email would 
simply tell you the difference between the H and L samples without 
regard to the Param2. In other words, you would be fitting a t-test 
comparing all H samples to all L samples (adjusted for batch).

Best,

Jim

> 
> Cheers
> Nathan
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623