[BioC] limma: paired analysis & effects on/interpretation of coefficients

Fri Aug 26 15:22:14 CEST 2011

Hi Guido,

On 8/25/2011 4:20 PM, Hooiveld, Guido wrote:
> Hi,
>
> I am doing an analysis of a colleague's dataset, but I have troubles reproducing the results on fold changes he obtained by manually calculating these in Excel.
>
> The design of the experiment is as follows:
> - affymetrix arrays, RMA normalized
> - subjects got 3 treatments (T1, T2, T3), for now we focus on the differentially expressed genes between T2 and T1.
> - since multiple samples were obtained from the same subjects, a paired analysis is performed. Therefore I included 'Subject' as factor in the design matrix, in addition to Treatment.
> - the model includes both 'Treatment' and 'Subject'
> - in the contrast matrix (only) 'Treatment' is defined as contrast of interest.
> - limma runs without any problems.
>
> So far, so good.
>
> However, when comparing the mean of all individual FC, thus the coefficient of contrast 'T2vsT1' from fit2, we noticed this differs from the mean FC as manually calculated in Excel (by subtracting for each subject the T1 expression level from T2 [i.e. T2-T1]; subtract because data is on log2 scale). Example: for probeset 230_at limma coefficient (log FC) for 'T2vsT1' is  -1.1007688, whereas by hand we find the mean log FC is -1.088240375. Not a large difference, but I expected it to be identical... The A value (Ave Expr) is identical between limma and manual calculation.
>
> Next i checked the average group expression values calculated by limma (coefs from object 'fit1' obtained when fitting the model; before doing the contasts). Then i noticed that the mean expression values (coefs) are different from the ones calculated by hand, and this *likely* causes the slightly differences. Moreover, i noticed a coefficient was also calculated for each Subject.
>
> Q1: why do the average expression levels differ between limma (from fit1) and excel? Are these 'adjusted' for 'Subject'. If so, why this wanted?

I think you are misinterpreting the coefficients in your model. You 
don't give the entire design matrix, but I doubt that you have enough 
data to fit the model the way you think you are. An example:

 > library(limma)
 > treat <- factor(rep(1:3, 6))
 > subj <- factor(rep(1:6, each=3))
 > design <- model.matrix(~0+treat+subj)
 > design
    treat1 treat2 treat3 subj2 subj3 subj4 subj5 subj6
1       1      0      0     0     0     0     0     0
2       0      1      0     0     0     0     0     0
3       0      0      1     0     0     0     0     0
4       1      0      0     1     0     0     0     0
5       0      1      0     1     0     0     0     0
6       0      0      1     1     0     0     0     0
7       1      0      0     0     1     0     0     0
8       0      1      0     0     1     0     0     0
9       0      0      1     0     1     0     0     0
10      1      0      0     0     0     1     0     0
11      0      1      0     0     0     1     0     0
12      0      0      1     0     0     1     0     0
13      1      0      0     0     0     0     1     0
14      0      1      0     0     0     0     1     0
15      0      0      1     0     0     0     1     0
16      1      0      0     0     0     0     0     1
17      0      1      0     0     0     0     0     1
18      0      0      1     0     0     0     0     1

Note that there is no column for subj1. This is because the model would 
be over-specified if there were one, and the coefficients couldn't be 
estimated. Now, for any row you are computing a coefficient for 
whichever column has a 1. So if you look at the first row (which is 
subject 1 who got treatment 1), there is only one 1, for treat1.

This means that treat1 is the coefficient for subj1treat1. Now jump to 
row 4. This is subject 2 treatment 1. We are estimating a coefficient 
for treat1 (subj1treat1) and subj2. Because this row corresponds to 
subject 2 treatment 1, this implies that the coefficient 'subj2' is 
estimating the difference between subj2 and subj1.

This is algebraically the same as doing a paired t-test. In the paired 
t-test you estimate the treatment effect by subtracting within subjects, 
and then summing that difference. By doing so, you are eliminating the 
inter-subject differences.

Here we are estimating the treatment effect for subj1 and subtracting 
off the differences between subj1 and all the other subjects. In the end 
we have a treatment effect that is corrected for any intrinsic 
differences between subjects. Essentially the same thing.

Now if you compare treat1 to treat2, you get the difference between 
treatments, controlled for inter-subject differences, just like in the 
paired t-test.

So long story short, you aren't getting the same estimated coefficient 
because you aren't summing the right things.

> Q2: how to biologically interpret the coefficients that are returned for each subject? Is this a measure of how each of them deviate from the mean?

The coefficients for each subject are simply the difference between that 
subject and subject 1. This is usually considered to be a nuisance 
parameter, as it has no bearing on the experiment at hand, but has to be 
estimated because your results will be biased otherwise.

Best,

Jim

>
> Any insights are appreciated, :)
> Guido
>
>> affy.data<- ReadAffy(cdfname="hugene11stv1hsentrezg")
>> x.norm<- rma(affy.data)
>>
>> targets<- readTargets("targets_FL.txt")
>> Treatment<- as.factor(targets$Treatment)
>> Subject<- as.factor(targets$Subject)
>>
>> design<- model.matrix(~0+Treatment + Subject)
>>
>> fit1<- lmFit(x.norm, design)
>>
>> cont.matrix<- makeContrasts(
> + #compareT2 vs T1 (=TreatmentT2-Treatment T1)
> + T2vsT1=(TreatmentT2-TreatmentT1),
> + T3avsT2=(TreatmentT3a-TreatmentT2),
> + T3bvsT2=(TreatmentT3b-TreatmentT2),
> + levels=design)
>>
>>
>> fit2<- contrasts.fit(fit1, cont.matrix)
>> fit2<- eBayes(fit2)
>> topTable(fit2,coef=1)
>               ID      logFC   AveExpr          t      P.Value    adj.P.Val
> 5076     230_at -1.1007688 10.479305 -11.944841 5.355630e-18 1.050882e-13
> 2396  129804_at  0.5968911  7.007159  11.576944 2.140830e-17 2.100368e-13
> 19071   9415_at -1.6726710  9.445871 -11.450699 3.457049e-17 2.261141e-13
> 7193  285754_at -1.0808572  7.133550 -10.945659 2.395265e-16 1.174997e-12
> 11167     51_at -0.5467982  9.673153 -10.853833 3.416245e-16 1.340671e-12
> 9286    3952_at -0.6370817 10.000592 -10.484375 1.438641e-15 4.704836e-12
> 13983   6319_at -1.0432746 12.180506 -10.236676 3.802185e-15 1.065807e-11
> 9494    4017_at -0.5091157  9.288021 -10.198771 4.414266e-15 1.082709e-11
> 8582     368_at -0.5483289  7.093417 -10.059173 7.658057e-15 1.669627e-11
> 13660    595_at -0.5011958  9.767599  -9.970447 1.087936e-14 2.134748e-11
>               B
> 5076  30.42098
> 2396  29.09107
> 19071 28.63062
> 7193  26.76836
> 11167 26.42637
> 9286  25.04037
> 13983 24.10246
> 9494  23.95835
> 8582  23.42630
> 13660 23.08709
>
>
> ###########
> ## Output fit1
> ###########
>
>> fit1<- lmFit(x.norm, design)
>> fit1
> An object of class "MArrayLM"
> $coefficients
>               TreatmentT1 TreatmentT2 TreatmentT3a TreatmentT3b    Subject7
> 100009676_at    7.084174    7.101001     7.029655     7.090152 -0.21206182
> 10000_at        9.295713    9.338025     9.339492     9.359302 -0.06561657
> 10001_at        7.604867    7.636235     7.655008     7.665815  0.02216949
> 10002_at        4.375210    4.299118     4.393576     4.374674 -0.01616572
> 100033413_at    4.492079    4.453446     4.413175     4.522281 -0.39793930
>                  Subject12    Subject20    Subject24   Subject25     Subject26
> 100009676_at -0.120245077 -0.051913214  0.152116448 -0.30216929 -0.0817054427
> 10000_at      0.006550643 -0.095030864 -0.006489267 -0.13069982  0.0515787097
> <snip>
>
> $design
>    TreatmentT1 TreatmentT2 TreatmentT3a TreatmentT3b Subject7 Subject12
> 1           1           0            0            0        0         0
> 2           0           0            1            0        0         0
> 3           1           0            0            0        0         0
> 4           0           1            0            0        0         0
> 5           0           0            1            0        0         0
>    Subject20 Subject24 Subject25 Subject26 Subject29 Subject30 Subject32
> 1         0         0         0         0         0         0         0
> 2         0         0         0         0         0         0         0
> 3         0         0         0         0         1         0         0
> 4         0         0         0         0         1         0         0
> 5         0         0         0         0         1         0         0
>    Subject34 Subject40 Subject43 Subject45 Subject47 Subject48 Subject51
> 1         0         0         0         0         0         0         0
> 2         0         0         0         0         0         0         0
> 3         0         0         0         0         0         0         0
> 4         0         0         0         0         0         0         0
> 5         0         0         0         0         0         0         0
>    Subject52 Subject61 Subject64 Subject65 Subject66 Subject69 Subject71
> 1         0         0         0         0         0         1         0
> 2         0         0         0         0         0         1         0
> 3         0         0         0         0         0         0         0
> 4         0         0         0         0         0         0         0
> 5         0         0         0         0         0         0         0
> <snip>
>
> #######
> ## contrast matrix
> #######
>> cont.matrix
>                Contrasts
> Levels         T2vsT1 T3avsT2 T3bvsT2
>    TreatmentT1      -1       0       0
>    TreatmentT2       1      -1      -1
>    TreatmentT3a      0       1       0
>    TreatmentT3b      0       0       1
>    Subject7          0       0       0
>    Subject12         0       0       0
>    Subject20         0       0       0
>    Subject24         0       0       0
>    Subject25         0       0       0
>    Subject26         0       0       0
> <snip>
>
>> sessionInfo()
> R version 2.12.0 (2010-10-15)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] genefilter_1.32.0               annaffy_1.22.0
>   [3] KEGG.db_2.4.5                   GO.db_2.4.5
>   [5] RSQLite_0.9-4                   DBI_0.2-5
>   [7] AnnotationDbi_1.12.1            qvalue_1.26.0
>   [9] multtest_2.8.0                  limma_3.6.9
> [11] gcrma_2.23.0-1                  hugene11stv1hsentrezgcdf_13.0.0
> [13] affy_1.29.1                     Biobase_2.10.0
>
> loaded via a namespace (and not attached):
> [1] affyio_1.19.2         annotate_1.28.0       Biostrings_2.18.0
>   [4] IRanges_1.8.9         MASS_7.3-8            preprocessCore_1.12.0
> [7] splines_2.12.0        survival_2.35-8       tcltk_2.12.0
> [10] tools_2.12.0          xtable_1.5-6
>>
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues