[BioC] Linear modeling for affy experiment
YUK FAI LEUNG
yfleung at mcb.harvard.edu
Thu Jun 10 23:06:28 CEST 2004
I am going to do an affy experiment for the first time. I have a few
questions about the linear model design for my experiment.
I have a 3x2 factorial experiment: three biological samples (wild type
whole animal (WA), wild type tissue (WT), mutant tissue (MT)) and two
time points (t1 and t2). The effects of interest are the mutant-specific
(M) and tissue-specific (T) expression and their changes over time (Ti).
I suppose the model should have the following equations (error term
omitted) and I would have 3 affy biological replicates for each condition.
WA.t1 = mu
WT.t1 = mu + T
MT.t1 = mu + T + M + T*M
WA.t2 = mu + Ti
WT.t2 = mu + T + Ti + T*Ti
MT.t2 = mu + T + M + Ti + T*M + M*Ti + T*Ti + T*M*Ti
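The equations above can be written as a design matrix with one row per array and one column per coefficient. The following is only a minimal sketch in plain Python, not limma code; the names `COEFS`, `CONDITIONS`, `design_matrix`, `matrix_rank` and `residual_df` are made up for illustration. It relies on the standard relation that the residual degrees of freedom equal the number of arrays minus the rank of the design matrix. Because the rank, not the number of columns, enters that relation, the result can differ from a count based on the number of coefficients written down.

```python
from fractions import Fraction

# Coefficients and conditions exactly as in the equations above.
COEFS = ["mu", "T", "M", "Ti", "T:M", "M:Ti", "T:Ti", "T:M:Ti"]
CONDITIONS = {
    "WA.t1": ["mu"],
    "WT.t1": ["mu", "T"],
    "MT.t1": ["mu", "T", "M", "T:M"],
    "WA.t2": ["mu", "Ti"],
    "WT.t2": ["mu", "T", "Ti", "T:Ti"],
    "MT.t2": ["mu", "T", "M", "Ti", "T:M", "M:Ti", "T:Ti", "T:M:Ti"],
}


def design_matrix(replicates):
    """One 0/1 row per array; each condition is repeated `replicates` times."""
    rows = []
    for terms in CONDITIONS.values():
        row = [1 if coef in terms else 0 for coef in COEFS]
        rows.extend(list(row) for _ in range(replicates))
    return rows


def matrix_rank(rows):
    """Rank by Gaussian elimination with exact fractions (no rounding)."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column; it adds nothing to the rank
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor != 0:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


def residual_df(replicates):
    """Number of arrays minus the rank of the design matrix."""
    x = design_matrix(replicates)
    return len(x) - matrix_rank(x)


x = design_matrix(3)
print("arrays:", len(x))
print("columns:", len(COEFS))
print("rank:", matrix_rank(x))
print("residual df:", residual_df(3))
```

With three replicates per condition this reports 18 arrays, 8 columns, rank 6 and 12 residual degrees of freedom. The rank cannot exceed 6 here because there are only six distinct conditions. Calling `residual_df(4)` shows how the count changes when every condition gets one more replicate.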
My questions are:
1. How many degrees of freedom do I have in the model? How do I calculate
degrees of freedom in a linear model in general? For my case, is it 3
arrays * 6 conditions - 8 coefficients to be estimated = 10 degrees of
freedom?
2. If I want to increase my degrees of freedom, is it true that I can do
it by increasing the number of replicates? If so, is there a difference
between replicating a sample with more coefficients (e.g. MT.t2) and a
sample with fewer coefficients (e.g. WA.t1)? It seems to me that
replicating a sample with more coefficients is better, but I don't know
how to state it statistically.
3. What is the formal way to determine whether an interaction term is
meaningful/significant in the model or not? Is it by the p-value? And
should I remove the term and refit the model if it is not significant
and is deemed unimportant by biological knowledge? Or should I just fit
the full model once and go ahead to interpret the contrasts of interest?
Is there a formal way (e.g. the diagnostics people use to assess ANOVA
models) of evaluating the quality of the whole fitted model? Or need I
not worry about this at all?
5. I have some confusion about the multiple hypothesis testing
adjustment for many contrasts. (I know it is better to use the
p-values/B/moderated t only for ranking genes, but I am just curious to
know.) For example, in limma one would extract the contrast of interest
and list the candidate genes with topTable using the FDR adjustment
option etc., but isn't it true that this adjustment applies only to that
one contrast? When I evaluate all possible contrasts, how can I adjust
for multiple hypothesis testing over the genes in all the contrasts that
I have made?
6. A minor question. What do M and A in the topTable output for a
coefficient/contrast mean for affy data? If A stands for the log2
intensity estimate for that coefficient/contrast, is M the log2 ratio of
(mu + (coefficient or contrast estimate))/mu?
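To make question 5 concrete: one conceivable way to adjust over all contrasts at once is to pool the p-values from every contrast into a single vector and apply one Benjamini-Hochberg (FDR) adjustment to it, instead of adjusting each contrast separately. The following is only a sketch of that idea in plain Python. The function names `bh_adjust` and `adjust_across_contrasts` are made up for illustration and are not limma's API, and the p-values are invented purely to show the mechanics.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def adjust_across_contrasts(pvals_by_contrast):
    """Pool all contrasts into one vector, adjust once, then split back."""
    names = list(pvals_by_contrast)
    pooled = [p for name in names for p in pvals_by_contrast[name]]
    adjusted = bh_adjust(pooled)
    result, start = {}, 0
    for name in names:
        n = len(pvals_by_contrast[name])
        result[name] = adjusted[start:start + n]
        start += n
    return result


# Invented p-values for three genes in two hypothetical contrasts.
example = {"M": [0.001, 0.20, 0.03], "T": [0.004, 0.50, 0.01]}

separate = {name: bh_adjust(p) for name, p in example.items()}
pooled = adjust_across_contrasts(example)

print("adjusted per contrast:", separate)
print("adjusted over all contrasts:", pooled)
```

Comparing the two printed results shows how much the adjusted values for the same gene change once the other contrasts are counted as additional tests.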
Thanks a lot for answering my questions. Any other advice for my design
is also welcome.
Yuk Fai Leung
Department of Molecular and Cellular Biology
BL 2079, 16 Divinity Avenue
Cambridge, MA 02138
email: yfleung at mcb.harvard.edu; yfleung at genomicshome.com