[BioC] yet another question on technical replicates...

James W. MacDonald jmacdon at med.umich.edu
Fri Apr 14 14:58:29 CEST 2006

Hi Elena,

Giorgi, Elena wrote:
> Dear Board,
> I know this topic has been discussed several times, yet I'm still
> confused on how to come up with the right design matrix when technical
> replicates are present, especially when dealing with affy arrays. 
> For one thing, I know that when we have the same number of tech reps per
> biological sample, then we can proceed and use the duplicate correlation
> function, correct?
> On the other hand, when this is not the case, what's the best strategy
> to use? Averaging is not recommended, yet if we have 2-3 arrays per
> sample, we don't have enough degrees of freedom to be able to include
> the technical replication effect, isn't this so?
> One example that came up in our lab was an affy experiment with two
> cell-lines; for group1 we had 5 arrays, one biological replicate each,
> and for group2 we had 4 arrays, 2 tech reps from one sample and 2 tech
> reps from a different sample. 
> We used the following design matrix:
> 1 0 0
> 1 0 0
> 1 0 0
> 1 0 0
> 1 0 0
> 0 1 0
> 0 1 0
> 0 0 1
> 0 0 1
> And, in order to test the differences between the two groups, the
> following contrast: c(-1, 0.5, 0.5).
> Does this sound like a reasonable approach? In general, should we
> include a different column in the design matrix for each tech rep group
> and average the contrast coefficients accordingly? Or is this just
> equivalent to averaging the tech reps?

It is pretty much equivalent to averaging the tech reps. The denominator 
of the t-statistic you will be computing may be slightly different, but 
overall I don't think there will be much difference.
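
To make that concrete, here is a minimal sketch in plain Python (not limma itself) for a single gene. The expression values are made up for illustration, and the variable names are hypothetical. It uses an ordinary, unmoderated variance estimate, so it only illustrates the idea: the two approaches give the same estimated difference, but different residual degrees of freedom and standard errors.

```python
from statistics import mean

# Hypothetical log2 expression values for one gene (made up for illustration).
group1 = [7.1, 6.8, 7.4, 7.0, 6.9]   # 5 arrays, one biological replicate each
sample_a = [8.0, 8.2]                # group 2, sample A: 2 technical replicates
sample_b = [7.6, 7.5]                # group 2, sample B: 2 technical replicates


def sum_sq(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


# Approach 1: three-column indicator design with contrast c(-1, 0.5, 0.5).
# For an indicator design the least-squares coefficients are simply the
# means of the arrays belonging to each column.
coef = [mean(group1), mean(sample_a), mean(sample_b)]
contrast = [-1.0, 0.5, 0.5]
est_design = sum(c * b for c, b in zip(contrast, coef))
df_design = 9 - 3                    # 9 arrays, 3 coefficients
s2_design = (sum_sq(group1) + sum_sq(sample_a) + sum_sq(sample_b)) / df_design
# c' (X'X)^-1 c with (X'X)^-1 = diag(1/5, 1/2, 1/2)
se_design = (s2_design * (1 / 5 + 0.25 / 2 + 0.25 / 2)) ** 0.5

# Approach 2: average the technical replicates first, then compare groups.
group2_avg = [mean(sample_a), mean(sample_b)]
est_avg = mean(group2_avg) - mean(group1)
df_avg = 7 - 2                       # 7 values, 2 group means
s2_avg = (sum_sq(group1) + sum_sq(group2_avg)) / df_avg
se_avg = (s2_avg * (1 / 5 + 1 / 2)) ** 0.5

print("estimate:", est_design, est_avg)
print("df:", df_design, df_avg)
print("standard error:", se_design, se_avg)
```

The numerator of the t-statistic is the same either way; only the denominator and its degrees of freedom change.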

With these data you are going to have to violate some assumptions in 
order to analyze them the way you want. When you fit a linear model to 
these data without using the block argument and estimating the 
intra-block correlation (which is what duplicateCorrelation() does), you 
are assuming that all the arrays are independent (among other things), 
which is obviously not true since some are technical replicates. This 
will likely result in a variance estimate that is smaller than it should 
be, which may give you more 'significant' genes than you should really 
see.
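
As a rough illustration of why the variance comes out too small, here is a short Python sketch of the variance of a group mean when arrays within the same biological sample are correlated. The per-array variance and the correlation value are hypothetical numbers chosen for the example, and the function name is made up; this is textbook arithmetic, not anything limma computes in this form.

```python
def variance_of_group_mean(sigma2, n_blocks, block_size, rho):
    """Variance of the mean of n_blocks * block_size arrays when arrays in
    the same block share correlation rho and blocks are independent."""
    n = n_blocks * block_size
    return sigma2 / n * (1 + (block_size - 1) * rho)


sigma2 = 1.0   # hypothetical per-array variance
rho = 0.8      # hypothetical correlation between technical replicates

# Group 2 in the example: 2 biological samples x 2 technical replicates.
assumed = variance_of_group_mean(sigma2, 2, 2, 0.0)  # all 4 arrays treated as independent
actual = variance_of_group_mean(sigma2, 2, 2, rho)   # correlation taken into account

print("assumed:", assumed)
print("actual:", actual)
print("ratio:", actual / assumed)
```

With blocks of two arrays the true variance is larger than the assumed one by a factor of 1 + rho, so the more alike the technical replicates are, the more an analysis that ignores the correlation understates the variance.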

The other alternative is to average the technical replicates from the 
start and then fit the model. Again, the variance estimate will be off, 
because for the technical replicates you will be calculating it from 
means, which are much less variable than the underlying data. As 
above, you will likely have more significant genes than if you weren't 
violating assumptions.

In statistics we sometimes have to fit a model knowing that we are 
violating one or more of the underlying assumptions. The trick is to 
know that you are violating assumptions, and to understand what that 
means for your results.



> Thanks so much,
> Elena

James W. MacDonald, M.S.
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109

