[BioC] RNAseq expression analysis using DESeq: technical replicates, paired samples

Thu Apr 15 10:59:36 CEST 2010

Dear Jinfeng

On Wed, 14 Apr 2010 22:28:17 +0000 (UTC), Jinfeng Liu <jinfengl at gene.com>
wrote:
> I'm trying to use DESeq for RNAseq expression analysis. I haven't been
> able to find information about how to deal with the following issues:
> 
> 1) technical replicates
> We have two biological samples, two libraries (of different insert size)
> were prepared for each of them. so I have four lanes of data in total
and I
> want to do differential expression between the two samples. It doesn't
look quite
> right to me to set up the condition vector as 
> 
> conds <- c( "Sample1", "Sample1","Sample2","Sample2") since they are
only 
> technical replicates, not biological. But I'm not sure what to do.

If you set up your test this way, DESeq will assume that the variance
between the replicates is all there is. Hence, roughly speaking, it will
call a difference significant if it is larger than the fluctuations
observed between the technical replicates. This then only tells you that
the gene might be typically different between different samples, but you
won't know whether the difference is really due to the difference in
treatment or whether you would have observed the same magnitude of
difference between two samples that have been treated the same way.

Of course, without biological replicates, there is no way to settle this
question properly. 

The best thing you can do is to add up the counts from each sample, and
compare just one data column with summed data from Sample 1 with one data
column for Sample 2. Call DESeq's 'estimateVarianceFunctions' function
with
the argument 'pool=TRUE', and it will ignore the sample labels and
estimate
the variance between the conditions. Hence, it will only call those genes
differentially expressed that have a much stronger difference between
conditions than the other genes of similar expression strength. You might
find only few differentially expressed genes, but these are the only ones
for which you can be somewhat sure that they are proper hits.

> 2) Paired samples
> We have samples from three patients. For each patient, we have matched
> tumor and adjacent normal samples. How should we set up the analysis to
capture the
> pair information?

Sorry, but DESeq does not support paired tests (yet). I have some ideas on
how to add this but this might take a while.

For now, your best option is to use DESeq's 'getVarianceStabilizedData' to
transform your data to a scale on which it is approximately homoskedastic.
Then, you can use a pair-wise t-test or a pair-wise z-test. (Don't do this
with the raw data, use DESeq's variance-stabilizing transcformation to
make
them homoskedastic first.)

The pairwise t-test should work out of the box with R's standard 't.test'
function. A pair-wise z-test should have more power in this setting, as,
after the variance-stabilizing transformation, you may assume that all
data
has the same variance. Estimate this variance from your genes in a pooled
fashion (ask again if you don't know how to do that) and take the median.
Divide the pair differences by the square root of this to get z scores,
then use 'pnorm' to get a p value. In my experience, this should work
reasonably well even though it may not have as much power as a proper NB
test would have.

Cheers
  Simon

+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: sanders at fs.tum.de