[BioC] DESeq and number of replicates required for RNA-Seq
michael watson (IAH-C)
michael.watson at bbsrc.ac.uk
Tue Jun 15 08:44:31 CEST 2010
Yes, I have several RNA-Seq datasets that look like they may have large biological variation.
I feel this is the "dirty secret" of the new revolution that is RNA-Seq - even with large numbers of replicates, the variation in (and nature of) the read counts means we can only find genes that are changing by a large amount.
I wonder if some of the normalisation suggested by Robinson and Oshlack will help (http://genomebiology.com/2010/11/3/R25).
And of course there is Cufflinks.
From: Naomi Altman [naomi at stat.psu.edu]
Sent: 15 June 2010 03:02
To: michael watson (IAH-C); Naomi Altman; bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq
I was working this out for a lecture and here is what I found:
If there is enough expression for the Normal approximation to hold
then here is a rule of thumb.
Suppose that the total number of reads is identical for all samples
and that there is NO biological variation. If Yi is the number of
reads for a gene in sample i, then
Poisson variation alone leads to log(Yi) approx normal with variance
1/4. (This is what the DESeq vignette calls "shot" variance.)
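A shot-noise variance figure like this can be sanity-checked by computing the exact variance of a transformed Poisson count. The sketch below is illustrative only: the names `poisson_pmf` and `transformed_variance` and the example mean of 100 reads are assumptions, not anything taken from the DESeq vignette. In this calculation the square-root scale gives a variance near 1/4 for any reasonably large mean, while the natural-log scale gives roughly 1/mean, so the value to carry into a power calculation depends on the transformation and on the read depth.

```python
import math


def poisson_pmf(k, mu):
    """Poisson probability P(Y = k), computed in log space to avoid overflow."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def transformed_variance(mu, transform, k_min=0):
    """Variance of transform(Y) for Y ~ Poisson(mu), conditional on Y >= k_min.

    The sum is truncated far in the upper tail, where the remaining
    probability mass is negligible.
    """
    k_max = int(mu + 12 * math.sqrt(mu) + 30)
    ks = range(k_min, k_max + 1)
    weights = [poisson_pmf(k, mu) for k in ks]
    total = sum(weights)
    mean = sum(w * transform(k) for k, w in zip(ks, weights)) / total
    return sum(w * (transform(k) - mean) ** 2 for k, w in zip(ks, weights)) / total


mu = 100.0  # hypothetical expected read count for one gene
var_sqrt = transformed_variance(mu, math.sqrt)
var_log = transformed_variance(mu, math.log, k_min=1)  # log(0) is undefined
print(f"Var(sqrt(Y)) = {var_sqrt:.4f}")
print(f"Var(log(Y))  = {var_log:.4f}, 1/mu = {1 / mu:.4f}")
```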
Using the formula for a 2 sample t-test, you see that to detect
2-fold differences (log2(2)=1) with 95% power at alpha=.05 you need
n > 32 var/(log fold)^2, which is approximately 8 biological reps per treatment.
However, that is for NO biological variation. (Have a look at the
example in the DESeq vignette!) And it assumes alpha=.05 (but we are
going to use a much smaller alpha due to the multiple comparisons).
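The rule of thumb above can be reproduced approximately with the standard normal-approximation sample-size formula for comparing two group means, n = 2 (z_{1-alpha/2} + z_{power})^2 var / delta^2 per group. This is a sketch under stated assumptions: `reps_per_group` is a hypothetical helper, the variance and the effect size delta must be on the same log scale, and normal quantiles stand in for t quantiles, so the answer is slightly optimistic for small n. Rounding both quantiles up to 2 gives a constant of 2*(2+2)^2 = 32, which is consistent with the 32 quoted above.

```python
import math
from statistics import NormalDist


def reps_per_group(variance, log_fold_change, alpha=0.05, power=0.95):
    """Approximate replicates per group for a two-sided two-sample comparison.

    variance        -- per-sample variance on the log scale (assumed known)
    log_fold_change -- difference to detect, on the same log scale
    Uses normal quantiles in place of t quantiles (an approximation).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 * variance / log_fold_change ** 2


# Shot noise only (variance 1/4), two-fold change (log2 fold change of 1)
n_nominal = reps_per_group(0.25, 1.0)
# Same setting with a much smaller alpha, as multiple testing would require
n_strict = reps_per_group(0.25, 1.0, alpha=0.001)

print(f"alpha=0.05 : {n_nominal:.1f} -> {math.ceil(n_nominal)} per group")
print(f"alpha=0.001: {n_strict:.1f} -> {math.ceil(n_strict)} per group")
```

With shot noise alone this lands at about 6.5 replicates per group at alpha=.05 and about 12 at alpha=.001. Any biological variation adds to `variance` and raises both numbers proportionally.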
At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote:
>Thanks for the reply.
>The issue isn't necessarily low expressing genes, but perhaps high
>expressing genes with a small (ish) fold change. DESeq seems to
>only report as significant differences that are high fold changes.
>Contrast this to limma for microarrays, where small fold changes can
>be reported as significant.
>For whatever reason, the transcriptomic community have become
>fixated on "two-fold" as some kind of standard cut-off. Now, I'm
>not fixated on that, but the example in DESeq reports 428
>significant genes with an estimated fold change at FDR 5%, however,
>NONE of these are in the range -2 : 2. The minimum positive logFC
>is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is
>-2.49 (5.65 fold down-regulation).
>So what I am concerned about is finding genes, either highly or
>lowly expressed, that are differing by a small fold change - say two-fold.
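Because the thread switches between log2 fold changes and plain fold changes, a small conversion helper makes the thresholds easier to compare. This is an illustrative sketch only; the function names `log2fc_to_fold` and `fold_to_log2fc` are hypothetical.

```python
import math


def log2fc_to_fold(log2_fc):
    """Magnitude of the fold change for a log2 fold change of either sign."""
    return 2.0 ** abs(log2_fc)


def fold_to_log2fc(fold):
    """log2 fold change corresponding to a fold change greater than zero."""
    return math.log2(fold)


for log2_fc in (1.0, 2.0, 2.18, -2.49):
    direction = "up" if log2_fc > 0 else "down"
    print(f"log2FC {log2_fc:+.2f} = {log2fc_to_fold(log2_fc):.2f}-fold {direction}")
```

A log2 fold change of 1 is two-fold and 2 is four-fold; the 2.18 and 2.49 quoted above come out at about 4.5-fold and 5.6-fold.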
>From: Naomi Altman [naomi at stat.psu.edu]
>Sent: 14 June 2010 17:42
>To: michael watson (IAH-C); bioconductor at stat.math.ethz.ch
>Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq
>The issue is a mix of expression level and sample size. For count
>data, the power is higher when the expression is higher. Also, the
>p-values are discrete - the lower the total read count, the fewer
>values are possible, which messes up the FDR estimation.
>Of course, understanding the problem does not necessarily suggest a
>solution. But sample sizes will need to be large (or you need to
>sequence very deeply) if you want to detect differential expression
>in low expressing genes.
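The point about discrete p-values can be illustrated with a deliberately simplified stand-in for a count-based test. The sketch assumes two samples of equal sequencing depth and, under the null, splits a gene's total read count between them as Binomial(total, 0.5). This is an assumption made for illustration and is not DESeq's actual test; `achievable_pvalues` is a hypothetical helper.

```python
from math import comb


def achievable_pvalues(total):
    """All two-sided exact binomial p-values possible for a given total count.

    Under H0 the reads of a gene are split Binomial(total, 0.5) between two
    equal-depth samples. The p-value of an outcome is the probability of all
    outcomes that are no more likely than it.
    """
    pmf = [comb(total, k) * 0.5 ** total for k in range(total + 1)]
    pvalues = []
    # By symmetry, k and total - k give the same p-value, so half suffices.
    for k in range(total // 2 + 1):
        p = sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9))
        pvalues.append(min(p, 1.0))
    return sorted(pvalues)


low = achievable_pvalues(5)    # weakly expressed gene, 5 reads in total
high = achievable_pvalues(50)  # more strongly expressed gene, 50 reads

print(f"total=5 : {len(low)} possible p-values, smallest = {min(low):.4f}")
print(f"total=50: {len(high)} possible p-values, smallest = {min(high):.2e}")
```

With only 5 reads in total there are three possible p-values and the smallest is 0.0625, so such a gene can never pass a 0.05 threshold in this toy setting, let alone a multiplicity-adjusted one.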
>At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote:
> >This follows on slightly from my experimental design thread.
> >Having worked through the vignette for DESeq, it seems to work
> >well. However, for the TagSeqExample.tab data set, when using an
> >FDR cut off of 0.05, what we see is that we only find differential
> >expression for large fold changes - an average of log2 fold change
> >of 5 for up-regulated, and log2 fold change of -5 for
> >down-regulated. There are very few significant results that even go
> >as far down as 2 or -2 - which is still a 4-fold change.
> >So, the question is, how many replicates must we have to get more
> >sensitive results? Say down to log2FC of 1? (two-fold up or down)
> >I can calculate this by using DESeq's own estimates of variance to
> >approximate replicates for T and N in the example data, and keep
> >going until my significant results start to hit a logFC of 1, but I
> >wanted to know if anyone else had done this yet?
> >Bioconductor mailing list
> >Bioconductor at stat.math.ethz.ch
> >Search the archives:
>Naomi S. Altman 814-865-3791 (voice)
>Dept. of Statistics 814-863-7114 (fax)
>Penn State University 814-865-1348 (Statistics)
>University Park, PA 16802-2111