[BioC] DESeq and number of replicates required for RNA-Seq

Tue Jun 15 08:44:31 CEST 2010

Thanks Naomi

Yes, I have several RNA-Seq datasets that look like they may have large biological variation.

I feel this is the "dirty secret" of the new revolution that is RNA-Seq - even with large numbers of replicates, the variation in (and nature of) the read counts means we can only find genes that are changing by a large amount.

I wonder if some of the normalisation suggested by Robinson and Oshlack will help (http://genomebiology.com/2010/11/3/R25).

And of course there is Cufflinks.

Thanks
Mick
________________________________________
From: Naomi Altman [naomi at stat.psu.edu]
Sent: 15 June 2010 03:02
To: michael watson (IAH-C); Naomi Altman; bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq

Hi Michael,
I was working this out for a lecture and here is what I found:

If there is enough expression for the Normal approximation to hold
then here is a rule of thumb.

Suppose that the total number of reads is identical for all samples
and that there is NO biological variation.  If Yi is the number of
reads for a gene in sample i, then
Poisson variation alone leads to log(Yi) approx normal with variance
1/4.  (This is what the DESeq vignette calls "shot" variance.)

Using the formula for a 2 sample t-test, you see that to detect
2-fold differences (log2(2)=1) with 95% power at alpha=.05 you need
n > 32 var/(log2 fold)^2, which is approximately 8 biological reps per
treatment.

However, that is for NO biological variation.  (Have a look at the
example in the DESeq vignette!) And it assumes alpha=.05 (but we are
going to use a much smaller alpha due to the multiple comparisons
adjustment).
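
A minimal sketch of the rule of thumb above, assuming the usual normal-approximation sample size formula for a two-sided two-sample comparison, n = 2 (z_{1-alpha/2} + z_{power})^2 var / delta^2, and taking the variance of 1/4 quoted above as given. The function name `reps_per_group` and its parameters are illustrative, not part of DESeq. The formula divides by the squared log fold change; at a log2 fold change of 1 this makes no difference. With exact normal quantiles the constant comes out near 26 rather than 32, so the round figure of 8 replicates is slightly on the conservative side.

```python
from math import ceil
from statistics import NormalDist


def reps_per_group(variance, log2_fold, alpha=0.05, power=0.95):
    """Replicates per group under a normal approximation (illustrative only).

    variance  -- per-sample variance on the log scale (assumed known)
    log2_fold -- log2 fold change to detect (1.0 means two-fold)
    alpha     -- two-sided significance level
    power     -- desired power
    """
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * variance / log2_fold ** 2
    return ceil(n)


# Shot noise only (variance 1/4 as quoted above), two-fold change.
print(reps_per_group(0.25, 1.0))               # alpha = 0.05
# A stricter alpha, as after a multiple comparisons adjustment, needs more.
print(reps_per_group(0.25, 1.0, alpha=0.001))
# Any biological variance adds to the shot variance and scales n with it.
print(reps_per_group(0.25 + 0.5, 1.0))
```

The required n grows linearly with the total variance and with the inverse square of the log fold change, so halving the detectable log2 fold change quadruples the replicates needed.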

--Naomi

At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote:
>Hi Naomi
>
>Thanks for the reply.
>
>The issue isn't necessarily low expressing genes, but perhaps high
>expressing genes with a small (ish) fold change.  DESeq seems to
>only report as significant differences that are high fold changes.
>
>Contrast this to limma for microarrays, where small fold changes can
>be reported as significant.
>
>For whatever reason, the transcriptomic community have become
>fixated on "two-fold" as some kind of standard cut-off.  Now, I'm
>not fixated on that, but the example in DESeq reports 428
>significant genes with an estimated fold change at FDR 5%, however,
>NONE of these are in the range -2 : 2.  The minimum positive logFC
>is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is
>-2.49 (5.65 fold down-regulation).
>
>So what I am concerned about is finding genes, either highly or
>lowly expressed, that are differing by a small fold change - say two-fold.
>
>Thanks
>Mick
>________________________________________
>From: Naomi Altman [naomi at stat.psu.edu]
>Sent: 14 June 2010 17:42
>To: michael watson (IAH-C); bioconductor at stat.math.ethz.ch
>Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq
>
>The issue is a mix of expression level and sample size.  For count
>data, the power is higher when the expression is higher.  Also, the
>p-values are discrete - the lower the total read count, the fewer
>values are possible, which messes up the FDR estimation.
>
>Of course, understanding the problem does not necessarily suggest a
>solution.  But sample sizes will need to be large (or you need to
>sequence very deeply) if you want to detect differential expression
>in low expressing genes.
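
To illustrate the quoted point about discrete p-values, here is a minimal sketch under strong simplifying assumptions: one sample per condition, equal sequencing depth, and an exact conditional binomial test (given the total count, the reads in one sample are Binomial(total, 0.5) under the null). This is not DESeq's actual test, and the function name `achievable_pvalues` is made up for illustration.

```python
from math import comb


def achievable_pvalues(total):
    """Sorted list of every two-sided exact p-value that can occur when
    `total` reads are split between two equally deep samples under the null."""
    probs = [comb(total, k) * 0.5 ** total for k in range(total + 1)]
    pvals = set()
    for k in range(total + 1):
        # Two-sided p-value: total probability of all splits that are
        # no more likely than the observed split k.
        p = sum(q for q in probs if q <= probs[k] * (1 + 1e-9))
        pvals.add(round(min(p, 1.0), 12))
    return sorted(pvals)


for total in (4, 10, 40):
    pv = achievable_pvalues(total)
    print(total, "reads:", len(pv), "possible p-values, smallest =", pv[0])
```

With only 4 reads in total the smallest achievable p-value is 0.125, so such a gene can never reach significance at 0.05 however extreme the split, and the coarse grid of possible values is what distorts FDR estimation for low counts.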
>
>--Naomi
>
>At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote:
> >Hi
> >
> >This follows on slightly from my experimental design thread.
> >
> >Having worked through the vignette for DESeq, it seems to work
> >well.  However, for the TagSeqExample.tab data set, when using an
> >FDR cut off of 0.05, what we see is that we only find differential
> >expression for large fold changes - an average of log2 fold change
> >of 5 for up-regulated, and log2 fold change of -5 for
> >down-regulated.  There are very few significant results that even go
> >as far down as 2 or -2 - which is still a 4-fold change.
> >
> >So, the question is, how many replicates must we have to get more
> >sensitive results?  Say down to log2FC of 1? (two-fold up- or
> >down-regulated)?
> >
> >I can calculate this by using DESeq's own estimates of variance to
> >approximate replicates for T and N in the example data, and keep
> >going until my significant results start to hit a logFC of 1, but I
> >wanted to know if anyone else had done this yet?
> >
> >Thanks
> >Mick
> >
> >_______________________________________________
> >Bioconductor mailing list
> >Bioconductor at stat.math.ethz.ch
> >https://stat.ethz.ch/mailman/listinfo/bioconductor
> >Search the archives:
> >http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>Naomi S. Altman                                814-865-3791 (voice)
>Associate Professor
>Dept. of Statistics                              814-863-7114 (fax)
>Penn State University                         814-865-1348 (Statistics)
>University Park, PA 16802-2111
>
