[BioC] DESeq and number of replicates required for RNA-Seq
naomi at stat.psu.edu
Wed Jun 16 14:53:51 CEST 2010
I was very surprised at the level of biological variation in the data
sets I looked at. The question is:
How can the biological variation in RNA-seq data appear to be so much
higher than in microarray data?
If the variation is artificially low in microarray data, then we have
more false positives than we think. If the variation is artificially
high in RNA-seq data, then it must be due to technical variation,
which ought to show up in the analysis of RNA samples split into
several lanes on the sequencer.
At 02:44 AM 6/15/2010, michael watson (IAH-C) wrote:
>Yes, I have several RNA-Seq datasets that look like they may have
>large biological variation.
>I feel this is the "dirty secret" of the new revolution that is
>RNA-Seq - even with large numbers of replicates, the variation in
>(and nature of) the read counts means we can only find genes that
>are changing by a large amount.
>I wonder if some of the normalisation suggested by Robinson and
>Oshlack will help (http://genomebiology.com/2010/11/3/R25).
>And of course there is cufflinks
>From: Naomi Altman [naomi at stat.psu.edu]
>Sent: 15 June 2010 03:02
>To: michael watson (IAH-C); Naomi Altman; bioconductor at stat.math.ethz.ch
>Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq
>I was working this out for a lecture and here is what I found:
>If there is enough expression for the Normal approximation to hold
>then here is a rule of thumb.
>Suppose that the total number of reads is identical for all samples
>and that there is NO biological variation. If Yi is the number of
>reads for a gene in sample i, then
>Poisson variation alone leads to log(Yi) approx normal with variance
>1/4. (This is what the DESeq vignette calls "shot" variance.)
>Using the formula for a 2 sample t-test, you see that to detect
>2-fold differences (log2(2)=1) with 95% power at alpha=.05 you need
>n > 32 var/(log2 fold)^2, which is approximately 8 biological reps
>per treatment.
>However, that is for NO biological variation. (Have a look at the
>example in the DESeq vignette!) And it assumes alpha=.05 (but we are
>going to use a much smaller alpha due to the multiple comparisons).
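The rule of thumb quoted above can be checked with a few lines of Python. This is only a sketch under the usual Normal approximation for a two-sided, two-sample comparison. The function names are made up for illustration, the variance of 1/4 is taken from the message as given, and the reading of the factor 32 as 2*(2+2)^2 (both Normal quantiles rounded to 2) is my assumption about where it comes from.

```python
from math import ceil
from statistics import NormalDist


def reps_per_group(var, log2_fold, alpha=0.05, power=0.95):
    """Replicates per group under a Normal approximation (illustrative helper).

    var       -- per-sample variance on the log2 scale (taken as an input)
    log2_fold -- difference to detect, in log2 units
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * (z_alpha + z_power) ** 2 * var / log2_fold ** 2)


def rule_of_thumb(var, log2_fold):
    # Assumed origin of the factor: 2 * (2 + 2)^2 = 32, i.e. both quantiles
    # rounded to 2 instead of 1.96 and 1.645.
    return ceil(32 * var / log2_fold ** 2)


print("rule of thumb:        ", rule_of_thumb(0.25, 1.0))
print("exact quantiles:      ", reps_per_group(0.25, 1.0))
print("alpha = 0.001 instead:", reps_per_group(0.25, 1.0, alpha=0.001))
```

Lowering alpha to allow for multiple comparisons pushes the required number of replicates up, even before any biological variation is added to the variance term.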
>At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote:
> >Hi Naomi
> >Thanks for the reply.
> >The issue isn't necessarily low expressing genes, but perhaps high
> >expressing genes with a small (ish) fold change. DESeq seems to
> >only report as significant differences that are high fold changes.
> >Contrast this to limma for microarrays, where small fold changes can
> >be reported as significant.
> >For whatever reason, the transcriptomic community have become
> >fixated on "two-fold" as some kind of standard cut-off. Now, I'm
> >not fixated on that, but the example in DESeq reports 428
> >significant genes with an estimated fold change at FDR 5%. However,
> >NONE of these are in the range -2 : 2. The minimum positive logFC
> >is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is
> >-2.49 (5.65 fold down-regulation).
> >So what I am concerned about is finding genes, either highly or
> >lowly expressed, that are differing by a small fold change - say two-fold.
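For readers converting between the two scales used in this message, here is a small sketch. The helper name is hypothetical. It shows that the range -2 : 2 on the log2 scale corresponds to a four-fold change, and that two-fold corresponds to a log2 fold change of 1.

```python
def fold_change(log2_fc):
    """Magnitude of the fold change for a log2 fold change (illustrative helper)."""
    return 2.0 ** abs(log2_fc)


for log2_fc in (1.0, 2.0, 2.18, -2.49):
    direction = "up" if log2_fc > 0 else "down"
    print(f"log2FC {log2_fc:+.2f} -> {fold_change(log2_fc):.2f}-fold {direction}")
```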
> >From: Naomi Altman [naomi at stat.psu.edu]
> >Sent: 14 June 2010 17:42
> >To: michael watson (IAH-C); bioconductor at stat.math.ethz.ch
> >Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq
> >The issue is a mix of expression level and sample size. For count
> >data, the power is higher when the expression is higher. Also, the
> >p-values are discrete - the lower the total read count, the fewer
> >values are possible, which messes up the FDR estimation.
> >Of course, understanding the problem does not necessarily suggest a
> >solution. But sample sizes will need to be large (or you need to
> >sequence very deeply) if you want to detect differential expression
> >in low expressing genes.
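The point about discrete p-values can be made concrete with a toy calculation. The sketch below assumes a simple exact binomial test: two libraries of equal size, with the gene's total read count split between them with probability 1/2 under the null. This is not DESeq's actual test, and the function name is made up for illustration. It lists every two-sided p-value that can occur for a given total count.

```python
from math import comb


def attainable_p_values(total_reads):
    """All two-sided p-values an exact binomial(total_reads, 0.5) test can give."""
    probs = [comb(total_reads, k) * 0.5 ** total_reads
             for k in range(total_reads + 1)]
    p_values = set()
    for observed in probs:
        # Sum the probability of every split that is no more likely than
        # the observed one (small relative tolerance for float ties).
        p = sum(q for q in probs if q <= observed * (1 + 1e-9))
        p_values.add(min(p, 1.0))
    return sorted(p_values)


for total in (5, 10, 50):
    values = attainable_p_values(total)
    print(f"total reads {total:3d}: {len(values):2d} possible p-values, "
          f"smallest = {values[0]:.3g}")
```

With only 5 reads in total, the smallest p-value this toy test can return is 0.0625, so such a gene can never be called significant at 0.05, whatever the split.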
> >At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote:
> > >Hi
> > >
> > >This follows on slightly from my experimental design thread.
> > >
> > >Having worked through the vignette for DESeq, it seems to work
> > >well. However, for the TagSeqExample.tab data set, when using an
> > >FDR cut off of 0.05, what we see is that we only find differential
> > >expression for large fold changes - an average of log2 fold change
> > >of 5 for up-regulated, and log2 fold change of -5 for
> > >down-regulated. There are very few significant results that even go
> > >as far down as 2 or -2 - which is still a 4-fold change.
> > >
> > >So, the question is, how many replicates must we have to get more
> > >sensitive results? Say, down to a log2FC of 1 (two-fold up- or
> > >down-regulated)?
> > >
> > >I can calculate this by using DESeq's own estimates of variance to
> > >approximate replicates for T and N in the example data, and keep
> > >going until my significant results start to hit a logFC of 1, but I
> > >wanted to know if anyone else had done this yet?
> > >
> > >Thanks
> > >Mick
> > >
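A rough closed-form alternative to the resampling approach described in this message is sketched below. It is not what the message proposes, and it is not a DESeq function. It assumes negative binomial counts with variance mean + dispersion * mean^2, uses the delta method to get an approximate variance on the log2 scale, and plugs that into the Normal-approximation sample-size formula. The function names and the dispersion values are hypothetical, chosen only for illustration.

```python
from math import ceil, log
from statistics import NormalDist


def log2_count_variance(mean_count, dispersion):
    """Delta-method variance of log2(count) for a negative binomial count."""
    var_ln = 1.0 / mean_count + dispersion  # shot noise plus biological part
    return var_ln / log(2) ** 2             # convert natural log to log2


def reps_for_log2_fold(mean_count, dispersion, log2_fold=1.0,
                       alpha=0.05, power=0.95):
    """Approximate replicates per group needed to detect log2_fold."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    var = log2_count_variance(mean_count, dispersion)
    # At least 2 replicates are needed to estimate any variance at all.
    return max(2, ceil(2 * z_sum ** 2 * var / log2_fold ** 2))


# Hypothetical dispersion values, not estimates from the example data.
for dispersion in (0.0, 0.1, 0.5):
    n = reps_for_log2_fold(100, dispersion)
    print(f"mean count 100, dispersion {dispersion}: about {n} reps per group")
```

Under these assumptions the biological dispersion dominates the shot noise as soon as the mean count is moderately large, which is why sequencing deeper helps less than adding replicates for well-expressed genes.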
> > >_______________________________________________
> > >Bioconductor mailing list
> > >Bioconductor at stat.math.ethz.ch
> > >https://stat.ethz.ch/mailman/listinfo/bioconductor
> > >Search the archives:
> > >http://news.gmane.org/gmane.science.biology.informatics.conductor
> >Naomi S. Altman 814-865-3791 (voice)
> >Associate Professor
> >Dept. of Statistics 814-863-7114 (fax)
> >Penn State University 814-865-1348 (Statistics)
> >University Park, PA 16802-2111
Naomi S. Altman 814-865-3791 (voice)
Dept. of Statistics 814-863-7114 (fax)
Penn State University 814-865-1348 (Statistics)
University Park, PA 16802-2111