[BioC] DESeq and number of replicates required for RNA-Seq

Mark Robinson mrobinson at wehi.EDU.AU
Tue Jun 15 13:20:49 CEST 2010


Hi Mick.

I can't speak for cufflinks, but the TMM normalization in that GB paper is really about accounting for 'composition' biases. It can help when the samples have different RNA composition (or some other systematic effect), but it seems to me that the "dirtiness" you mention here is just large biological variation. Genomics studies are generally underpowered anyway, and high biological variation, which is presumably a reality of your experimental system, just makes detecting changes harder.

Naomi: I assume you meant sqrt(Yi), not log(Yi), for the normal approximation to the Poisson?
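
A quick numerical check of the two transformations, as a minimal Python sketch (standard library only; the helper names are illustrative and not taken from DESeq or edgeR). It sums the Poisson probability mass function directly. By the delta method, Var(sqrt(Y)) is roughly 1/4 whatever the mean, whereas Var(log(Y)) is roughly 1/mu, so only the square root stabilizes the variance of Poisson counts. The log is evaluated conditional on Y > 0, which is an assumption of this sketch.

```python
import math

def poisson_pmf(k, mu):
    # Work in log space so large counts do not overflow.
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def transformed_variance(mu, transform, k_min=0):
    # Variance of transform(Y) for Y ~ Poisson(mu), by direct summation
    # over k_min..k_max (the upper tail beyond k_max is negligible).
    k_max = int(mu + 12 * math.sqrt(mu) + 30)
    ks = list(range(k_min, k_max + 1))
    weights = [poisson_pmf(k, mu) for k in ks]
    total = sum(weights)  # renormalize (matters only when k_min > 0)
    mean = sum(w * transform(k) for w, k in zip(weights, ks)) / total
    return sum(w * (transform(k) - mean) ** 2 for w, k in zip(weights, ks)) / total

for mu in (10, 100, 1000):
    v_sqrt = transformed_variance(mu, math.sqrt)
    v_log = transformed_variance(mu, math.log, k_min=1)  # conditional on Y > 0
    print(f"mu={mu:5d}  Var(sqrt Y)={v_sqrt:.4f}  Var(log Y)={v_log:.5f}  1/mu={1 / mu:.5f}")
```

The sqrt column should stay close to 0.25 as mu changes, while the log column should shrink roughly in proportion to 1/mu.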

Cheers,
Mark

On 2010-06-15, at 4:44 PM, michael watson (IAH-C) wrote:

> Thanks Naomi
> 
> Yes, I have several RNA-Seq datasets that look like they may have large biological variation.
> 
> I feel this is the "dirty secret" of the new revolution that is RNA-Seq - even with large numbers of replicates, the variation in (and nature of) the read counts means we can only find genes that are changing by a large amount.
> 
> I wonder if some of the normalisation suggested by Robinson and Oshlack will help (http://genomebiology.com/2010/11/3/R25).
> 
> And of course there is cufflinks
> 
> Thanks
> Mick
> ________________________________________
> From: Naomi Altman [naomi at stat.psu.edu]
> Sent: 15 June 2010 03:02
> To: michael watson (IAH-C); Naomi Altman; bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq
> 
> Hi Michael,
> I was working this out for a lecture and here is what I found:
> 
> If there is enough expression for the Normal approximation to hold
> then here is a rule of thumb.
> 
> Suppose that the total number of reads is identical for all samples
> and that there is NO biological variation.  If Yi is the number of
> reads for a gene in sample i, then
> Poisson variation alone leads to log(Yi) approx normal with variance
> 1/4.  (This is what the DESeq vignette calls "shot" variance.)
> 
> Using the formula for a 2 sample t-test, you see that to detect
> 2-fold differences (log2(2)=1) with 95% power at alpha =.05 you need
> n > 32 var/(log2(fold))^2, which is approximately 8 biological reps per treatment.
> 
> However, that is for NO biological variation.  (Have a look at the
> example in the DESeq vignette!) And it assumes alpha=.05 (but we are
> going to use a much smaller alpha due to the multiple comparisons
> adjustment).
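
The rule of thumb above can be compared with the standard normal-approximation sample-size formula for a two-group comparison, n per group = 2 (z_{1-alpha/2} + z_{power})^2 var / delta^2. The following is a minimal Python sketch, assuming a two-sided test, a known variance, and the figures quoted above (var = 1/4, delta = log2 fold change = 1); the function name is illustrative. One way to arrive at a constant of 32 is to round both normal quantiles up to 2, since 2 * (2 + 2)^2 = 32, but that reading is an assumption.

```python
import math
from statistics import NormalDist

def reps_per_group(variance, delta, alpha=0.05, power=0.95):
    # Normal-approximation sample size per group for a two-sided,
    # two-sample comparison of means with known variance.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 * variance / delta ** 2

n_exact = reps_per_group(0.25, 1.0)                 # exact normal quantiles
n_rule = 32 * 0.25 / 1.0 ** 2                       # rule of thumb from the text
n_strict = reps_per_group(0.25, 1.0, alpha=0.001)   # a stricter per-gene alpha

print(f"exact quantiles : {n_exact:.2f} -> {math.ceil(n_exact)} reps per group")
print(f"rule of thumb   : {n_rule:.0f} reps per group")
print(f"alpha = 0.001   : {n_strict:.2f} -> {math.ceil(n_strict)} reps per group")
```

With exact quantiles the requirement comes out somewhat below the rule-of-thumb figure, so the rule is mildly conservative. Tightening alpha, as a multiple-comparisons adjustment would, raises the number of replicates, and neither figure includes any biological variation.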
> 
> --Naomi
> 
> 
> At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote:
>> Hi Naomi
>> 
>> Thanks for the reply.
>> 
>> The issue isn't necessarily low expressing genes, but perhaps high
>> expressing genes with a small (ish) fold change.  DESeq seems to
>> only report as significant differences that are high fold changes.
>> 
>> Contrast this to limma for microarrays, where small fold changes can
>> be reported as significant.
>> 
>> For whatever reason, the transcriptomic community have become
>> fixated on "two-fold" as some kind of standard cut-off.  Now, I'm
>> not fixated on that, but the example in DESeq reports 428
>> significant genes with an estimated fold change at FDR 5%, however,
>> NONE of these are in the range -2 : 2.  The minimum positive logFC
>> is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is
>> -2.49 (5.65 fold down-regulation).
>> 
>> So what I am concerned about is finding genes, either highly or
>> lowly expressed, that are differing by a small fold change - say two-fold.
>> 
>> Thanks
>> Mick
>> ________________________________________
>> From: Naomi Altman [naomi at stat.psu.edu]
>> Sent: 14 June 2010 17:42
>> To: michael watson (IAH-C); bioconductor at stat.math.ethz.ch
>> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq
>> 
>> The issue is a mix of expression level and sample size.  For count
>> data, the power is higher when the expression is higher.  Also, the
>> p-values are discrete - the lower the total read count, the fewer
>> values are possible, which messes up the FDR estimation.
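
The discreteness point can be illustrated with a small Python sketch. It assumes two samples with equal library sizes and a gene with total count n, so that under the null hypothesis the count in the first sample is Binomial(n, 1/2). This exact conditional test is an illustrative setup, not DESeq's actual procedure, and the function names are hypothetical.

```python
from math import comb

def two_sided_pvalues(n):
    # Exact two-sided p-value for every possible split (k, n - k) of a
    # total count n, under the null that each read is equally likely to
    # come from either sample.
    pmf = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    pvals = []
    for k in range(n + 1):
        # Sum the probabilities of all outcomes no more likely than k.
        p = sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9))
        pvals.append(min(1.0, p))
    return pvals

def distinct_pvalues(n):
    # Sorted list of the distinct p-values attainable at total count n.
    return sorted(set(round(p, 12) for p in two_sided_pvalues(n)))

for n in (4, 10, 40):
    vals = distinct_pvalues(n)
    print(f"total count {n:3d}: {len(vals):2d} attainable p-values, smallest = {vals[0]:.3g}")
```

At a total count of 4 the only attainable p-values are 0.125, 0.625 and 1, so such a gene can never reach significance at 0.05 however extreme the split. Deeper sequencing or more replicates raises the total count and makes the set of attainable p-values finer.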
>> 
>> Of course, understanding the problem does not necessarily suggest a
>> solution.  But sample sizes will need to be large (or you need to
>> sequence very deeply) if you want to detect differential expression
>> in low expressing genes.
>> 
>> --Naomi
>> 
>> At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote:
>>> Hi
>>> 
>>> This follows on slightly from my experimental design thread.
>>> 
>>> Having worked through the vignette for DESeq, it seems to work
>>> well.  However, for the TagSeqExample.tab data set, when using an
>>> FDR cut off of 0.05, what we see is that we only find differential
>>> expression for large fold changes - an average of log2 fold change
>>> of 5 for up-regulated, and log2 fold change of -5 for
>>> down-regulated.  There are very few significant results that even go
>>> as far down as 2 or -2 - which is still a 4-fold change.
>>> 
>>> So, the question is, how many replicates must we have to get more
>>> sensitive results?  Say down to a log2FC of 1 (two-fold up or down
>>> regulated)?
>>> 
>>> I can calculate this by using DESeq's own estimates of variance to
>>> approximate replicates for T and N in the example data, and keep
>>> going until my significant results start to hit a logFC of 1, but I
>>> wanted to know if anyone else had done this yet?
>>> 
>>> Thanks
>>> Mick
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>> Naomi S. Altman                                814-865-3791 (voice)
>> Associate Professor
>> Dept. of Statistics                              814-863-7114 (fax)
>> Penn State University                         814-865-1348 (Statistics)
>> University Park, PA 16802-2111
>> 
> 

------------------------------
Mark Robinson, PhD (Melb)
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: m.robinson at garvan.org.au
e: mrobinson at wehi.edu.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852
------------------------------





