[BioC] DESeq and number of replicates required for RNA-Seq

Tue Jun 15 17:12:38 CEST 2010

Hi Mick,

Do you know how much PCR amplification was used for your RNA-Seq 
datasets? Could this be one factor leading to more apparent variation?

I'm surprised at your results though, I seem to remember many papers 
with dot-plots comparing expression measurements between biological 
replicates, which show that the RNA-Seq derived results are much tighter 
(less variation) than the array based ones.

Cheers,

Cei

michael watson (IAH-C) wrote:
> Hi Mark
> 
> Thanks for the reply.  I drafted and sent en e-mail on my phone which then lost connectivity (and the message) so apologies if anyone gets something similar twice!
> 
> What worries me here is that we are looking at gene expression; we have been for years, using technologies such as RT-PCR, microarrays and SAGE.  Using those technologies, we have been able to detect small but significant changes.  Now, with RNA-Seq, we are seeing huge variation, and we simply state that due to this huge biological variation, we can only detect large changes.  Now, either these systems do actually contain huge biological variation, which means that many of our RT-PCR and array work was wrong (as these didn't detect such huge biological variation), or the variation we are seeing is not all biological and could be due to some bias we are not yet aware of.
> 
> In fact, both SAGE and RT-PCR rely on count data of some sort, yet again neither showed up so much biological variation that they could only detect large changes.
> 
> I suspect we haven't gotten to the bottom of RNA-Seq data just yet.
> 
> Mick
> 
> ________________________________________
> From: Mark Robinson [mrobinson at wehi.EDU.AU]
> Sent: 15 June 2010 12:20
> To: michael watson (IAH-C)
> Cc: bioc list; Naomi Altman
> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq
> 
> Hi Mick.
> 
> I can't speak for cufflinks, but the TMM normalization in that GB paper is really about accounting for 'composition' biases.  So, this can help when the samples have different RNA composition (or some other systematic effect), but it seems to me like the "dirtiness" you mention here is just that you have large biological variation.   Genomics studies are generally underpowered anyways and high biological variation, which is presumably a reality of your experimental system, just makes detecting changes harder.
> 
> Naomi:  I assume you meant sqrt(Yi), not log(Yi) for the normal approximation to the Possion ?
> 
> Cheers,
> Mark
> 
> On 2010-06-15, at 4:44 PM, michael watson (IAH-C) wrote:
> 
>> Thanks Naomi
>>
>> Yes, I have several RNA-Seq datasets that look like they may have large biological variation.
>>
>> I feel this is the "dirty secret" of the new revolution that is RNA-Seq - even with large numbers of replicates, the variation in (and nature of) the read counts means we can only find genes that are changing by a large amount.
>>
>> I wonder if some of the normalisation suggested by Robinson and Oshlack will help (http://genomebiology.com/2010/11/3/R25).
>>
>> And of course there is cufflinks
>>
>> Thanks
>> Mick
>> ________________________________________
>> From: Naomi Altman [naomi at stat.psu.edu]
>> Sent: 15 June 2010 03:02
>> To: michael watson (IAH-C); Naomi Altman; bioconductor at stat.math.ethz.ch
>> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq
>>
>> Hi Michael,
>> I was working this out for a lecture and here is what I found:
>>
>> If there is enough expression for the Normal approximation to hold
>> then here is a rule of thumb.
>>
>> Suppose that the total number of reads is identical for all samples
>> and that there is NO biological variation.  If Yi is the number of
>> reads for a gene in sample i, then
>> Poisson variation alone leads to log(Yi) approx normal with variance
>> 1/4.  (This is what the DESeq vignette calls "shot" variance.)
>>
>> Using the formula for a 2 sample t-test, you see that to detect
>> 2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need
>> n>32 var/log(fold) which is approximately 8 biological reps per treatment.
>>
>> However, that is for NO biological variation.  (Have a look at the
>> example in the DESeq vignette!) And is assumes alpha=.05 (but we are
>> going to use a much smaller alpha due to the multiple comparisons
>> adjustment).
>>
>> --Naomi
>>
>>
>> At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote:
>>> Hi Naomi
>>>
>>> Thanks for the reply.
>>>
>>> The issue isn't necessarily low expressing genes, but perhaps high
>>> expressing genes with a small (ish) fold change.  DESeq seems to
>>> only report as significant differences that are high fold changes.
>>>
>>> Contrast this to limma for microarrays, where small fold changes can
>>> be reported as significant.
>>>
>>> For whatever reason, the transcriptomic community have become
>>> fixated on "two-fold" as some kind of standard cut-off.  Now, I'm
>>> not fixated on that, but the example in DESeq reports 428
>>> significant genes with an estimated fold change at FDR 5%, however,
>>> NONE of these are in the range -2 : 2.  The minimum positive logFC
>>> is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is
>>> 2.49 (5.65 fold down-regulation).
>>>
>>> So what I am concerned about is finding genes, either highly or
>>> lowly expressed, that are differing by a small fold change - say two-fold.
>>>
>>> Thanks
>>> Mick
>>> ________________________________________
>>> From: Naomi Altman [naomi at stat.psu.edu]
>>> Sent: 14 June 2010 17:42
>>> To: michael watson (IAH-C); bioconductor at stat.math.ethz.ch
>>> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq
>>>
>>> The issue is a mix of expression level and sample size.  For count
>>> data, the power is higher when the expression is higher.  Also, the
>>> p-values are discrete - the lower the total read count, the fewer
>>> values are possible, which messes up the FDR estimation.
>>>
>>> Of course, understanding the problem does not necessarily suggest a
>>> solution.  But sample sizes will need to be large (or you need to
>>> sequence very deeply) if you want to detect differential expression
>>> in low expressing genes.
>>>
>>> --Naomi
>>>
>>> At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote:
>>>> Hi
>>>>
>>>> This follows on slightly from my experimental design thread.
>>>>
>>>> Having worked through the vignette for DESeq, it seems to work
>>>> well.  However, for the TagSeqExample.tab data set, when using an
>>>> FDR cut off of 0.05, what we see is that we only find differential
>>>> expression for large fold changes - an average of log2 fold change
>>>> of 5 for up-regulated, and log2 fold change of -5 for
>>>> down-regulated.  There are very few significant results that even go
>>>> as far down as 2 or -2 - which is still a 4-fold change.
>>>>
>>>> So, the question is, how many replicates must we have to get more
>>>> sensitive results?  Say down to log2FC of 1? (two-fold up or down
>>> regulated)?
>>>> I can calculate this by using DESeq's own estimates of variance to
>>>> approximate replicates for T and N in the example data, and keep
>>>> going until my significant results start to hit a logFC of 1, but I
>>>> wanted to know if anyone else had done this yet?
>>>>
>>>> Thanks
>>>> Mick
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> Naomi S. Altman                                814-865-3791 (voice)
>>> Associate Professor
>>> Dept. of Statistics                              814-863-7114 (fax)
>>> Penn State University                         814-865-1348 (Statistics)
>>> University Park, PA 16802-2111
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> Naomi S. Altman                                814-865-3791 (voice)
>> Associate Professor
>> Dept. of Statistics                              814-863-7114 (fax)
>> Penn State University                         814-865-1348 (Statistics)
>> University Park, PA 16802-2111
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> ------------------------------
> Mark Robinson, PhD (Melb)
> Epigenetics Laboratory, Garvan
> Bioinformatics Division, WEHI
> e: m.robinson at garvan.org.au
> e: mrobinson at wehi.edu.au
> p: +61 (0)3 9345 2628
> f: +61 (0)3 9347 0852
> ------------------------------
> 
> 
> 
> 
> 
> 
> ______________________________________________________________________
> The information in this email is confidential and intend...{{dropped:6}}
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor