[BioC] Statistics for next-generation sequencing transcriptomics

Sat Jul 25 06:37:12 CEST 2009

I've recently spent some time investigating this topic and on the data  
set I have examined, where a Poisson model seems to hold very well  
across technical replicate lanes (based on chi-squared goodness-of-fit  
statistics), Fisher's exact test, the likelihood ratio test, and the  
binomial test all perform equivalently in terms of identifying  
differentially expressed genes and in terms of producing very low p- 
values.  I also compared against a recently developed method for SAGE  
analysis, the edgeR package of Robinson and Smyth, but perhaps due to  
the almost exact Poisson variability in my data, their method did not  
perform better than the others. (The method is designed to work  
particularly well on overdispersed Poisson data.)  And using just  
mapped reads for the lane totals (suggested in this thread by Michael  
Dondrup) did not have much effect - the counts were still very high.

I think overall that Naomi is right - in this Poisson context,  
frequentist methods will produce very small p-values.  I do think that  
part of the problem is that most RNASeq data sets produced today are  
not capturing biological variability, so we are really just doing a  
one-versus-one comparison even when we have "replicate" lanes (which  
are almost exclusively technical replicates, or at best replicates on  
cell lines) and hopefully once this is taken into account more  
reasonable test statistics will be seen.

I also agree with John Herbert's point about transcript length having  
an effect here, and I'm not sure that a good solution has been  
developed for this yet either, see Oshlack and Wakefield's recent  
paper for a discussion: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2678084&tool=pmcentrez 
.  They point out that for Poisson data, dividing by transcript length  
does not fix the length-bias problem.

Cheers,
Margaret

Margaret Taub
PhD Candidate
UC Berkeley Dept of Statistics

>> From: Zeljko Debeljak <zeljko.debeljak at gmail.com>
>> Date: July 24, 2009 9:21:40 AM PDT
>> To: Michael Dondrup <Michael.Dondrup at bccs.uib.no>
>> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>> Subject: Re: [BioC] Statistics for next-generation sequencing  
>> transcriptomics
>>
>> Dear all,
>>
>> according to the described problem maybe binomial test could be
>> designed for this purpose? Just a thought.
>>
>> Zeljko Debeljak, PhD
>> Medical Biochemistry Specialist
>> Osijek Clinical Hospital
>> CROATIA
>>
>> 2009/7/24, Michael Dondrup <Michael.Dondrup at bccs.uib.no>:
>>> Hi again,
>>>
>>> I see the point now :)
>>> Just another idea, did you take the number of all reads sequenced as
>>> a total, or the number of all reads successfully mapped to the  
>>> genome/
>>> reference sequence? What happens, when only the mapped reads are
>>> counted?
>>> I suspect that the choice of this could influence whatever method is
>>> applied, because large fractions of unmapped reads (given there are
>>> such) could -artificially?- increase the confidence in the result.
>>>
>>> Michael
>>>
>>>
>>> Am 24.07.2009 um 17:25 schrieb michael watson (IAH-C):
>>>
>>>> Hi Michael
>>>>
>>>> No, you're not missing anything, I wrote down my example
>>>> incorrectly.  I wrote down the elements of the contingency table
>>>> rather than the totals, so it should have been:
>>>>
>>>>> mat <- matrix(c(22000,260000,48000,507000), nrow=2)
>>>>> mat
>>>>     [,1]   [,2]
>>>> [1,]  22000  48000
>>>> [2,] 260000 507000
>>>>> fisher.test(mat)
>>>>
>>>>      Fisher's Exact Test for Count Data
>>>>
>>>> data:  mat
>>>> p-value < 2.2e-16
>>>> alternative hypothesis: true odds ratio is not equal to 1
>>>> 95 percent confidence interval:
>>>> 0.8789286 0.9087655
>>>> sample estimates:
>>>> odds ratio
>>>> 0.8937356
>>>>
>>>> Sorry about that!
>>>>
>>>> This is a case where I suspect there is a real difference, as the
>>>> relative frequency rises from 0.084 to 0.094.  However, as I
>>>> mentioned, this result is masked by all the other "significant"
>>>> results.  As Naomi says, it is because as the sample size gets
>>>> larger, we have the power to detect tiny changes as significant.
>>>>
>>>> So what is the solution?
>>>>
>>>> John Herbert has suggested something, and I will try that.
>>>>
>>>> Thanks
>>>> Michael
>>>>
>>>> -----Original Message-----
>>>> From: Michael Dondrup [mailto:Michael.Dondrup at bccs.uib.no]
>>>> Sent: 24 July 2009 15:00
>>>> To: michael watson (IAH-C)
>>>> Cc: bioconductor at stat.math.ethz.ch
>>>> Subject: Re: [BioC] Statistics for next-generation sequencing
>>>> transcriptomics
>>>>
>>>> Hi Michael,
>>>> I am having a very similar problem using 454 seq data, so I am very
>>>> much interested in this discussion. However,  I do not quite
>>>> understand how to
>>>> for the contigency table and to achieve such small p-value here. My
>>>> naive approach would be to count hits to GeneA and to count hits to
>>>> the rest of the genome (all - #hits to gene A), giving a pretty  
>>>> much
>>>> unbalanced 2x2 table like this:
>>>>> mat
>>>>        Sample.1 Sample.2
>>>> Gene.A      22000    43000
>>>> The.rest   238000   464000
>>>> but then I do not see the point here, because there is a large p
>>>> value, as I would expect:
>>>>
>>>>> fisher.test(mat)
>>>>
>>>> 	Fisher's Exact Test for Count Data
>>>>
>>>> data:  mat
>>>> p-value = 0.7717
>>>> alternative hypothesis: true odds ratio is not equal to 1
>>>> 95 percent confidence interval:
>>>> 0.9805937 1.0145920
>>>> sample estimates:
>>>> odds ratio
>>>> 0.9974594
>>>>
>>>> Am I missing something?
>>>>
>>>> Best
>>>> Michael
>>>>
>>>>
>>>> Am 24.07.2009 um 13:22 schrieb michael watson (IAH-C):
>>>>
>>>>> Hi
>>>>>
>>>>> I'd like to have a discussion about statistics for transcriptomics
>>>>> using next-generation sequencing (if there hasn't already been  
>>>>> one -
>>>>> if there has, then please someone point me to it!)
>>>>>
>>>>> What we're seeing in the literature, and here at IAH, are datasets
>>>>> where someone has sequenced the transcriptome of two samples using
>>>>> something like Illumina.  These have been mapped to known  
>>>>> sequences
>>>>> and counts produced.
>>>>>
>>>>> So what we have is something like this:
>>>>>
>>>>> geneA: 22000 sequences from 260000 match in sample 1, 43000
>>>>> sequences from 507000 in sample 2.
>>>>>
>>>>> It's been suggested that one possible approach would be to  
>>>>> construct
>>>>> 2x2 contingency tables and perform Fisher's exact test or the Chi-
>>>>> squared test, as has been applied to SAGE data.
>>>>> However, I've found that when I do that, the p-values for this  
>>>>> type
>>>>> of data are incredibly, incredibly small, such that over 90% of my
>>>>> data points are significant, even after adjusting for multiple
>>>>> testing.  I assume/hope that this is because these tests were not
>>>>> designed to cope with this type of data.
>>>>>
>>>>> For instance, applying Fisher's test to the example above yields  
>>>>> a p-
>>>>> value of 3.798644e-23.
>>>>>
>>>>> As I see it there are three possibilities:
>>>>> 1) I'm doing something wrong
>>>>> 2) These tests are totally inappropriate for this type of data
>>>>> 3) All of my data points are highly significantly different
>>>>>
>>>>> I'm thinking that 2 is probably true, though I wouldn't rule out  
>>>>> 1.
>>>>>
>>>>> Any thoughts and comments are very welcome,
>>>>>
>>>>> Mick
>>>>>
>>>>>
>>>>> 	[[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>>
>>>
>>> Michael Dondrup, Ph.D.
>>> Bergen Center for Computational Science
>>> Computational Biology Unit
>>> Unifob AS - Thormøhlensgate 55, N-5008 Bergen, Norway
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor