[BioC] Statistics for next-generation sequencing transcriptomics

Cei Abreu-Goodger cei at ebi.ac.uk
Fri Jul 24 17:36:38 CEST 2009


I don't know if this helps at all, but if the statistic is testing the 
null hypothesis that the counts are equal, would it not help to test for 
a difference > x? Of course we would then have to arbitrarily define 
"x", but perhaps we could get more biologically meaningful results?
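
A rough sketch of what I mean, in R. It assumes the corrected table quoted below, reads "difference > x" as an odds ratio outside [1/x, x], and doubles the smaller of two one-sided p-values as a conservative combination. The helper name threshold_test and the values of x are made up for illustration only.

```r
# Corrected counts from the quoted message below.
mat <- matrix(c(22000, 260000, 48000, 507000), nrow = 2)

# Hypothetical helper: tests the null hypothesis that the odds ratio lies
# within [1/x, x], rather than the usual null that it equals 1.
threshold_test <- function(mat, x) {
  # Evidence that the odds ratio is below 1/x.
  p_down <- fisher.test(mat, or = 1 / x, alternative = "less")$p.value
  # Evidence that the odds ratio is above x.
  p_up <- fisher.test(mat, or = x, alternative = "greater")$p.value
  # Conservative (Bonferroni-style) combination of the two one-sided tests.
  min(1, 2 * min(p_down, p_up))
}

# Two arbitrary thresholds: a 5% and a 20% change in the odds.
p_small_x <- threshold_test(mat, 1.05)
p_large_x <- threshold_test(mat, 1.2)
print(c(x_1.05 = p_small_x, x_1.2 = p_large_x))
```

With an estimated odds ratio of about 0.89, the table should still look significant against a small threshold but not against a larger one, which is the kind of filtering I have in mind.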

Cei

michael watson (IAH-C) wrote:
> Hi Michael
> 
> No, you're not missing anything, I wrote down my example incorrectly.  I wrote down the elements of the contingency table rather than the totals, so it should have been:
> 
>  > mat <- matrix(c(22000,260000,48000,507000), nrow=2)
>  > mat
>        [,1]   [,2]
> [1,]  22000  48000
> [2,] 260000 507000
>  > fisher.test(mat)
> 
>         Fisher's Exact Test for Count Data
> 
> data:  mat 
> p-value < 2.2e-16
> alternative hypothesis: true odds ratio is not equal to 1 
> 95 percent confidence interval:
>  0.8789286 0.9087655 
> sample estimates:
> odds ratio 
>  0.8937356 
> 
> Sorry about that!
> 
> This is a case where I suspect there is a real difference, as the relative frequency rises from 0.084 to 0.094.  However, as I mentioned, this result is masked by all the other "significant" results.  As Naomi says, it is because as the sample size gets larger, we have the power to detect tiny changes as significant.  
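
To illustrate the point about sample size, here is a small sketch in R. It keeps the proportions of the corrected table fixed and only scales the counts up. The base counts and scaling factors are chosen for illustration and are not real data.

```r
# Same proportions as the corrected table, at one thousandth of the depth.
base <- matrix(c(22, 260, 48, 507), nrow = 2)

# Illustrative scaling factors; 1000 reproduces the counts in the thread.
scales <- c(1, 10, 100, 1000)

# Fisher's exact test at each depth: the effect size stays roughly the same,
# only the amount of data changes.
pvals <- sapply(scales, function(s) fisher.test(base * s)$p.value)

print(data.frame(scale = scales, p.value = pvals))
```

The p-value should drop steadily as the counts grow, even though the relative frequencies never change.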
> 
> So what is the solution?
> 
> John Herbert has suggested something, and I will try that.
> 
> Thanks
> Michael
> 
> -----Original Message-----
> From: Michael Dondrup [mailto:Michael.Dondrup at bccs.uib.no] 
> Sent: 24 July 2009 15:00
> To: michael watson (IAH-C)
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] Statistics for next-generation sequencing transcriptomics
> 
> Hi Michael,
> I am having a very similar problem using 454 seq data, so I am very  
> much interested in this discussion. However, I do not quite  
> understand how to form the contingency table and to achieve such a  
> small p-value here. My  
> naive approach would be to count hits to GeneA and to count hits to  
> the rest of the genome (all - #hits to gene A), giving a pretty much  
> unbalanced 2x2 table like this:
>  > mat
>           Sample.1 Sample.2
> Gene.A      22000    43000
> The.rest   238000   464000
> but then I do not see the point here, because there is a large p  
> value, as I would expect:
> 
>  > fisher.test(mat)
> 
> 	Fisher's Exact Test for Count Data
> 
> data:  mat
> p-value = 0.7717
> alternative hypothesis: true odds ratio is not equal to 1
> 95 percent confidence interval:
>   0.9805937 1.0145920
> sample estimates:
> odds ratio
>   0.9974594
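
For reference, the table above can be built from the per-gene hits and the per-sample totals as in the following sketch. The variable names hits and totals are only illustrative; the numbers are the ones from the original example.

```r
# Reads mapped to gene A in each sample.
hits <- c(Sample.1 = 22000, Sample.2 = 43000)

# Total mapped reads in each sample.
totals <- c(Sample.1 = 260000, Sample.2 = 507000)

# 2x2 table: hits to gene A versus hits to everything else.
mat <- rbind(Gene.A = hits, The.rest = totals - hits)
print(mat)

# Two-sided Fisher's exact test on the constructed table.
p_value <- fisher.test(mat)$p.value
print(p_value)
```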
> 
> Am I missing something?
> 
> Best
> Michael
> 
> 
> On 24.07.2009 at 13:22, michael watson (IAH-C) wrote:
> 
>> Hi
>>
>> I'd like to have a discussion about statistics for transcriptomics  
>> using next-generation sequencing (if there hasn't already been one -  
>> if there has, then please someone point me to it!)
>>
>> What we're seeing in the literature, and here at IAH, are datasets  
>> where someone has sequenced the transcriptome of two samples using  
>> something like Illumina.  These have been mapped to known sequences  
>> and counts produced.
>>
>> So what we have is something like this:
>>
>> geneA: 22000 sequences from 260000 match in sample 1, 43000  
>> sequences from 507000 in sample 2.
>>
>> It's been suggested that one possible approach would be to construct  
>> 2x2 contingency tables and perform Fisher's exact test or the Chi- 
>> squared test, as has been applied to SAGE data.
>> However, I've found that when I do that, the p-values for this type  
>> of data are incredibly, incredibly small, such that over 90% of my  
>> data points are significant, even after adjusting for multiple  
>> testing.  I assume/hope that this is because these tests were not  
>> designed to cope with this type of data.
>>
>> For instance, applying Fisher's test to the example above yields a p- 
>> value of 3.798644e-23.
>>
>> As I see it there are three possibilities:
>> 1) I'm doing something wrong
>> 2) These tests are totally inappropriate for this type of data
>> 3) All of my data points are highly significantly different
>>
>> I'm thinking that 2 is probably true, though I wouldn't rule out 1.
>>
>> Any thoughts and comments are very welcome,
>>
>> Mick
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 


