[BioC] Statistics for next-generation sequencing transcriptomics

Michael Dondrup Michael.Dondrup at bccs.uib.no
Fri Jul 24 16:00:03 CEST 2009


Hi Michael,
I am having a very similar problem with 454 sequencing data, so I am very  
much interested in this discussion. However, I do not quite  
understand how to form the contingency table and obtain such a small  
p-value here. My  
naive approach would be to count hits to GeneA and to count hits to  
the rest of the genome (all - #hits to gene A), giving a pretty much  
unbalanced 2x2 table like this:
 > mat
          Sample.1 Sample.2
Gene.A      22000    43000
The.rest   238000   464000
but then I do not see the point here, because the p-value is large,  
as I would expect:

 > fisher.test(mat)

	Fisher's Exact Test for Count Data

data:  mat
p-value = 0.7717
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  0.9805937 1.0145920
sample estimates:
odds ratio
  0.9974594
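
For anyone who wants to reproduce this, here is a minimal sketch of how a
table like the one above can be built from the counts quoted in the original
message below. The variable names (hits, total, prop, ft, ct) are only
illustrative, and the chi-squared test is added purely for comparison; it is
not part of the output shown above.

```r
# Counts quoted in the thread: reads matching gene A, and all mapped reads.
hits  <- c(Sample.1 = 22000,  Sample.2 = 43000)
total <- c(Sample.1 = 260000, Sample.2 = 507000)

# Second row is "all reads minus hits to gene A", so the four cells
# are disjoint and each column sums to that sample's total.
mat <- rbind(Gene.A = hits, The.rest = total - hits)
print(mat)

# Fraction of reads matching gene A in each sample.
prop <- hits / total
print(prop)

# Fisher's exact test on the 2x2 table.
ft <- fisher.test(mat)
print(ft$p.value)

# Chi-squared test without continuity correction, for comparison only.
ct <- chisq.test(mat, correct = FALSE)
print(ct$p.value)
```

The fractions in prop make it easy to see how close the two samples are
before any test is run.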

Am I missing something?

Best
Michael


On 24.07.2009 at 13:22, michael watson (IAH-C) wrote:

> Hi
>
> I'd like to have a discussion about statistics for transcriptomics  
> using next-generation sequencing (if there hasn't already been one -  
> if there has, then please someone point me to it!)
>
> What we're seeing in the literature, and here at IAH, are datasets  
> where someone has sequenced the transcriptome of two samples using  
> something like Illumina.  These have been mapped to known sequences  
> and counts produced.
>
> So what we have is something like this:
>
> geneA: 22000 sequences from 260000 match in sample 1, 43000  
> sequences from 507000 in sample 2.
>
> It's been suggested that one possible approach would be to construct  
> 2x2 contingency tables and perform Fisher's exact test or the  
> Chi-squared test, as has been applied to SAGE data.
> However, I've found that when I do that, the p-values for this type  
> of data are incredibly, incredibly small, such that over 90% of my  
> data points are significant, even after adjusting for multiple  
> testing.  I assume/hope that this is because these tests were not  
> designed to cope with this type of data.
>
> For instance, applying Fisher's test to the example above yields a  
> p-value of 3.798644e-23.
>
> As I see it there are three possibilities:
> 1) I'm doing something wrong
> 2) These tests are totally inappropriate for this type of data
> 3) All of my data points are highly significantly different
>
> I'm thinking that 2 is probably true, though I wouldn't rule out 1.
>
> Any thoughts and comments are very welcome,
>
> Mick