# [BioC] Statistics for next-generation sequencing transcriptomics

Michael Dondrup Michael.Dondrup at bccs.uib.no
Fri Jul 24 17:51:31 CEST 2009

```
Hi again,

I see the point now :)
Just another idea: did you take the number of all reads sequenced as
the total, or the number of reads successfully mapped to the
genome/reference sequence? What happens when only the mapped reads
are counted?
I suspect that this choice could influence whatever method is
applied, because a large fraction of unmapped reads (if there is
one) could, perhaps artificially, increase the confidence in the result.

Michael

On 24.07.2009 at 17:25, michael watson (IAH-C) wrote:

> Hi Michael
>
> No, you're not missing anything, I wrote down my example
> incorrectly.  I wrote down the elements of the contingency table
> rather than the totals, so it should have been:
>
>> mat <- matrix(c(22000,260000,48000,507000), nrow=2)
>> mat
>       [,1]   [,2]
> [1,]  22000  48000
> [2,] 260000 507000
>> fisher.test(mat)
>
>        Fisher's Exact Test for Count Data
>
> data:  mat
> p-value < 2.2e-16
> alternative hypothesis: true odds ratio is not equal to 1
> 95 percent confidence interval:
> 0.8789286 0.9087655
> sample estimates:
> odds ratio
> 0.8937356
>
>
> This is a case where I suspect there is a real difference, as the
> relative frequency rises from 0.084 to 0.094.  However, as I
> mentioned, this result is masked by all the other "significant"
> results.  As Naomi says, it is because as the sample size gets
> larger, we have the power to detect tiny changes as significant.
>
> So what is the solution?
>
> John Herbert has suggested something, and I will try that.
>
> Thanks
> Michael
>
> -----Original Message-----
> From: Michael Dondrup [mailto:Michael.Dondrup at bccs.uib.no]
> Sent: 24 July 2009 15:00
> To: michael watson (IAH-C)
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] Statistics for next-generation sequencing
> transcriptomics
>
> Hi Michael,
> I am having a very similar problem using 454 seq data, so I am very
> much interested in this discussion. However, I do not quite
> understand how to
> form the contingency table and achieve such a small p-value here. My
> naive approach would be to count hits to GeneA and to count hits to
> the rest of the genome (all - #hits to gene A), giving a rather
> unbalanced 2x2 table like this:
>> mat
>          Sample.1 Sample.2
> Gene.A      22000    43000
> The.rest   238000   464000
> but then I do not see the point here, because there is a large p
> value, as I would expect:
>
>> fisher.test(mat)
>
> 	Fisher's Exact Test for Count Data
>
> data:  mat
> p-value = 0.7717
> alternative hypothesis: true odds ratio is not equal to 1
> 95 percent confidence interval:
>  0.9805937 1.0145920
> sample estimates:
> odds ratio
>  0.9974594
>
> Am I missing something?
>
> Best
> Michael
>
>
> On 24.07.2009 at 13:22, michael watson (IAH-C) wrote:
>
>> Hi
>>
>> I'd like to have a discussion about statistics for transcriptomics
>> using next-generation sequencing (if there hasn't already been one -
>> if there has, then please someone point me to it!)
>>
>> What we're seeing in the literature, and here at IAH, are datasets
>> where someone has sequenced the transcriptome of two samples using
>> something like Illumina.  These have been mapped to known sequences
>> and counts produced.
>>
>> So what we have is something like this:
>>
>> geneA: 22000 sequences from 260000 match in sample 1, 43000
>> sequences from 507000 in sample 2.
>>
>> It's been suggested that one possible approach would be to construct
>> 2x2 contingency tables and perform Fisher's exact test or the Chi-
>> squared test, as has been applied to SAGE data.
>> However, I've found that when I do that, the p-values for this type
>> of data are incredibly, incredibly small, such that over 90% of my
>> data points are significant, even after adjusting for multiple
>> testing.  I assume/hope that this is because these tests were not
>> designed to cope with this type of data.
>>
>> For instance, applying Fisher's test to the example above yields a p-
>> value of 3.798644e-23.
>>
>> As I see it there are three possibilities:
>> 1) I'm doing something wrong
>> 2) These tests are totally inappropriate for this type of data
>> 3) All of my data points are highly significantly different
>>
>> I'm thinking that 2 is probably true, though I wouldn't rule out 1.
>>
>> Any thoughts and comments are very welcome,
>>
>> Mick
>
>
>

Michael Dondrup, Ph.D.
Bergen Center for Computational Science
Computational Biology Unit
Unifob AS - Thormøhlensgate 55, N-5008 Bergen, Norway

```
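The two tables discussed in the thread, and the question of which total to use, can be sketched in R as follows. This is only an illustration: the helper names `gene_vs_rest` and `test_gene` are made up here, the table uses the gene-versus-rest layout from Michael Dondrup's earlier message (not the corrected table, which puts the totals in the second row), and the `unmapped` read counts are invented, since the thread gives no such numbers.

```r
# Sketch only: helper and variable names are illustrative, not from the thread.

# 2x2 table for one gene: hits to the gene vs. hits to everything else.
# `gene` and `total` are length-2 vectors (sample 1, sample 2).
gene_vs_rest <- function(gene, total) {
  stopifnot(length(gene) == 2, length(total) == 2, all(gene <= total))
  matrix(c(gene, total - gene), nrow = 2, byrow = TRUE,
         dimnames = list(c("Gene.A", "The.rest"),
                         c("Sample.1", "Sample.2")))
}

# Run Fisher's exact test and keep the effect size next to the p-value.
test_gene <- function(gene, total) {
  ft <- fisher.test(gene_vs_rest(gene, total))
  c(freq.1     = gene[1] / total[1],
    freq.2     = gene[2] / total[2],
    odds.ratio = unname(ft$estimate),
    p.value    = ft$p.value)
}

total <- c(260000, 507000)   # totals quoted in the thread

# Original counts (22000 vs 43000); the thread reports p = 0.7717 for this table.
res_original <- test_gene(c(22000, 43000), total)

# Corrected counts (22000 vs 48000): relative frequency about 0.085 vs 0.095.
res_corrected <- test_gene(c(22000, 48000), total)

# HYPOTHETICAL: suppose the totals above are mapped reads only, and this many
# additional reads per sample failed to map. The numbers are invented purely
# to show how the choice of denominator enters the test.
unmapped <- c(100000, 150000)
res_all_reads <- test_gene(c(22000, 48000), total + unmapped)

print(signif(rbind(original  = res_original,
                   corrected = res_corrected,
                   all.reads = res_all_reads), 3))
```

Printing the relative frequencies and odds ratio beside each p-value makes it easier to see whether a "significant" result reflects a shift of practical size, which is the concern raised in the thread about very large read counts.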