[BioC] Statistics for next-generation sequencing transcriptomics

Zeljko Debeljak zeljko.debeljak at gmail.com
Fri Jul 24 18:21:40 CEST 2009


Dear all,

Given the described problem, perhaps a binomial test could be used for
this purpose? Just a thought.
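
For what it's worth, a minimal sketch of how such a binomial test might
be set up, using the counts from Michael's corrected example (gene A
gets 22000 of the 260000 reads in sample 1 and 48000 of the 507000
reads in sample 2); the choice of null proportion below is my own
assumption:

x  <- 22000                        # gene A reads in sample 1
n  <- 22000 + 48000                # gene A reads in samples 1 and 2 combined
p0 <- 260000 / (260000 + 507000)   # share of sample 1 in the total library size
binom.test(x, n, p = p0)           # do gene A reads split in that proportion?

Note that with counts this large, a binomial test will run into the
same issue discussed below: even tiny differences come out highly
significant.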

Zeljko Debeljak, PhD
Medical Biochemistry Specialist
Osijek Clinical Hospital
CROATIA

2009/7/24, Michael Dondrup <Michael.Dondrup at bccs.uib.no>:
> Hi again,
>
> I see the point now :)
>   Just another idea: did you take the number of all reads sequenced as
> the total, or the number of reads successfully mapped to the genome/
> reference sequence? What happens when only the mapped reads are
> counted?
> I suspect that this choice could influence whatever method is applied,
> because a large fraction of unmapped reads (if there is one) could
> artificially increase the confidence in the result.
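>
> A rough sketch of what I mean, using the gene A counts from below; the
> mapped totals (200000 and 390000) are made-up numbers, purely to show
> how the choice of denominator changes the table:
>
> geneA  <- c(22000, 48000)          # gene A reads in samples 1 and 2
> total  <- c(260000, 507000)        # all reads sequenced
> mapped <- c(200000, 390000)        # hypothetical: reads that actually mapped
> mat.total  <- rbind(Gene.A = geneA, The.rest = total  - geneA)
> mat.mapped <- rbind(Gene.A = geneA, The.rest = mapped - geneA)
> fisher.test(mat.total)$p.value     # "rest" = everything sequenced
> fisher.test(mat.mapped)$p.value    # "rest" = mapped reads only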
>
> Michael
>
>
> Am 24.07.2009 um 17:25 schrieb michael watson (IAH-C):
>
>> Hi Michael
>>
>> No, you're not missing anything, I wrote down my example
>> incorrectly.  I wrote down the elements of the contingency table
>> rather than the totals, so it should have been:
>>
>>> mat <- matrix(c(22000,260000,48000,507000), nrow=2)
>>> mat
>>       [,1]   [,2]
>> [1,]  22000  48000
>> [2,] 260000 507000
>>> fisher.test(mat)
>>
>>        Fisher's Exact Test for Count Data
>>
>> data:  mat
>> p-value < 2.2e-16
>> alternative hypothesis: true odds ratio is not equal to 1
>> 95 percent confidence interval:
>> 0.8789286 0.9087655
>> sample estimates:
>> odds ratio
>> 0.8937356
>>
>> Sorry about that!
>>
>> This is a case where I suspect there is a real difference, as the
>> relative frequency rises from 0.084 to 0.094.  However, as I
>> mentioned, this result is masked by all the other "significant"
>> results.  As Naomi says, this is because, as the sample size gets
>> larger, we have the power to detect tiny changes as significant.
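>>
>> One way to see the sample-size effect (a sketch, not a fix): shrink
>> the table while keeping the proportions roughly the same, for example
>> by dividing every count by 100, and the same odds ratio is no longer
>> detected as significant:
>>
>> mat <- matrix(c(22000, 260000, 48000, 507000), nrow = 2)
>> fisher.test(mat)$p.value               # vanishingly small, as above
>> fisher.test(round(mat / 100))$p.value  # same proportions, ~1/100 of the
>>                                        # reads: no longer significant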
>>
>> So what is the solution?
>>
>> John Herbert has suggested something, and I will try that.
>>
>> Thanks
>> Michael
>>
>> -----Original Message-----
>> From: Michael Dondrup [mailto:Michael.Dondrup at bccs.uib.no]
>> Sent: 24 July 2009 15:00
>> To: michael watson (IAH-C)
>> Cc: bioconductor at stat.math.ethz.ch
>> Subject: Re: [BioC] Statistics for next-generation sequencing
>> transcriptomics
>>
>> Hi Michael,
>> I am having a very similar problem with 454 sequencing data, so I am
>> very much interested in this discussion. However, I do not quite
>> understand how to form the contingency table and arrive at such a
>> small p-value here. My naive approach would be to count hits to
>> GeneA and hits to the rest of the genome (all hits - #hits to gene A),
>> giving a pretty unbalanced 2x2 table like this:
>>> mat
>>          Sample.1 Sample.2
>> Gene.A      22000    43000
>> The.rest   238000   464000
>> but then I do not see the problem here, because I get a large
>> p-value, as I would expect:
>>
>>> fisher.test(mat)
>>
>> 	Fisher's Exact Test for Count Data
>>
>> data:  mat
>> p-value = 0.7717
>> alternative hypothesis: true odds ratio is not equal to 1
>> 95 percent confidence interval:
>>  0.9805937 1.0145920
>> sample estimates:
>> odds ratio
>>  0.9974594
>>
>> Am I missing something?
>>
>> Best
>> Michael
>>
>>
>> Am 24.07.2009 um 13:22 schrieb michael watson (IAH-C):
>>
>>> Hi
>>>
>>> I'd like to have a discussion about statistics for transcriptomics
>>> using next-generation sequencing (if there hasn't already been one -
>>> if there has, then please someone point me to it!)
>>>
>>> What we're seeing in the literature, and here at IAH, are datasets
>>> where someone has sequenced the transcriptome of two samples using
>>> something like Illumina.  These have been mapped to known sequences
>>> and counts produced.
>>>
>>> So what we have is something like this:
>>>
>>> geneA: 22000 sequences from 260000 match in sample 1, 43000
>>> sequences from 507000 in sample 2.
>>>
>>> It's been suggested that one possible approach would be to construct
>>> 2x2 contingency tables and perform Fisher's exact test or the Chi-
>>> squared test, as has been applied to SAGE data.
>>> However, I've found that when I do that, the p-values for this type
>>> of data are incredibly, incredibly small, such that over 90% of my
>>> data points are significant, even after adjusting for multiple
>>> testing.  I assume/hope that this is because these tests were not
>>> designed to cope with this type of data.
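>>>
>>> For concreteness, here is one way the table and the two tests could
>>> be set up for geneA; forming the second row as "total minus geneA" is
>>> an assumption on my part, and as the replies above show, exactly how
>>> the table is formed matters a great deal:
>>>
>>> geneA <- c(22000, 43000)       # gene A reads in samples 1 and 2
>>> total <- c(260000, 507000)     # total reads in samples 1 and 2
>>> mat <- rbind(Gene.A = geneA, The.rest = total - geneA)
>>> fisher.test(mat)               # Fisher's exact test
>>> chisq.test(mat)                # chi-squared test, for comparison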
>>>
>>> For instance, applying Fisher's test to the example above yields a p-
>>> value of 3.798644e-23.
>>>
>>> As I see it there are three possibilities:
>>> 1) I'm doing something wrong
>>> 2) These tests are totally inappropriate for this type of data
>>> 3) All of my data points are highly significantly different
>>>
>>> I'm thinking that 2 is probably true, though I wouldn't rule out 1.
>>>
>>> Any thoughts and comments are very welcome,
>>>
>>> Mick
>>>
>>>
>>
>>
>>
>
> Michael Dondrup, Ph.D.
> Bergen Center for Computational Science
> Computational Biology Unit
> Unifob AS - Thormøhlensgate 55, N-5008 Bergen, Norway
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>


