[BioC] Statistics for next-generation sequencing transcriptomics

John Herbert j.m.herbert at bham.ac.uk
Fri Jul 24 13:45:36 CEST 2009

Hello Michael.
Thanks for giving me the chance, I hope, to say something remotely useful here for onece. 
Firstly, we have recently developed a likelihood ratio statistics for looking at differential expression between two grops cDNA libraries which are effectively the same as comparing Illumina transcriptome libraries. It employs a FDR procedure q-value (Storey, Benjamini and Hochberg) for multiple testing. I don't know if 100% it will solve your problem but it accounts for different cDNA library sizes and is worth a go. The paper is here:
A link to a website that lets you download a Perl script to run these analyses is here:
Choose the StandAlone_MultiDiff and run the Perl script locally, preferably on some flavour of unix/linux. If you have any problems with it, I am happy to help. 
I think one other thing to remember is, that gene transcripts are different sizes and the sequencing from Illumina, if I am talking about the same thing, are not generated from one part of the transcript. So a shorter transcript will appear to be expressed more in your library as the RNA is fragmented, and as such, it will look like you are getting more copies expressed. There is a publication where they divided each counts for a transcript by the length of a transcript but I don't know if this is the most valid way to deal with it. If your Illumina libraries are just NGS SAGE libraries, then you don't have to worry about this and you will be fine with the above statistics. 
Sorry if this is meant to be an R/Bioconductor area only but I thought it may help Michael. 
Kind regards,
Bioinformatics Officer
Molecular Angiogenesis group
First Floor, Institute of Biomedical Research
The University of Birmingham
Medical School
B15 2TT  (UK)
j.m.herbert at bham.ac.uk
Tel:  +44 121 414 3733
Fax: +44 121 415 8677 


From: bioconductor-bounces at stat.math.ethz.ch on behalf of michael watson (IAH-C)
Sent: Fri 24/07/2009 12:22
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] Statistics for next-generation sequencing transcriptomics


I'd like to have a discussion about statistics for transcriptomics using next-generation sequencing (if there hasn't already been one - if there has, then please someone point me to it!)

What we're seeing in the literature, and here at IAH, are datasets where someone has sequenced the transcriptome of two samples using something like Illumina.  These have been mapped to known sequences and counts produced.

So what we have is something like this:

geneA: 22000 sequences from 260000 match in sample 1, 43000 sequences from 507000 in sample 2.

It's been suggested that one possible approach would be to construct 2x2 contingency tables and perform Fisher's exact test or the Chi-squared test, as has been applied to SAGE data.
However, I've found that when I do that, the p-values for this type of data are incredibly, incredibly small, such that over 90% of my data points are significant, even after adjusting for multiple testing.  I assume/hope that this is because these tests were not designed to cope with this type of data.

For instance, applying Fisher's test to the example above yields a p-value of 3.798644e-23.

As I see it there are three possibilities:
1) I'm doing something wrong
2) These tests are totally inappropriate for this type of data
3) All of my data points are highly significantly different

I'm thinking that 2 is probably true, though I wouldn't rule out 1.

Any thoughts and comments are very welcome,


        [[alternative HTML version deleted]]

Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

More information about the Bioconductor mailing list