[BioC] DESeq adjusted pvalue calculation / filtering data

Markus Grohme 000.calabi.yau.000 at googlemail.com
Thu Nov 24 11:39:11 CET 2011


Dear all,

Simon Anders kindly asked me to repeat my question on this mailing list
considering the statistical analysis of my RNA sequencing data.

I am rather new to Bioconductor and RNA sequencing analysis (molecular
biologist) and tried to read myself a bit into the statistics behind the
analysis in DESeq and found the publication of "Storey & Tibshirani, 2003"
and tested also the R program "qvalue". In the paper he authors claim that
that the Benjamini/Hochberg method is too conservative for most genomics
studies. After running my analysis in DESeq I did some analysis of the
pvalues using "qvalue". How does the way the adjusted p-value is calculated
using Benjamini/Hochberg and qvalue differ, the ones generated by DESeq
(which uses the BH method as I understand it). The package DE"G"Seq (Wang
et. al 2010) allows both B/H and qvalue methods to be used for adj-pvalue
calculation. The Bourgon_2010 paper for independent filtering also briefly
mentions qvalue (ref.23).
I am not really a statistician, not even a real bioinformatician so I would
like to ask more experienced people dealing with this kind of data on a
regular basis. Some of the programs have been written with microarrays in
mind so I am not sure how this will perform on sequencing count data. I am
also not sure how my experimental setup fits into all of this.

My experimental setup is (not standard I think):
4 biological samples no biological no technical replicate (I know far from
perfect)
1 lane Illumina GAII 36bp single end each where i get around 7-8 Mio reads
mapped (basis of my counts).

I do not sample over the whole transcript but sequenced in a window of ~300
bases of 3' (fragmented cDNA / PolyA selected). This means all my tags are
clustered around the 3' of the transcript. This gives me the advantage that
my counts are not distributed over sequences that I can't connect
informatically without a reference genome. My reference was a transcriptome
assembly (non-model organism, Sanger/454/Illumina data). So I have about
70.000 features as a reference but due to the nature of a transcriptome
assembly I get some fragmentation. E.g. a transcript is represented in 3
separate contigs. Of these 3 contigs only the 3' most gets read counts
assigned. This means I am testing against 3 features of which only 1 is
capable of giving me informative results.
I know that the experiment is not the best setup. But there should be
relatively little biological variability as I pooled 200 synchronized
individuals that are also more or less clones because they reproduce
asexually. Also the 4 stages are a circular time course that is only some
hours apart and the gene expression patterns should overlap. So the e.g.
A->B->C->D(->A) share in part expression changes with their adjacent sample
(D will finally end up being A again.) My interest is finding out what is
different in B,C,D compared to A that is my normal physiological state.

I am concerned how my overabundance of uninformative features generates a
multiple testing correction problem due to my experimental setup. Many tag
count rows give just 0,0,0,0 or spurious hits like 0,2,0,5 due to hits more
5' in some fragments not connected by the assembly to the 3'. Should I
remove the rows that are just background noise? Or should I let DESeq do
the filtering using the independent filtering option (if it is even
possible or makes sense for my data). I also attached a picture of A vs B
and marked at 5% false discovery rate cutoff from DESeq. It seems that the
changes are not very subtle. I understand that by not having any replicates
my test will not have strong power to detect small differences. So will
filtering actually help here? Also it seems that my sequencing depth is
also on the lower end so I would need to get more data to statistically
venture into the low expression fraction.

Does anybody have some clues how to approach this and what makes sense and
what doesn't?

Cheers,

Markus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: A_vs_B_q05.png
Type: image/png
Size: 39357 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20111124/744f00eb/attachment.png>


More information about the Bioconductor mailing list