[BioC] DESeq adjusted pvalue calculation / filtering data

Naomi Altman naomi at stat.psu.edu
Sat Nov 26 20:45:33 CET 2011


I am doing research on the use of FDR methods with count data.

Filtering definitely helps.  You want to remove features which have 
so few counts that you cannot achieve statistical significance even 
if all the reads come from 1 condition.  This is a bit complicated to 
determine using DESeq due to the dispersion shrinkage, but 10 to 20 
are probably good cut-offs.
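
A minimal sketch of such a filter in R, assuming the cut-off is applied to the total count per feature across all samples. The count matrix, the gene names, and the name min_total are made up for illustration.

```r
# Toy count matrix: rows are features, columns are samples (made-up numbers).
counts <- matrix(
  c(0,   1,  0,   2,     # total 3: too few reads to reach significance
    5,   3,  6,   4,     # total 18
    120, 98, 210, 250),  # total 678
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("ctl1", "ctl2", "trt1", "trt2"))
)

min_total <- 10                       # assumed cut-off, from the 10 to 20 range
keep <- rowSums(counts) >= min_total  # TRUE for features with enough reads
filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))
```

Here geneA is dropped and the other two features are kept; the test is then run on the filtered matrix only.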

Storey's method works well with count data if the estimate of pi_0 is 
OK.  To determine this, draw a histogram of the raw p-values (from 
the filtered data).  There should be a single peak near p=0.  If 
there is another peak near p=1, then Storey's method does not work so 
well.  The Benjamini and Hochberg method is more conservative, but it 
at least works.
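
The check described above could be sketched in base R as follows. The p-values are simulated purely for illustration, and the single-lambda estimate of pi_0 (with lambda = 0.5 chosen arbitrarily) is a simplified stand-in for what Storey's package does, not its actual procedure.

```r
set.seed(1)
# Simulated raw p-values (an assumption, not real data):
# 900 nulls, uniform on [0, 1], plus 100 true effects concentrated near 0.
pvals <- c(runif(900), rbeta(100, 0.5, 20))

# Bin the p-values; drop plot = FALSE to draw the histogram instead.
h <- hist(pvals, breaks = seq(0, 1, length.out = 21), plot = FALSE)
print(h$counts)  # look for one peak near 0 and a flat tail, no peak near 1

# Simplified Storey-type estimate of pi_0 at one fixed lambda.
lambda <- 0.5
pi0_hat <- mean(pvals > lambda) / (1 - lambda)
print(pi0_hat)

# Benjamini-Hochberg adjustment, the more conservative fallback.
padj_bh <- p.adjust(pvals, method = "BH")
```

If the histogram shows a second peak near p=1, the pi_0 estimate is not to be trusted and the BH-adjusted values are the safer choice.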

The dissertation on which my comments are based should be available 
by the end of January.  I will post a link as soon as I am able.

Naomi

At 02:05 PM 11/25/2011, Simon Anders wrote:
>Dear Markus,
>
>there are several questions in your mail; I will try to answer them separately.
>
>1. Storey's qvalues: While, technically, the applicability of 
>Storey's method might be a bit narrower than that of Benjamini and 
>Hochberg's, within transcriptomics both are usually equally 
>applicable, and Storey's does give more results.
>
>Internally, DESeq calculates the adjusted p values with something like
>
>   res$padj <- p.adjust( res$pval, method="BH" )
>
>You can also convert the raw p values (res$pval) yourself with 
>Storey's package if you have it installed. Beware that it does not 
>handle NAs well; you may need to take out the NA p values and put them back in.
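
The take-out-and-put-back step for NA p values could be sketched like this. The helper adjust_skipping_na is hypothetical, the p values are made up, and BH stands in for the adjustment so that the sketch runs without extra packages; the commented-out line shows how Storey's qvalue package would presumably be plugged in.

```r
# Hypothetical helper: apply an adjustment function to the non-NA p values
# only, then put the results back in their original positions.
adjust_skipping_na <- function(p, adjust_fun) {
  out <- rep(NA_real_, length(p))  # NA inputs stay NA in the output
  ok <- !is.na(p)
  out[ok] <- adjust_fun(p[ok])
  out
}

# Stand-in for res$pval, with NAs mixed in (made-up values).
pval <- c(0.001, NA, 0.04, 0.5, NA, 0.2)

# Demonstrated with BH so that no extra package is needed.
padj <- adjust_skipping_na(pval, function(p) p.adjust(p, method = "BH"))
print(padj)

# With Storey's package installed, the call would look something like:
# qval <- adjust_skipping_na(pval, function(p) qvalue::qvalue(p)$qvalues)
```

The output has the same length and order as the input, so it can be assigned straight back into a column of res.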
>
>2. Independent filtering: In the newest version of the DESeq 
>vignette, we have added a section on independent filtering. 
>Removing, say, all genes with an average count below 10 does 
>give you some extra hits.
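
A toy illustration of why filtering before adjustment can yield extra hits, using made-up mean counts and raw p values rather than real DESeq output. The threshold of 10 follows the example above; all variable names are assumptions.

```r
# Made-up per-gene mean counts and raw p values, for illustration only.
mean_count <- c(2, 3, 1, 150, 80, 40, 4, 300)
pval       <- c(0.60, 0.45, 0.90, 0.004, 0.03, 0.012, 0.75, 0.20)

# BH adjustment over all genes.
padj_all <- p.adjust(pval, method = "BH")

# Filter on the mean count, a criterion that ignores the condition labels,
# then adjust only the p values of the genes that were kept.
keep <- mean_count >= 10
padj_filtered <- rep(NA_real_, length(pval))
padj_filtered[keep] <- p.adjust(pval[keep], method = "BH")

hits_all      <- sum(padj_all < 0.05)
hits_filtered <- sum(padj_filtered < 0.05, na.rm = TRUE)
print(c(all = hits_all, filtered = hits_filtered))
```

In this toy data the number of genes with adjusted p below 0.05 goes from 2 to 3, because the correction is spread over 4 tests instead of 8.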
>
>3. The real reason that you have so few hits is your lack of 
>replicates. In this situation, DESeq reports by design only those 
>hits that are strikingly obvious, and doing otherwise with a sound 
>analysis method is impossible. You cannot expect to get useful 
>results with a flawed experimental design -- and while the two 
>points above might give you a few extra hits, you are unlikely to get 
>usable results without fixing your experiment.
>
>4. Sequencing depth: Remember that it is the total number of counts 
>per gene and _condition_ (not: sample) that gives you power for 
>weakly expressed genes, and the number of replicates that gives you 
>power for the strongly expressed genes. Hence, whenever practically 
>feasible, it is always better to sequence many biological replicate 
>samples to moderate depth than to sequence a few samples very 
>deeply. (Of course, even if replicates are difficult to obtain, two 
>replicates is the minimum. Doing an experiment without that is pointless.)
>
>   Simon
>


