[BioC] DESeq adjusted pvalue calculation / filtering data

Tue Nov 29 18:31:22 CET 2011

Just to clarify, the dissertation is the work of my student, Isaac Dialsingh.

The peak at zero indicates that you do have some differential 
expression that is greater than you would expect by chance.

Naomi

At 06:17 AM 11/29/2011, Markus Grohme wrote:
>Hi Simon, Hi Naomi,
>thanks for your feedback.
>
>The filtering removes a really nice chunk of my features:
> > use <- (rs > quantile(rs, 0.4))
> > quantile(rs,0.4)
>40%
>23
> > table(use)
>use
>FALSE TRUE
>28383 42190
>
>So after filtering my raw pvalues have two peaks in the histogram. One at
>p=0 and one at p=1. The "0" peak is half the size of "1". And some in
>between starting at half of "0" and ending at half of "1". So this tells me
>I should stay with Benjamini-Hochberg I guess. I would be really interested
>in your dissertation Naomi as I have still have a lot to learn here.
>As I understand my filtering cutoff the 0.4 quantile of my data is a sum of
>the rows <23. I guess thats fine.
>
>I agree with Simon that my experiment is flawed. And it makes sense to not
>try to optimize the statistics on flawed data. So asterisk-hunting is not
>really advisable. But as we were limited in sample AND sequencing
>capability thats what we did. Even if it is not perfect and is more of a
>exploratory study the results are quite interesting and a good starting
>point for further (and better) experiments.
>I was part interested in the inner workings of the statistical analysis of
>such data and if I could maybe improve my analysis a bit. But it seems I
>first have to improve the experiment first. :-)
>
>Thanks for your time.
>
>Cheers,
>
>Markus
>
>
>
>Am 26. November 2011 20:45 schrieb Naomi Altman <naomi at stat.psu.edu>:
>
> > I am doing research on the use of FDR methods with count data.
> >
> > Filtering definitely helps.  You want to remove features which have so few
> > counts that you cannot achieve statistical significance even if all the
> > reads come from 1 condition.  This is a bit complicated to determine using
> > DESeq due to the dispersion shrinkage, but 10 to 20 are probably good
> > cut-offs.
> >
> > Storey's method works well with count data if the estimate of pi_0 is OK.
> >  To determine this, draw a histogram of the raw p-values (from the filtered
> > data).  There should be a single peak near p=0.  If there is another peak
> > near p=1, then Storey's method does not work so well.  The Benjamini and
> > Hochberg method is more conservative, but it at least works.
> >
> > The dissertation on which my comments are based should be available by the
> > end of January.  I will post a link as soon as I am able.
> >
> > Naomi
> >
> >
> > At 02:05 PM 11/25/2011, Simon Anders wrote:
> >
> >> Dear Markus,
> >>
> >> there are several questions in your mail; I try to answer them separately.
> >>
> >> 1. Storey's qvalues: While, technically, the applicability of Storey's
> >> method might be a bit more narrow that of Benjamini and Hochberg's, within
> >> transcriptomics both are usually equally applicable, and in, Storey's does
> >> give more results.
> >>
> >> Internally, DESeq calculates the adjusted p values with something like
> >>
> >>  res$padj <- p.adjust( res$pval, method="BH" )
> >>
> >> You can also convert the raw p values (res$pval) yourself with Storey's
> >> package if you have it installed. Beware that it does not handle NAs well,
> >> you may need to take out the NA p values and put them back in.
> >>
> >> 2. Independent filtering: In the newest version of the DESeq voignette,
> >> we have added a section on independent filtering. Removing, 
> e.g., all genes
> >> with, say, an average count below 10 does give you some extra hits.
> >>
> >> 3. The real reason that you have so few hits is your lack of replicates.
> >> In this situation, DESeq reports by design only those hits that are
> >> strikingly obvious, and doing otherwise wih a sound analysis method is
> >> impossible. You cannot expect to get useful results with a flawed
> >> experimental design -- and while the two points above might give you a few
> >> extra hit, you are unlikely to get usable result without fixing your
> >> experiment.
> >>
> >> 4. Sequencing depth: Remember that it is the total number of counts per
> >> gene and _condition_ (not: sample) that gives you power for weakly
> >> expressed genes, and the number of replicates that gives your 
> power for the
> >> strongly expressed genes. Hence, whenever practically feasible, it is
> >> always better to sequence many biological replicate samples to moderate
> >> depth than to sequence a few samples very deeply. (Of course, even if
> >> replicates are difficult to obtain, two replicates is the 
> minimum. Doing an
> >> experiment without that is pointless.)
> >>
> >>  Simon
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
>
>         [[alternative HTML version deleted]]
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at r-project.org
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives: 
>http://news.gmane.org/gmane.science.biology.informatics.conductor