[BioC] estimateDispersions in DESeq

Wolfgang Huber whuber at embl.de
Fri Jun 22 00:29:03 CEST 2012


Dear Flavia

Thank you for your detailed and informative feedback!

Simon will be better able to address the question regarding what exactly 
changed between the versions with respect to the different parameter 
choices (sharingMode, fitType, method).

Two other comments:
1. The fact that you see such a strong dependence on these choices might 
be a symptom of there being outliers (either whole samples being 
outliers, or certain measurements in some samples). Outlier detection is 
generally difficult to automate, yet outliers can have an excessive 
impact on inference, especially with such small sample sizes. Did you 
have a look at the 'pairs' plot between all 6 communities (on a log or 
log-like scale)?

2. The dispersion-mean model was developed on RNA-Seq data. I have no 
experience how well it fits to OTU counts in metagenomics. Your 
observation on the plot may indicate a problem. If you don't mind, I'd 
be interested in having a look at your data (you can anonymize the OTUs) 
to see how well the dispersion-mean model used by DESeq fits that.

	Best wishes
	Wolfgang

-> More below. ->

Jun/21/12 2:50 PM, Flavia Nunes scripsit::
> Dear List,
>
> I am trying to use DESeq to analyse a dataset where we have samples of 3
> healthy and 3 diseased microbial communities, and we are trying to
> establish which OTUs are significantly more or less abundant in the healthy
> vs diseased samples.
>
> I tried running the new version of DESeq (1.8.3) on both a Mac and a PC
> running the latest version of R (2.15).  Both versions give  a strange
> result, where all OTUs have a padj value that is >0.7.  I found this to be
> strange, because when looking at the raw count data, it is obvious that
> some OTUs are abundant in the one treatment (say, high counts in all of the
> heathy samples) and absent in the other (0 or close to 0 on all of the
> diseased samples).

Due to the variance-sharing of 'genes' with similar mean counts, this 
could easily happen if there are *other* OTU with similar counts, but 
highly discordant values within groups, forcing a high dispersion 
estimate, and costing power even for those OTUs that you mention.

To avoid the dispersion-mean model, if you are willing to make the jump: 
with 3 vs 3 samples you're in a place where sharingMode = 
"gene-est-only" might already be useful - but I would consider the above 
suggestions first.

>
> I asked a colleague to help me with the analysis, and he ran the analysis
> on an older version of DESeq (1.4), using the estimateVarianceFunction
> command instead of estimateDispersions.  We saw that in the help file for
> estimateDispersions, that by using the sharingMode="fit-only",
> fitType="local" options, we should be able to get the same result as the
> estimateVarianceFunction.  However, this is not the case.  DESeq 1.4 was
> able to find 54 OTUs that were significantly different from healthy vs
> diseased samples, while DESeq 1.8.3 found that none of the OTUs were
> significantly different in healthy vs diseased.
>
> In a second attempt, we used the option method="per-condition" and this
> worked - I got the same 54 significant p-values as in the analysis with
> DESeq 1.4  But when I continued the analysis on other datasets (we have a
> number of different conditions), I again started to get odd p-values, such
> as 1.00 for every OTU.  I changed the setting for the estimateDispersions
> command, trying different methods, and each time I would get a different
> set of p-values, but usually very high numbers, close to 1.
>
> It seems to me that the results are really sensitive to the method used to
> estimate dispersions, and I was wondering what are the properties of the
> data that I might have to look for in order to select the best method.
>
> Another unusual thing that I have noticed is that when I plot the
> Dispersion Estimates, the fit line deviates from the points towards the
> right side of the graph.  This suggests to me that there must be something
> wrong with the fit estimate, but I do not know how I might be able to
> change the settings to get a better fit.
>
> I wanted to know if anyone on the list has come across a similar problem?
>
> I am using the commands below in DESeq.  I can provide files of the data,
> as well as the results that I am receiving to anyone that might be
> interested in taking a closer look.
>
> WBDCountTable <- read.table( file.choose(), header=TRUE, row.names=1 )
> WBDDesign <- data.frame(row.names = colnames( WBDCountTable ), condition =
> c( "D1", "D2", "D3", "H1", "H2", "H3"), libType = c( "single-end",
> "single-end", "single-end", "single-end", "single-end", "single-end" ) )
> conds <- factor( c(  "D", "D", "D", "H", "H", "H"  ) )
> cds <- newCountDataSet( WBDCountTable, conds )
> cds <- estimateSizeFactors( cds )
> cds <- estimateDispersions( cds, method="per-condition", fitType="local" )
>
> plotDispEsts <- function( cds )
> {
>   plot(
> rowMeans( counts( cds ) ),
> fitInfo(cds)$perGeneDispEsts,
> pch = 16, cex=1, log="xy" )
> xg <- 10^seq( -.5, 5, length.out=300 )
> lines( xg, fitInfo(cds)$dispFun( xg ), col="red" , lwd=3)
> }
>
> res <- nbinomTest( cds, "D", "H" )
>
> plotDE <- function( res )
> plot(res$baseMean, res$log2FoldChange, log="x", pch=20, cex=.3, col =
> ifelse( res$padj < .1, "red", "black" ) )
> plotDE( res )
>
> res
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>


-- 
Best wishes
	Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber



More information about the Bioconductor mailing list