[BioC] bug in estimateDisp() or mistake in documentation?

Tue Jul 22 21:41:08 CEST 2014

Hi Gordon,

Thanks for the clarification. Yes, the documentation should be reworded - I did understand from it that the GLM dispersion estimates were slightly different in this function because the help page says it is "similar to" the trio of GLM dispersion functions, but the qCML method says it "is the same as". Therefore, I thought it was a deliberate distinction between the two.

Jenny

-----Original Message-----
From: Gordon K Smyth [mailto:smyth at wehi.EDU.AU] 
Sent: Tuesday, July 15, 2014 8:23 PM
To: Zadeh, Jenny Drnevich
Cc: bioconductor at r-project.org
Subject: Re: bug in estimateDisp() or mistake in documentation?

Hi Jenny,

The estimateDisp() function is correct.  It does in fact use the qCML dispersion estimates when there is no design matrix, and it does this even though a design matrix is computed internally.  However it doesn't give exactly the same estimates as estimateCommonDisp and estimateTagwiseDisp because it is a complete rewrite of the approach with somewhat different internal grid parameter settings.  I agree that this last point is potentially confusing and is not adequately explained in the documentation.  Our intention actually is to consolidate the functions so that estimateDisp() is always the engine even when estimateTagwiseDisp() is called.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth

On Tue, 15 Jul 2014, Zadeh, Jenny Drnevich wrote:

> Hello,
>
> I've found either a bug in estimateDisp() or a mistake in the 
> documentation about when it uses the classic qCML vs. the GLM CR 
> dispersion estimates. The help page for estimateDisp() says

> "If there is no design matrix, it calculates the quantile conditional 
> likelihood for each gene (tag) and then maximize it. The method is 
> same as in the function estimateCommonDisp and estimateTagwiseDisp".

> However, when you call estimateDisp() without specifying a design 
> matrix but with a DGEList object with more than one level in 
> samples$group, it internally calculates a design matrix and gives you 
> common/trended/tagwise dispersion estimates. Actually, even when 
> samples$group has only one level, estimateDisp still internally 
> creates a unit vector design matrix and outputs common/trended/tagwise 
> dispersion estimates. So is it a bug in the code or a mistake in the 
> documentation and estimateDisp() can't be used to get the same results 
> as estimateCommonDisp() and estimateTagwiseDisp()? (example below)
>
> It was my understanding that for a single factor experiment (2 or more 
> groups), the classic qCML was preferred over the GLM dispersion.
> However, if you have more than 2 groups and you want to calculate a 
> oneway ANOVA or specialized contrasts, it seems you have to use the 
> GLM approach. Is the GLM approach therefore becoming the best method 
> for all experimental designs, even those with only a single factor?
>
> Thanks,
> Jenny
>
> #example - data as indicated in edgeRUsersGuide() section 4.3
>> library(edgeR)
> Loading required package: limma
>> shell.exec(system.file("doc","edgeRUsersGuide.pdf",package="edgeR"))
>>
>> targets <- readTargets()
>> targets
>     Lane Treatment Label
> Con1    1   Control  Con1
> Con2    2   Control  Con2
> Con3    2   Control  Con3
> Con4    4   Control  Con4
> DHT1    5       DHT  DHT1
> DHT2    6       DHT  DHT2
> DHT3    8       DHT  DHT3
>> x <- read.delim("pnas_expression.txt", row.names=1, 
>> stringsAsFactors=FALSE) y <- DGEList(counts=x[,1:7], 
>> group=targets$Treatment, genes=data.frame(Length=x[,8]))
>> colnames(y) <- targets$Label
>> keep <- rowSums(cpm(y)>1) >= 3
>> y <- y[keep,]
>> y$samples$lib.size <- colSums(y$counts) y <- calcNormFactors(y) 
>> y$samples
>       group lib.size norm.factors
> Con1 Control   976847    1.0296636
> Con2 Control  1154746    1.0372521
> Con3 Control  1439393    1.0362662
> Con4 Control  1482652    1.0378383
> DHT1     DHT  1820628    0.9537095
> DHT2     DHT  1831553    0.9525624
> DHT3     DHT   680798    0.9583181
>>
>> y2 <- y
>>
>> y <- estimateCommonDisp(y, verbose=TRUE)
> Disp = 0.02002 , BCV = 0.1415
>> y <- estimateTagwiseDisp(y)
>> y2 <- estimateDisp(y2)
> Loading required package: splines
>>
>> names(y)
> [1] "counts"             "samples"            "genes"
> [4] "common.dispersion"  "pseudo.counts"      "pseudo.lib.size"
> [7] "AveLogCPM"          "prior.n"            "tagwise.dispersion"
>> names(y2)
> [1] "counts"             "samples"            "genes"
> [4] "common.dispersion"  "trended.dispersion" "trend.method"
> [7] "AveLogCPM"          "span"               "tagwise.dispersion"
> [10] "prior.df"           "prior.n"
>>
>> all.equal(y$tagwise.dispersion, y2$tagwise.dispersion)
> [1] "Mean relative difference: 0.1156227"
>>
>> et <- exactTest(y)
>> summary(decideTestsDGE(et))
>   [,1]
> -1  2096
> 0  12059
> 1   2339
>>
>> et2 <- exactTest(y2)
>> summary(decideTestsDGE(et2))
>   [,1]
> -1  2078
> 0  12084
> 1   2332
>>
>>
>> sessionInfo()
> R version 3.1.1 (2014-07-10)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United
> States.1252 [3] LC_MONETARY=English_United States.1252 [4] 
> LC_NUMERIC=C [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] locfit_1.5-9.1       edgeR_3.6.6          limma_3.20.8
> [4] BiocInstaller_1.14.2
>
> loaded via a namespace (and not attached):
> [1] grid_3.1.1      lattice_0.20-29 tools_3.1.1
>
>
> Jenny Drnevich, Ph.D.
>
> Functional Genomics Bioinformatics Specialist High Performance 
> Biological Computing Program and The Roy J. Carver Biotechnology 
> Center University of Illinois, Urbana-Champaign NOTE NEW OFFICE 
> LOCATION
> 2112 IGB
> 1206 W. Gregory Dr.
> Urbana, IL 61801
> USA
> ph: 217-300-6543
> fax: 217-265-5066
> e-mail: drnevich at illinois.edu<mailto:drnevich at illinois.edu>
>
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}