[BioC] bug in estimateDisp() or mistake in documentation?
Zadeh, Jenny Drnevich
drnevich at illinois.edu
Tue Jul 22 21:41:08 CEST 2014
Hi Gordon,
Thanks for the clarification. Yes, the documentation should be reworded - I did understand from it that the GLM dispersion estimates were slightly different in this function because the help page says it is "similar to" the trio of GLM dispersion functions, but the qCML method says it "is the same as". Therefore, I thought it was a deliberate distinction between the two.
Jenny
-----Original Message-----
From: Gordon K Smyth [mailto:smyth at wehi.EDU.AU]
Sent: Tuesday, July 15, 2014 8:23 PM
To: Zadeh, Jenny Drnevich
Cc: bioconductor at r-project.org
Subject: Re: bug in estimateDisp() or mistake in documentation?
Hi Jenny,
The estimateDisp() function is correct. It does in fact use the qCML dispersion estimates when there is no design matrix, and it does this even though a design matrix is computed internally. However it doesn't give exactly the same estimates as estimateCommonDisp and estimateTagwiseDisp because it is a complete rewrite of the approach with somewhat different internal grid parameter settings. I agree that this last point is potentially confusing and is not adequately explained in the documentation. Our intention actually is to consolidate the functions so that estimateDisp() is always the engine even when estimateTagwiseDisp() is called.
Best wishes
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth
On Tue, 15 Jul 2014, Zadeh, Jenny Drnevich wrote:
> Hello,
>
> I've found either a bug in estimateDisp() or a mistake in the
> documentation about when it uses the classic qCML vs. the GLM CR
> dispersion estimates. The help page for estimateDisp() says
> "If there is no design matrix, it calculates the quantile conditional
> likelihood for each gene (tag) and then maximize it. The method is
> same as in the function estimateCommonDisp and estimateTagwiseDisp".
> However, when you call estimateDisp() without specifying a design
> matrix but with a DGEList object with more than one level in
> samples$group, it internally calculates a design matrix and gives you
> common/trended/tagwise dispersion estimates. Actually, even when
> samples$group has only one level, estimateDisp still internally
> creates a unit vector design matrix and outputs common/trended/tagwise
> dispersion estimates. So is it a bug in the code or a mistake in the
> documentation and estimateDisp() can't be used to get the same results
> as estimateCommonDisp() and estimateTagwiseDisp()? (example below)
>
> It was my understanding that for a single factor experiment (2 or more
> groups), the classic qCML was preferred over the GLM dispersion.
> However, if you have more than 2 groups and you want to calculate a
> oneway ANOVA or specialized contrasts, it seems you have to use the
> GLM approach. Is the GLM approach therefore becoming the best method
> for all experimental designs, even those with only a single factor?
>
> Thanks,
> Jenny
>
> #example - data as indicated in edgeRUsersGuide() section 4.3
>> library(edgeR)
> Loading required package: limma
>> shell.exec(system.file("doc","edgeRUsersGuide.pdf",package="edgeR"))
>>
>> targets <- readTargets()
>> targets
> Lane Treatment Label
> Con1 1 Control Con1
> Con2 2 Control Con2
> Con3 2 Control Con3
> Con4 4 Control Con4
> DHT1 5 DHT DHT1
> DHT2 6 DHT DHT2
> DHT3 8 DHT DHT3
>> x <- read.delim("pnas_expression.txt", row.names=1,
>> stringsAsFactors=FALSE) y <- DGEList(counts=x[,1:7],
>> group=targets$Treatment, genes=data.frame(Length=x[,8]))
>> colnames(y) <- targets$Label
>> keep <- rowSums(cpm(y)>1) >= 3
>> y <- y[keep,]
>> y$samples$lib.size <- colSums(y$counts) y <- calcNormFactors(y)
>> y$samples
> group lib.size norm.factors
> Con1 Control 976847 1.0296636
> Con2 Control 1154746 1.0372521
> Con3 Control 1439393 1.0362662
> Con4 Control 1482652 1.0378383
> DHT1 DHT 1820628 0.9537095
> DHT2 DHT 1831553 0.9525624
> DHT3 DHT 680798 0.9583181
>>
>> y2 <- y
>>
>> y <- estimateCommonDisp(y, verbose=TRUE)
> Disp = 0.02002 , BCV = 0.1415
>> y <- estimateTagwiseDisp(y)
>> y2 <- estimateDisp(y2)
> Loading required package: splines
>>
>> names(y)
> [1] "counts" "samples" "genes"
> [4] "common.dispersion" "pseudo.counts" "pseudo.lib.size"
> [7] "AveLogCPM" "prior.n" "tagwise.dispersion"
>> names(y2)
> [1] "counts" "samples" "genes"
> [4] "common.dispersion" "trended.dispersion" "trend.method"
> [7] "AveLogCPM" "span" "tagwise.dispersion"
> [10] "prior.df" "prior.n"
>>
>> all.equal(y$tagwise.dispersion, y2$tagwise.dispersion)
> [1] "Mean relative difference: 0.1156227"
>>
>> et <- exactTest(y)
>> summary(decideTestsDGE(et))
> [,1]
> -1 2096
> 0 12059
> 1 2339
>>
>> et2 <- exactTest(y2)
>> summary(decideTestsDGE(et2))
> [,1]
> -1 2078
> 0 12084
> 1 2332
>>
>>
>> sessionInfo()
> R version 3.1.1 (2014-07-10)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United
> States.1252 [3] LC_MONETARY=English_United States.1252 [4]
> LC_NUMERIC=C [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] splines stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] locfit_1.5-9.1 edgeR_3.6.6 limma_3.20.8
> [4] BiocInstaller_1.14.2
>
> loaded via a namespace (and not attached):
> [1] grid_3.1.1 lattice_0.20-29 tools_3.1.1
>
>
> Jenny Drnevich, Ph.D.
>
> Functional Genomics Bioinformatics Specialist High Performance
> Biological Computing Program and The Roy J. Carver Biotechnology
> Center University of Illinois, Urbana-Champaign NOTE NEW OFFICE
> LOCATION
> 2112 IGB
> 1206 W. Gregory Dr.
> Urbana, IL 61801
> USA
> ph: 217-300-6543
> fax: 217-265-5066
> e-mail: drnevich at illinois.edu<mailto:drnevich at illinois.edu>
>
>
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}
More information about the Bioconductor
mailing list