[BioC] A question with DEXSeq package: inconsistency between normalized counts vs. fitted expression, fitted splicing or fold changes

Mon Jun 25 17:51:01 CEST 2012

Dear DEXSeq authors and users,

I am using DEXSeq package (1.2.0) to do splicing analysis for human genome. One part of results from the analysis is really confusing me. It seems that for many genes the fitted expression, the fitted splicing and estimated fold changes are inconsistent (actually reversed) with its normalized counts.

To clearly explain the problem, I am showing a simple example with only two genes with 30 samples as below.

> load("toBioConductor.RData")
> library(DEXSeq)
> ls()
[1] "testgene"

###  For this data set, our interest is to investigate the differential usage of exons between two tissue "type", X and N.
###  The samples were collected from two batches, So we need to adjust the batch effects in the following model.

> formu <- count ~ sample + (exon + from)*type
> testgene <- estimateDispersions(testgene, formula=formu)
> testgene <- fitDispersionFunction(testgene)
> f0 <- count~sample + from*exon + type
> f1 <- count~sample + from*exon + type*I(exon==exonID)
> testgene <- testForDEU(testgene,formula0=f0, formula1=f1)
> testgene <- estimatelog2FoldChanges(testgene,fitExpToVar="type" )
> res <- DEUresultTable(testgene)

n the attached figure, toBioConductor-plotDEXSeq.pdf, for E011, its normalized counts of "N" are clearly smaller than those of "X". However, for both Fitted splicing and Fitted expression, the level of "N" is larger than that of "X". And if we checked the estimated fold change as below, the fold change is consistent with fitted expression and fitted splicing.

> res["ENSG00000001497:011",]
                             geneID exonID dispersion       pvalue    padjust
ENSG00000001497:011 ENSG00000001497   E011  0.4096852 0.0006160211 0.01334712
                    meanBase log2fold(X/N)
ENSG00000001497:011 5.714405     -1.936449

We also explicitly check the normalized counts for E011 of this gene. As below shows the median of each type. Those of "N" is clearly smaller than "X".

> dtbl <- data.frame(counts=counts(testgene,normalized=T)["ENSG00000001497:011",],type=pData(testgene)$type)
> with(dtbl,tapply(counts,type,median))
       X        N
8.534784 2.064785

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US       LC_NUMERIC=C         LC_TIME=en_US
 [4] LC_COLLATE=en_US     LC_MONETARY=en_US    LC_MESSAGES=en_US
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C
[10] LC_TELEPHONE=C       LC_MEASUREMENT=en_US LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] DEXSeq_1.2.0   Biobase_2.14.0

loaded via a namespace (and not attached):
[1] biomaRt_2.10.0 hwriter_1.3    plyr_1.7.1     RCurl_1.91-1   statmod_1.4.14
[6] stringr_0.6    tools_2.14.0   XML_3.9-4

So, have I made any mistakes in the analysis?

Many thanks in advance,

Hui

Hui Yao, Ph.D.
Principal Statistical Analyst
MD Anderson Cancer Center

-------------- next part --------------
A non-text attachment was scrubbed...
Name: toBioConductor-plotDEXSeq.pdf
Type: application/pdf
Size: 18588 bytes
Desc: toBioConductor-plotDEXSeq.pdf
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20120625/45c8527d/attachment.pdf>