[BioC] limma differences with and without an intercept

Gordon K Smyth smyth at wehi.EDU.AU
Mon Feb 10 23:55:01 CET 2014


Dear Rory,

> Date: Sun,  9 Feb 2014 13:44:50 -0800 (PST)
> From: "Rory Kirchner [guest]" <guest at bioconductor.org>
> To: bioconductor at r-project.org, rory.kirchner at gmail.com
> Subject: [BioC] limma differences with and without an intercept
>
>
> When setting up a model in limma, if I include or exclude an intercept 
> term, I get slightly different results.

There are a few situations in a which contrasts.fit() uses an 
approximation to avoid having to refit the linear model for each gene. 
The help page for contrasts.fit gives the following warning:

"Warning. For efficiency reasons, this function does not re-factorize the 
design matrix for each probe. A consequence is that, if the design matrix 
is non-orthogonal and the original fit included quality weights or missing 
values, then the unscaled standard deviations produced by this function 
are approximate rather than exact. The approximation is usually 
acceptable. The results are always exact if the original fit was a oneway 
model."

This approximation has generally been fine in the past with microarray 
data, but the voom weights are pushing the approximation to the point that 
it is having a noticeable effect on some analyses.  We are planning to 
rewrite the linear model code in the limma to eliminate the approximation 
by storing more information from each fit.

> For example here is a simplified 
> version of my model, which uses a paired design, each patient is 
> represented twice with a measurement when sick and when recovered:
>
> design = model.matrix(~ patient + status,  data=samples_norm)
> colnames(design)
> [1] "(Intercept)"        "patientA010"        "patientA011"
> [4] "patientA019"        "patientA024"        "patientA030"
> [7] "patientA031"        "patientA034"        "patientA036"
> [10] "patientA041"        "patientA055"        "patientA067"
> [13] "patientA073"        "patientA080"        "statusexacerbation"
>
> The first patient is patient A004. If I run an analysis and compare patient A010 to the baseline (patientA004):
>
> y = DGEList(counts=counts)
> y = calcNormFactors(y)
> v = voom(y, design)
> fit = lmFit(v, design)
> fit = eBayes(fit)
> a = rownames(topTable(fit, coef=2, n=Inf, p.value=0.05))
> length(a)
> [1] 70

This result is exact.

> If I do the same analysis, but use an intercept term I get a slightly different result:
> design = model.matrix(~ 0 + patient + status,
>    data=samples_norm)
> y = DGEList(counts=counts)
> y = calcNormFactors(y)
> v = voom(y, design)
> fit = lmFit(v, design)
> cm = makeContrasts(
>    onepatient=patientA010-patientA004,
>    levels=design)
> fit = contrasts.fit(fit, cm)
> fit = eBayes(fit)
> b = rownames(topTable(fit, n=Inf, p.value=0.05))
> length(b)
> [1] 72

This result is approximate.

>> length(intersect(a, b))
> [1] 69
>
> This does not occur using the pickrell1 dataset.

Because the design matrix is orthogonal in that case.  The approximation 
only arises when there is an additive term (+ status in your case).

Best wishes
Gordon

> At first I thought this 
> might just be random due to rounding of the p-values, but one of them is 
> not really close to the cutoff so that isn't it. Should I not be 
> expecting these two ways of parameterizing the model to give the same 
> results?
>
> -- output of sessionInfo():
>
>> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-apple-darwin13.0.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] splines   grid      parallel  stats     graphics  grDevices utils
> [8] datasets  methods   base
>
> other attached packages:
> [1] Vennerable_3.0            gtools_3.2.1
> [3] RBGL_1.38.0               graph_1.40.1
> [5] CHBUtils_0.1              tweeDEseqCountData_1.0.10
> [7] sva_3.8.0                 statmod_1.4.18
> [9] rpart_4.1-5               RColorBrewer_1.0-5
> [11] mgcv_1.7-28               nlme_3.1-113
> [13] DESeq_1.14.0              lattice_0.20-24
> [15] locfit_1.5-9.1            corpcor_1.6.6
> [17] BiocInstaller_1.12.0      affyPLM_1.38.0
> [19] preprocessCore_1.24.0     gcrma_2.34.0
> [21] affy_1.40.0               gplots_2.12.1
> [23] biomaRt_2.18.0            knitr_1.5
> [25] devtools_1.4.1            reshape_0.8.4
> [27] plyr_1.8                  DESeq2_1.2.10
> [29] RcppArmadillo_0.4.000.2   Rcpp_0.11.0
> [31] GenomicRanges_1.14.4      XVector_0.2.0
> [33] IRanges_1.20.6            vsn_3.30.0
> [35] gridExtra_0.9.1           ggplot2_0.9.3.1
> [37] HTSFilter_1.2.1           Biobase_2.22.0
> [39] BiocGenerics_0.8.0        edgeR_3.4.2
> [41] limma_3.18.11             googleVis_0.4.7
> [43] xtable_1.7-1              extrafont_0.16
> [45] dplyr_0.1.1
>
> loaded via a namespace (and not attached):
> [1] affyio_1.30.0        annotate_1.40.0      AnnotationDbi_1.24.0
> [4] assertthat_0.1       Biostrings_2.30.1    bitops_1.0-6
> [7] caTools_1.16         codetools_0.2-8      colorspace_1.2-4
> [10] compiler_3.0.2       DBI_0.2-7            dichromat_2.0-0
> [13] digest_0.6.4         evaluate_0.5.1       extrafontdb_1.0
> [16] formatR_0.10         gdata_2.13.2         genefilter_1.44.0
> [19] geneplotter_1.40.0   gtable_0.1.2         httr_0.2
> [22] KernSmooth_2.23-10   labeling_0.2         MASS_7.3-29
> [25] Matrix_1.1-2         memoise_0.1          munsell_0.4.2
> [28] proto_0.3-10         RCurl_1.95-4.1       reshape2_1.2.2
> [31] RJSONIO_1.0-3        RSQLite_0.11.4       Rttf2pt1_1.2
> [34] scales_0.2.3         stats4_3.0.2         stringr_0.6.2
> [37] survival_2.37-7      tools_3.0.2          whisker_0.3-2
> [40] XML_3.98-1.1         zlibbioc_1.8.0
>
> --
> Sent via the guest posting facility at bioconductor.org.

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list