[BioC] using duplicateCorrelation with limma+voom for RNA-seq data

Wed May 21 18:43:50 CEST 2014

Hi John,

maybe I am wrong about this, but I would go for two step approach as Charity recommends.

The reasoning for this is the following:
What the voom function does is essentially to provide weights for the regression
fit. Now, in a situation where you want to estimate a correlation you first obtain "working weights" 
that are estimated WITHOUT the correlation. You use them to estimate the correlation and 
then use the voom function again, this time taking the estimated correlation into account in the
estimation of the  weights.

Now it can well be that it does not really matter for the computed weights whether the correlation is
taken into account or not (as your example seem to indicate) so a second run of voom might
not be necessary but I would always do it since it can never 'hurt'.  

The worst case is that you estimate essentially the same weights twice. (the voom function
returns them, so you can check!)

Best wishes,

Bernd

On May 21, 2014, at 5:36 PM, John Blischak <jdblischak at gmail.com> wrote:

> Hi,
> 
> I would like to use the duplicateCorrelation function along with limma and
> voom in order to analyze some RNA-seq data with blocking and group-means
> parametrization (similar to example 9.7 in the limma manual). The manual
> does not contain a similar example using the voom function for RNA-seq
> data. I searched the Bioconductor mailing list
> (query<https://encrypted.google.com/search?hl=en&q=voom%20site%3Ahttps%3A%2F%2Fstat.ethz.ch%2Fpipermail%2Fbioconductor%2F#hl=en&q=voom+duplicatecorrelation+site:https:%2F%2Fstat.ethz.ch%2Fpipermail%2Fbioconductor%2F>)
> and BioStars (query<https://encrypted.google.com/search?hl=en&q=voom%20duplicatecorrelation%20site%3Abiostars.org>)
> and found conflicting advice.
> 
> The first approach performs voom only once, prior to running
> duplicateCorrelation, e.g. this Bioc
> post<https://stat.ethz.ch/pipermail/bioconductor/2013-July/054087.html>and
> this Biostars
> post <https://www.biostars.org/p/54565/>. The second approach performs voom
> twice, both before and after running duplicateCorrelation, e.g. this Bioc
> post<https://stat.ethz.ch/pipermail/bioconductor/attachments/20130530/4dcc9475/attachment.pl>and
> this Biostars
> post <https://www.biostars.org/p/96889/>.
> 
> Is there a consensus on which is the better method? Does the choice depend
> on the study design?
> 
> In both my actual data and in the simulated example below, I have found the
> results to be very similar. Thus I am currently leaning towards using the
> single voom method since it achieves similar results with fewer steps. The
> simulated example is not very interesting because blocking does not have
> much effect in the first place (corfit$consensus = -0.002984798), as
> expected from random data. However, in my data set it does make a
> difference (corfit$consensus = 0.3433403) and the results of using single
> or double voom are still similar.
> 
> Thanks,
> 
> John
> 
> library("limma")
> library("edgeR")
> 
> # Simulate data
> set.seed(123)
> counts <- matrix(rpois(9000, lambda = 500), ncol = 90)
> rownames(counts) <- paste0("g_", 1:nrow(counts))
> anno <- data.frame(block = rep(paste0("block", 1:6), each = 15),
>                   fac1 = rep(c("A", "B", "C"), 30),
>                   fac2 = rep(paste0("treat", 1:5), 18))
> combined_fac <- factor(paste(anno$fac1, anno$fac2, sep = "."))
> design <- model.matrix(~0 + combined_fac)
> colnames(design) <- levels(combined_fac)
> 
> # Single voom
> y <- DGEList(counts)
> y <- calcNormFactors(y)
> v <- voom(y, design)
> corfit <- duplicateCorrelation(v, design, block = anno$block)
> fit <- lmFit(v, design, block = anno$block, correlation = corfit$consensus)
> cont_matrix <- makeContrasts(AvB_1 = A.treat1 - B.treat1,
>                             AvB_2 = A.treat2 - B.treat2, levels = design)
> fit2 <- contrasts.fit(fit, cont_matrix)
> fit2 <- eBayes(fit2)
> result_single <- topTable(fit2, number = nrow(counts), sort.by = "none")
> 
> # Double voom
> y <- DGEList(counts)
> y <- calcNormFactors(y)
> v <- voom(y, design)
> corfit <- duplicateCorrelation(v, design, block = anno$block)
> v <- voom(y, design, block = anno$block, correlation = corfit$consensus)
> fit <- lmFit(v, design, block = anno$block, correlation = corfit$consensus)
> cont_matrix <- makeContrasts(AvB_1 = A.treat1 - B.treat1,
>                             AvB_2 = A.treat2 - B.treat2, levels = design)
> fit2 <- contrasts.fit(fit, cont_matrix)
> fit2 <- eBayes(fit2)
> result_double <- topTable(fit2, number = nrow(counts), sort.by = "none")
> 
> # No blocking
> y <- DGEList(counts)
> y <- calcNormFactors(y)
> v <- voom(y, design)
> fit <- lmFit(v, design)
> cont_matrix <- makeContrasts(AvB_1 = A.treat1 - B.treat1,
>                             AvB_2 = A.treat2 - B.treat2, levels = design)
> fit2 <- contrasts.fit(fit, cont_matrix)
> fit2 <- eBayes(fit2)
> result_no_block <- topTable(fit2, number = nrow(counts), sort.by = "none")
> 
> # Comparison - all correlations > 0.99
> cor(result_single$adj.P.Val, result_double$adj.P.Val)
> cor(result_single$adj.P.Val, result_no_block$adj.P.Val)
> cor(result_double$adj.P.Val, result_no_block$adj.P.Val)
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor