[BioC] Gene expression analysis with edgeR with a large, nested design matrix

Tue May 27 00:34:17 CEST 2014

Dear Gordon,

I tried both (on a reduced design matrix), with quite similar results. 
The quasi-likelihood approach seems a little more conservative at least 
with borderline significant genes, but since limma/voom is still orders 
of magnitude faster, that's what I will use.

[I presume in the quasi-likelihood pipeline, the coefficients have to be 
specified in glmQLFTest() rather than topTags()]

Thanks very much for your help!
Uli

On 24/05/14 20:57, Gordon K Smyth wrote:
> Dear Uli,
>
> edgeR is probably the fastest of the glm negative binomial packages, 
> as we have done a lot of work moving all the fitting code to the C++ 
> level. Nevertheless, the ususal glm pipeline will become very slow 
> when there are over 3000 samples and the design matrix has 1500 columns.
>
> Here are two other possibilities.  First you could switch to voom 
> instead. I ran voom on your simulated data example in a few minutes on 
> my laptop computer:
>
>   y <- DGEList(counts = reads)
>   y <- calcNormFactors(y)
>   v <- voom(y,expdesign,plot=TRUE)
>   fit <- lmFit(v,expdesign)
>   fit <- eBayes(fit)
>   topTable(fit, coef=5)
>
> and so on.  Alternatively, you would stick to edgeR and use the 
> quasi-likelihood pipeline:
>
>   y <- DGEList(counts = reads)
>   y <- calcNormFactors(y)
>   y <- estimateGLMCommonDisp(y, design=expdesign, method="Pearson")
>   ql <- glmQLFTest(y,expdesign)
>   topTags(ql,coef=5)
>
> etc.  The glmQLFTest() function will take more time than voom but less 
> time than estimateGLMCcommonDispersion on your data.  The common 
> dispersion step is optional in this pipeline -- I include it because 
> you've already done it.
>
> Either of these approaches have good statistical motivations.  See the 
> help pages for voom and glmQLFTest for references to published papers 
> describing them.
>
> Best wishes
> Gordon
>
>> Date: Fri, 23 May 2014 09:06:21 -0700 (PDT)
>> From: "Uli Braunschweig [guest]" <guest at bioconductor.org>
>> To: bioconductor at r-project.org, u.braunschweig at utoronto.ca
>> Subject: [BioC] Gene expression analysis with edgeR with a large,
>>     nested design matrix
>>
>> Hi All,
>>
>> My problem is the following:
>> I have expression counts for 50 genes in an RNAi screen with 1,536 
>> treatments (which includes positive and negative controls, so really 
>> 1,416 unique treatments) in two replicates, done in 96-well format 
>> (2x16 plates). I know that plate effects and edge effects (whether a 
>> well was located on the edge of a plate) are significant, so the 
>> design should include treatment, plate, and location (edge or 
>> interior). Locations are identical between replicates. Each plate has 
>> two negative controls ("siNT"), as well as other controls.
>>
>> I am only interested in the contrasts of each of the treatments vs. 
>> the "siNT" control. I thought that the model of edgeR would be useful 
>> to score significant hits while at the same time dealing with the 
>> mentioned technical biases in a meaningful manner. However, I've had 
>> to kill the analysis because estimating the trended and tag-wise 
>> dispersions takes excessively long.
>>
>> My question is: Is it even feasible to try to adress a problem with a 
>> design that has so few genes and so many treatments using edgeR (or 
>> DESeq2)?
>>
>> library(edgeR)
>>
>> ## Data look similar to this:
>> reads <- matrix(round(2000 * rexp(50 * 3072)), nrow=50)  # dense 
>> matrix of 50 genes x (1536 treatments in duplicate)
>>
>> ## Here is how I create my design factors:
>> # (I've left out 'replicate' because it is a linear combination of 
>> plates 1-16 and 17-32)
>> rows  <- rep(rep(1:8, each=12), 32)
>> cols  <- rep(rep(1:12, 8), 32)
>> plate <- rep(1:32, each=96)
>>
>> type.nt <-      rows == 1 & cols == 1 |   # the negative control to 
>> compare everything to; 2 per plate
>>                rows == 4 & cols == 9
>> type.posCtl <-  rows == 4 & cols == 3 |   # positive control; 2 per 
>> plate
>>                rows == 5 & cols == 9
>> type.mock <-    rows == 3 & cols == 3 |   # another control; 2 per plate
>>                rows == 8 & cols == 12
>> type.empty <-   plate %in% c(16, 32) & (  # another control; a bunch 
>> on only 2 plates
>>                rows %in% c(2:3,6:8) & cols == 9 |
>>                cols %in% c(10,11) |
>>                rows %in% 1:7 & cols == 12
>>                )
>> type.edge <- rows %in% c(1,8) | cols %in% c(1,12)  # position on the 
>> plate
>>
>> treat <- rep(paste("T", 1:1536, sep=""), 2)  # treatments
>> treat[type.nt]     <- "siNT"
>> treat[type.mock]   <- "mock"
>> treat[type.empty]  <- "empty"
>> treat[type.posCtl] <- "siPosCtl"
>> treatfac <- relevel(as.factor(treat), ref="siNT")
>>
>> edgefac  <- as.factor(ifelse(type.edge, yes="edge", no="interior"))
>> platefac <- as.factor(paste("P", plate, sep=""))
>>
>> expfact <- data.frame(treatment = treatfac,
>>                      platepos  = edgefac,
>>                      plate     = platefac
>>                      )
>>
>> expdesign <- model.matrix(formula(~ treatment + plate + platepos), 
>> data=expfact)
>>
>> ## Estimating the dispersions
>> y <- DGEList(counts = reads)  # reads is the 50x3072 matrix
>> y <- calcNormFactors(y)
>> y <- estimateGLMCommonDisp(y, design=expdesign, method="Pearson", 
>> verbose=TRUE)  # faster than Cox-Reid and probably ok since there are 
>> many treatments
>> y <- estimateGLMTrendedDisp(y, design=expdesign)
>> y <- estimateGLMTagwiseDisp(y, design=expdesign)
>> ## Neither of the last two steps finish running in a day; same for 
>> estimateGLMCommonDisp() if method="CoxReid"
>>
>> I was then hoping to extract the contrasts of each treatment against 
>> the "siNT" control.
>> Would it make sense to combine the two technical factors, or subset 
>> the count and design matrices for each treatment in a way that 
>> reduces the number of treatments, and run them separately? 
>> Alternatively, I thought of doing the analysis separately for each 
>> treatment using the whole count matrix but amalgamating all other 
>> non-control treatments in an "other" group". This seems feasible when 
>> run in parallel, but it would be overkill...
>> Any suggestions on how to proceed?
>>
>> Kind regards,
>> Uli
>>
>> -- 
>> Ulrich Braunschweig, PhD
>>
>> The Donnelly Centre
>> University of Toronto
>> 160 College Street, Room 1030
>> Toronto, Ontario
>> Canada M5S 3E1
>>
>> u.braunschweig at utoronto.ca
>>
>> -- output of sessionInfo():
>>
>> R version 3.1.0 (2014-04-10)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>> [3] LC_TIME=en_IE.UTF-8        LC_COLLATE=en_CA.UTF-8
>> [5] LC_MONETARY=en_IE.UTF-8    LC_MESSAGES=en_CA.UTF-8
>> [7] LC_PAPER=en_IE.UTF-8       LC_NAME=C
>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods base
>>
>> other attached packages:
>> [1] edgeR_3.6.2  limma_3.20.4
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:6}}