[BioC] Remove batch effects from RNA-seq data using edgeR and sva/ComBat

David O'Brien dobrie10 at jhmi.edu
Tue Jun 4 21:40:20 CEST 2013


I'm trying to analyze an RNA-seq experiment where the PCA plot shows better
clustering by the day the experiment was done rather than treatment type.
Using edgeR to determine differentially expressed genes resulted in less
than 5 genes with an FDR under 5%. Creating a GLM model to remove batch
effects for day of experiment as stated in the edgeR manual resulted in 42
genes with an FDR less than 5%. An improvement, but still not good. So I
tried using ComBat and the result was 986 genes with an FDR under 5%.
Looking at the GO enrichment, the differentially expressed genes seem to
make sense, but since ComBat was developed for microarrays, I'm concerned
that there may be some caveats with this approach that I'm missing. Looking
at the top genes below, the log2 fold change is really low and generally
this just seems too good to be true. So my question is: Are there any
reasons why using ComBat with RNA-seq data is not legit? And if so, can you
see any problems with the approach below?

          mean_control mean_treatment      logFC         pval         padj
Gene5727     51.224797      45.371919 -0.1750427 3.474361e-08 0.0003224554
Gene3059      8.998311       5.740828 -0.6483954 1.056473e-07 0.0003268376
Gene11899    35.044302      39.027842  0.1553238 7.398559e-08 0.0003268376
Gene11724     2.556712       3.684178  0.5270535 1.959058e-07 0.0003636404
Gene12218    30.852989      23.702209 -0.3803888 1.908726e-07 0.0003636404

Gene4952     26.122068      30.466346  0.2219474 3.346424e-07 0.0005176360


My code is below. I've attached a file, dge.Rdata, that contains the
counts info that is output from readDGE, so you can have the initial
counts info.


require(edgeR)
require(sva)
source('code/annotate_edgeR.R')
files = data.frame(files=c('counts.control0', 'counts.control1',
'counts.control2', 'counts.control3', 'counts.treatment0',
'counts.treatment1', 'counts.treatment2', 'counts.treatment3'),
                   group=c('control', 'control', 'control', 'control',
'treatment', 'treatment', 'treatment', 'treatment'),
                   day=rep(0:3,2)
)
labels <- paste0(files$group, files$day)
dge <- readDGE(files=files, path='data/HTSeq/', labels=labels)
rownames(dge$counts) <- paste0('Gene', 1:nrow(dge$counts)) #Change gene
names to anonymize data
################################
# save(dge, file='objs/dge.Rdata')
#     SEE ATTACHED FILE     #
###############################

##  filter out the no_feature etc. rows
dge <- dge[1:(nrow(dge)-5), ]
##  This mitochondrial rRNA gene takes up a massive portion of my libraries
dge <- dge[!rownames(dge)%in%'Gene13515', ]
##  filter out lowly expressed genes
keep <- rowSums(cpm(dge) > 1) >= 3 ## gene has at least 3 columns where cpm
is > 1
dge <- dge[keep, ]
##  Recompute library sizes
dge$samples$lib.size <- colSums(dge$counts)
##  Normalize for lib size
dge <- calcNormFactors(dge)

## ComBat
mod <- model.matrix(~as.factor(group), data=dge$sample)
mod0 <- model.matrix(~1, data=dge$sample)
batch <- dge$sample$day

combat <- ComBat(dat=cpm(dge), batch=batch, mod=mod)
pval_combat = f.pvalue(combat, mod, mod0)
padj_combat = p.adjust(pval_combat, method="BH")
mean_control <- rowMeans(combat[, 1:4])
mean_treatment <- rowMeans(combat[, 5:8])
logFC <- log2(mean_treatment/mean_control)

res <- data.frame(mean_control, mean_treatment, logFC, pval=pval_combat,
padj=padj_combat)
res <- res[order(res$padj), ]

R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
LC_TIME=en_US.UTF-8
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] biomaRt_2.16.0 sva_3.6.0      mgcv_1.7-23    corpcor_1.6.6
edgeR_3.2.3    limma_3.16.4

loaded via a namespace (and not attached):
[1] grid_3.0.1      lattice_0.20-15 Matrix_1.0-12   nlme_3.1-109
RCurl_1.95-4.1  tools_3.0.1
[7] XML_3.96-1.1


More information about the Bioconductor mailing list