[BioC] Is ComBat introducing a signal where there is none?

Vegard Nygaard vegard.nygaard at medisin.uio.no
Thu Dec 12 19:41:15 CET 2013


Hi bioconductorlist,

I am trying to combine microarray data from different experiments(batches) where some of my batches fall completely in one condition (normal/cancer). For this I try to use ComBat in the sva package.
After playing around with my data and ComBat it seemed to me that ComBat might introduce more signal in my data than there really is.
The example given by ComBat-help (?ComBat) has a batch and condition setup much like my data, so I will use that to demonstrate.
I will substitute the real data with random data ensuring no batch effect or condition effect. Then I will do a quick find-differentially-expressed-genes analysis with limma.
Contradictory to what I would expect I end up with a list of genes.

# start example code:

# Starting off from ComBat-help (?ComBat)

library(sva)
library(bladderbatch)
data(bladderdata)
   
# Obtain phenotypic data
pheno = pData(bladderEset)
edata = exprs(bladderEset)
batch = pheno$batch
mod = model.matrix(~as.factor(cancer), data=pheno)

# using dummy data with no effects instead of real data
set.seed(100)
edata[,] = rnorm(length(edata))

# Correct for batch using ComBat
combat_edata = ComBat(dat=edata, batch=batch, mod=mod, par.prior=TRUE, prior.plots=FALSE)

# overview of experimental layout with regards to conditions and batches
# The "Biopsy" and "Normal" conditions seems to by most unbalanced.
table(pheno[,3:4])
	
# Using the combat corrected dummy data to search for diff genes between "Biopsy"and "Normal" with limma
library(limma)
contrast = c("Biopsy-Normal")	
fac=factor(pheno$cancer)	
design = model.matrix(~0 + fac)
colnames(design)=levels(fac)
cont.matrix = makeContrasts ( contrasts=contrast, levels=design)
	
# before ComBat
fit <- lmFit(edata, design)
fit2 = contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)	
tab = topTable(fit2, adjust='fdr', coef=contrast, number=999999)
# top ten genes
tab[1:10,]
# no differetially exressed geens found
print( paste("DUMMY DATA --- diff genes before batchnorm. Genes with FDR < 0.1: " , sum(tab$adj.P.Val<0.1), sep=""))
	
# ComBat adjusted data
fit <- lmFit(combat_edata, design)
fit2 = contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)	
tab = topTable(fit2, adjust='fdr', coef=contrast, number=999999)
# top ten genes	
tab[1:10,]
# Many genes within 0.1 FDR. Looks similar to many microarray results. I would have reported this as a finding.
print( paste("DUMMY DATA --- diff genes after batchnorm. Genes with FDR < 0.1: " , sum(tab$adj.P.Val<0.1), sep=""))

# end example code



I also tried (not shown) to use the real example data with permuted pheno data and run it through limma. The result was no diff-genes before ComBat and a lot after.
And with fewer genes and an even more unbalanced setup I managed to make the standard hierarchical cluster heatmap before and after Combat with the samples clustering perfect according to condition afterwards(all with random start data).

My interpretation of this is that starting with a unbalanced batch/condition setup like the one in the example (and in my data) and using ComBat will almost ensure that I end up with a list of significantly differentially expressed genes regardless of the real difference.
This seems very wrong and I suspect I am missing something important. Please help me.


Thanks for your help.

Vegard Nygaard
Bioinformatics Core Facility
Department of Tumor Biology (Montebello)
Institute for Cancer Research
Oslo University Hospital
Phone 0047 22781756

>sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] limma_3.18.5         bladderbatch_1.0.7   Biobase_2.22.0       BiocGenerics_0.8.0   sva_3.8.0            mgcv_1.7-26          nlme_3.1-111
[8] corpcor_1.6.6        BiocInstaller_1.12.0

loaded via a namespace (and not attached):
[1] grid_3.0.2      lattice_0.20-24 Matrix_1.0-14   tools_3.0.2



More information about the Bioconductor mailing list