[BioC] Low number of replicates DESeq

Federico Gaiti f.gaiti at uq.edu.au
Tue Feb 25 09:42:47 CET 2014


Hi Steve,

Please see my comments below 

Thanks
________________________________________

>Thanks for pasting the code now -- without the images or results from
>your output, it's hard to see what is reasonably correlated or not,
>but let's continue ...

>I see you're already doing PCA, and it would be interesting to see how
>your results compare to what you get by following along with the
>DESeq2 vignette and shooting your data through one of the variance
>stabilizing transforms (incidentally, the DESeq vignette also has a
>similar section, and it would have been worth while to browse that
>vignette and follow the advice outlined there as well)

This is exactly what I've done. I started with a simple correlation analysis and then ran my data first through DESeq and then through DESeq2.
You can find attached some PCA plots and heatmaps generated when I used the -s no option for the stranded data.
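(For reference, the simple correlation analysis was essentially a sample-to-sample correlation of the log-transformed counts, roughly along these lines; this is a sketch rather than the exact code I ran:)

logCounts <- log2(CountTable + 1)            # CountTable as read in below
corMat <- cor(logCounts, method="spearman")  # pairwise sample correlations
round(corMat, 2)
heatmap(corMat, symm=TRUE)                   # quick visual check of sample similarity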

In DESeq2:

CountTable=read.table("DGE.txt",header=TRUE,row.names=1)
samples<-data.frame(row.names=c("ADULT","ADULT1","ADULT2","ADULT3","JUV","JUV1","JUV2","JUV3","COMP","COMP1","COMP2","COMP3","PRECOMP","PRECOMP1","PRECOMP2","PRECOMP3"),condition=as.factor(c(rep("ADULT",4),rep("JUV",4),rep("COMP",4),rep("PRECOMP",4))))
dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=samples,design=~condition)
colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOMP","COMP","JUV","ADULT"))
dds<-DESeq(dds)
rld <- rlogTransformation(dds, blind=TRUE)
rlogMat <- assay(rld)
vsd <- varianceStabilizingTransformation(dds, blind=TRUE)
vstMat <- assay(vsd)
px     <- counts(dds)[,1] / sizeFactors(dds)[1]
ord    <- order(px)
ord    <- ord[px[ord] < 150]
ord    <- ord[seq(1, length(ord), length=50)]
last   <- ord[length(ord)]
vstcol <- c("blue", "black")
matplot(px[ord],cbind(assay(vsd)[, 1], log2(px))[ord, ],type="l", lty=1, col=vstcol, xlab="n", ylab="f(n)")
legend("bottomright",legend = c(expression("variance stabilizing transformation"),expression(log[2](n/s[1]))),fill=vstcol)
library("vsn")
par(mfrow=c(1,3))
notAllZero <- (rowSums(counts(dds))>0)
meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1),ylim = c(0,2.5))
meanSdPlot(assay(rld[notAllZero,]), ylim = c(0,2.5))
meanSdPlot(assay(vsd[notAllZero,]), ylim = c(0,2.5))
library("RColorBrewer")
library("gplots")
select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:100]
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol,Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10,6))
heatmap.2(assay(rld)[select,], col = hmcol,Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10, 6))
heatmap.2(assay(vsd)[select,], col = hmcol,Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10, 6))
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition))
heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
print(plotPCA(rld, intgroup=c("condition")))

In DESeq:

data<-read.table("Counts",header=TRUE,row.names=1)
Design=data.frame(row.names=colnames(data),condition=c('1','1','1','1','2','2','2','2','3','3','3','3','4','4','4','4'))
cds=newCountDataSet(data,Design)
cds=estimateSizeFactors(cds)
cdsBlind=estimateDispersions(cds,method='blind')
vsd=varianceStabilizingTransformation(cdsBlind)
library("RColorBrewer")
library("gplots")
select = order(rowMeans(counts(cdsBlind)), decreasing=TRUE)[1:30]
hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)
heatmap.2(exprs(vsd)[select,], col = hmcol, trace="none", margin=c(10, 6))
heatmap.2(counts(cdsBlind)[select,], col = hmcol, trace="none", margin=c(10,6))
dists = dist( t( exprs(vsd) ) )
mat = as.matrix( dists )
rownames(mat) = colnames(mat) = with(pData(cdsBlind), paste(condition))
heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
print(plotPCA(vsd, intgroup=c("condition")))

>Are you saying that if you use the stranded data but do the counting
>as if it were unstranded, it looks "just fine," but counting the
>stranded data as stranded data makes it look quite bad?

>Interesting ....I've actually never used stranded RNAseq data before,
>but that's a bit surprising ...

Exactly. Counting the stranded data as if it were unstranded looks much better than counting it as stranded data.
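(If it's useful, the quick check I would do to compare the two countings of the same stranded library is something like this; file names are just placeholders for the htseq-count outputs:)

stranded   <- read.table("ADULT_stranded_counts.txt",   row.names=1)   # htseq-count -s reverse (placeholder name)
unstranded <- read.table("ADULT_unstranded_counts.txt", row.names=1)   # htseq-count -s no (placeholder name)
common <- intersect(rownames(stranded), rownames(unstranded))
common <- common[!grepl("^__", common)]                                # drop htseq-count summary rows
plot(log2(unstranded[common,1]+1), log2(stranded[common,1]+1), pch=16, cex=0.3,
     xlab="log2(count+1), -s no", ylab="log2(count+1), -s reverse")
abline(0, 1, col="red")                                                # genes far below the diagonal lose counts when counted stranded
cor(unstranded[common,1], stranded[common,1], method="spearman")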

>What if you only consider a subset of genes that are in regions that
>do not have annotated genes on the opposite strands -- the data for
>these genes from the stranded and unstranded should be quite similar,
>no? How do these look?

I'll have a look, good idea!
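(A sketch of how I could pull those genes out, assuming a TxDb object -- here called txdb -- built from the same GTF used for counting, and gene IDs matching the rows of my count table:)

library(GenomicFeatures)
gns  <- genes(txdb)                                          # txdb: assumed TxDb from the counting annotation
hits <- findOverlaps(gns, gns, ignore.strand=TRUE)
hits <- hits[queryHits(hits) != subjectHits(hits)]           # drop self-overlaps
opp  <- as.logical(strand(gns)[queryHits(hits)] != strand(gns)[subjectHits(hits)])
withAntisense <- unique(queryHits(hits)[opp])                # genes overlapped by a gene on the opposite strand
cleanGenes <- setdiff(names(gns), names(gns)[withAntisense])
CountTableClean <- CountTable[rownames(CountTable) %in% cleanGenes, ]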

>But why? If you're just doing "vanilla DGE" and you have data that you
>feel like you can use (the unstranded data) in triplicate, and a
>source of singlicate data (the stranded one), why do you want to
>torture some DGE analysis by cramming the stranded data through it
>assuming <something> from the unstranded.

>If you really want to use the stranded data, I'd encode the "library
>type" as a covariate in the linear model while doing differential
>expression. Both the DESeq and DESeq2 vignettes have a section on
>multi-factorial designs where they incorporate data from single and
>paired-end runs to perform a DGE analysis -- the way you would combine
>unstranded and stranded counts for a gene would be rather similar.

I followed the multi-factorial design section in both the DESeq and DESeq2 vignettes and then checked the quality by sample clustering and visualization.

In DESeq:

CountTable=read.table("DGE.txt",header=TRUE,row.names=1)
head(CountTable)
Design=data.frame(row.names=colnames(CountTable),condition=c("ADULT","ADULT","ADULT","ADULT","JUV","JUV","JUV","JUV","COMP","COMP","COMP","COMP","PRECOMP","PRECOMP","PRECOMP","PRECOMP"),libType=c("stranded","unstranded","unstranded","unstranded","stranded","unstranded","unstranded","unstranded","stranded","unstranded","unstranded","unstranded","stranded","unstranded","unstranded","unstranded"))
cdsFull=newCountDataSet(CountTable,Design)
head(cdsFull)
cdsFull=estimateSizeFactors(cdsFull)
sizeFactors(cdsFull)
cdsFull=estimateDispersions(cdsFull)
plotDispEsts(cdsFull)
fit1=fitNbinomGLMs(cdsFull,count ~ libType + condition)
str(fit1)
fit0=fitNbinomGLMs(cdsFull,count ~ libType )
pvalsGLM = nbinomGLMTest( fit1, fit0 )
padjGLM = p.adjust( pvalsGLM, method="BH" )
head(fit1)
cdsFullB=newCountDataSet(CountTable,Design$condition)
cdsFullB=estimateSizeFactors(cdsFullB)
cdsFullB=estimateDispersions(cdsFullB)
rs = rowSums ( counts ( cdsFull ))
theta = 0.4
use = (rs > quantile(rs, probs=theta))
table(use)
cdsFilt = cdsFull[ use, ]
fitFilt1  = fitNbinomGLMs( cdsFilt, count ~ libType + condition )
fitFilt0  = fitNbinomGLMs( cdsFilt, count ~ libType  )
pvalsFilt = nbinomGLMTest( fitFilt1, fitFilt0 )
padjFilt  = p.adjust(pvalsFilt, method="BH" )
padjFiltForComparison = rep(+Inf, length(padjGLM))
padjFiltForComparison[use] = padjFilt
tab3 = table( `no filtering`   = padjGLM < .05,`with filtering` = padjFiltForComparison < .05 )
addmargins(tab3)
res=cbind(fit1,pvalsGLM=pvalsGLM,padjGLM=padjGLM)
resSig=res[which(res$padjGLM<0.05),]
table(res$padjGLM<0.05)
write.csv(res,file='res_Multifactor_DESeq.csv')
hist(res$pvalsGLM,breaks=100)
resFilt=cbind(fitFilt1,pvalsFilt=pvalsFilt,padjFilt=padjFilt)
resSigFilt=resFilt[which(resFilt$padjFilt<0.05),]
table(resFilt$padjFilt<0.05)
write.csv(resFilt,file='resFilt_Multifactor_DESeq.csv')
plot(rank(rs)/length(rs), -log10(pvalsGLM), pch=16, cex=0.45)
h1 = hist(pvalsGLM[!use], breaks=50, plot=FALSE)
h2 = hist(pvalsGLM[use], breaks=50, plot=FALSE)
colori = c(`do not pass`="khaki", `pass`="powderblue")
barplot(height = rbind(h1$counts, h2$counts), beside = FALSE, col = colori,space = 0, main = "", ylab="frequency")
text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)), adj = c(0.5,1.7), xpd=NA)
legend("topright", fill=rev(colori), legend=rev(names(colori)))
cdsFullBlind = estimateDispersions( cdsFull, method = "blind" )
vsdFull = varianceStabilizingTransformation( cdsFullBlind )
select = order(rowMeans(counts(cdsFull)), decreasing=TRUE)[1:100]
hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)
heatmap.2(exprs(vsdFull)[select,], col = hmcol, trace="none", margin=c(10, 6))
heatmap.2(counts(cdsFull)[select,], col = hmcol, trace="none", margin=c(10,6))
dists = dist( t( exprs(vsdFull) ) )
mat = as.matrix( dists )
rownames(mat) = colnames(mat) = with(pData(cdsFullBlind), paste(condition, libType, sep=" : "))
heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
print(plotPCA(vsdFull, intgroup=c("condition", "libType")))

In DESeq2:

CountTable=read.table("DGE.txt",header=TRUE,row.names=1)
head(CountTable)
Design=data.frame(row.names=colnames(CountTable),condition=c("ADULT","ADULT","ADULT","ADULT","JUV","JUV","JUV","JUV","COMP","COMP","COMP","COMP","PRECOMP","PRECOMP","PRECOMP","PRECOMP"),libType=c("stranded","unstranded","unstranded","unstranded","stranded","unstranded","unstranded","unstranded","stranded","unstranded","unstranded","unstranded","stranded","unstranded","unstranded","unstranded"))
dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design=~condition + libType)
colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOMP","COMP","JUV","ADULT"))
design(dds)<-formula(~libType + condition )
dds<-DESeq(dds)
rld <- rlogTransformation(dds, blind=TRUE)
vsd <- varianceStabilizingTransformation(dds, blind=TRUE)
vstMat <- assay(vsd)
rlogMat <- assay(rld)
px     <- counts(dds)[,1] / sizeFactors(dds)[1]
ord    <- order(px)
ord    <- ord[px[ord] < 150]
ord    <- ord[seq(1, length(ord), length=50)]
last   <- ord[length(ord)]
vstcol <- c("blue", "black")
matplot(px[ord], cbind(assay(vsd)[, 1], log2(px))[ord, ], type="l", lty=1, col=vstcol, xlab="n", ylab="f(n)")
legend("bottomright", legend = c(expression("variance stabilizing transformation"), expression(log[2](n/s[1]))),fill=vstcol)
library("vsn")
par(mfrow=c(1,3))
notAllZero <- (rowSums(counts(dds))>0)
meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1), ylim = c(0,2.5))
meanSdPlot(assay(rld[notAllZero,]), ylim = c(0,2.5))
meanSdPlot(assay(vsd[notAllZero,]), ylim = c(0,2.5))
select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:100]
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10,6))
heatmap.2(assay(rld)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10, 6))
heatmap.2(assay(vsd)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10, 6))
select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:30]
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10,6))
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition, libType, sep=" : "))
heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
print(plotPCA(rld, intgroup=c("condition", "libType")))
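(For the actual DE test in DESeq2, I assume the equivalent of the fit1/fit0 likelihood-ratio comparison above is something like this -- sketch only:)

ddsLRT <- DESeq(dds, test="LRT", full=~libType+condition, reduced=~libType)  # test condition while controlling for libType
resLRT <- results(ddsLRT)
table(resLRT$padj < 0.05)
write.csv(as.data.frame(resLRT), file="res_Multifactor_DESeq2_LRT.csv")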

See attached the plots I obtained (*_MULTIFACTORIAL.jpeg)

Let me know what you think

>To be honest, if I had a set of data that had three replicates per
>condition, and one set of data that had "singlicate" samples in a
>condition which came from an apparently wildly different protocol, I'd
>probably just use the triplicate data and forget about the outlier
>data if I'm just doing "vanilla DGE".

>If the stranded data is giving you wildly different results, the
>simplest explanation might be that something may have gotten screwed
>up in the library prep -- but you have no way to tell since you have
>no replicates of those data.

>As I said, you can include it in your linear model by using a
>multi-factorial design if you *really* want to use it, but why bother
>for a vanilla DGE analysis?

>On the other hand, if you're using the stranded data to do something
>more than just vanilla DGE -- perhaps you're interested in quantifying
>anti-sense transcription, then I'd try to do a lot more exploratory
>analysis to see how I could explain large difference in quantitation
>that I'm seeing. I'd still be really pissed that I had no replicates,
>but it wouldn't be the first time a graduate student is stuck with an
>underpowered dataset that they are tasked to torture, and it sadly
>won't be the last ... still, in this situation, I'd go the
>multi-factorial design approach to combine both datasets for any
>analysis I'm asked to do. In the meantime, I'd also complain to my
>advisor/collaborator whatever to twist their arm enough to run
>replicates for the future.

You got the situation quite right.
I agree with you: I wouldn't bother using the stranded data if I were doing vanilla DGE, but I am investigating long non-coding RNAs, so I need to quantify anti-sense transcription. That is what led me to include the unstranded data in the analysis: to increase the number of replicates and test for DE, so that I end up with a set of lncRNAs I can take forward into downstream analysis.
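For the anti-sense quantification itself, I am assuming something along these lines on the stranded BAMs (sketch only; txdb and the BAM name are placeholders, and whether the read strand needs flipping depends on the library protocol):

library(GenomicAlignments)
gns     <- genes(txdb)                          # txdb: assumed TxDb, as above
antiGns <- invertStrand(gns)                    # same ranges, opposite strand
bams    <- BamFileList("ADULT_stranded.bam")    # placeholder file name
seAnti  <- summarizeOverlaps(antiGns, bams, mode="Union",
                             singleEnd=TRUE, ignore.strand=FALSE)
# for dUTP-type protocols the read strand is inverted relative to the transcript;
# in that case preprocess.reads=invertStrand can be passed to summarizeOverlaps
head(assay(seAnti))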

Let me know if you have any further ideas

Thanks a lot for your help

Federico




