[BioC] DiffBind - overlap between different peak callers

Paolo Kunderfranco paolo.kunderfranco at gmail.com
Thu Aug 9 09:34:26 CEST 2012


Dear Rory,
Thanks again for Your input, I went through and finally obtained
consensus peakset  identified by both peak callers for
all the samples:



rm(list=ls())
library("DiffBind")
setwd("/home/pkunderf/output/DiffBind/ES-CMN-CMA/")

test=dba(sampleSheet="database.csv")
#clustering based on occupancy/peak calling
pdf('clustering based on occupancy_peak calling.pdf')
dba.plotHeatmap(test)
dba.plotPCA(test)
dev.off()

par(mfrow=c(1,3))

pdf('VENN_H3K27me3.pdf')
dba.plotVenn(test, test$masks$ES  &
test$masks$H3K27,label1='mES_m',label2='mES_s')
dba.plotVenn(test, test$masks$CMN &
test$masks$H3K27,label1='CMp_m',label2='CMp_s')
dba.plotVenn(test, test$masks$CMA &
test$masks$H3K27,label1='CMa_m',label2='CMa_s')
dev.off()
pdf('VENN_H3K4me3.pdf')
dba.plotVenn(test, test$masks$ES  &
test$masks$H3K4,label1='mES_m',label2='mES_s')
dba.plotVenn(test, test$masks$CMN &
test$masks$H3K4,label1='CMp_m',label2='CMp_s')
dba.plotVenn(test, test$masks$CMA &
test$masks$H3K4,label1='CMa_m',label2='CMa_s')
dev.off()
pdf('VENN_H3K9me3.pdf')
dba.plotVenn(test, test$masks$ES  &
test$masks$H3K9,label1='mES_m',label2='mES_s')
dba.plotVenn(test, test$masks$CMN &
test$masks$H3K9,label1='CMp_m',label2='CMp_s')
dba.plotVenn(test, test$masks$CMA &
test$masks$H3K9,label1='CMa_m',label2='CMa_s')
dev.off()
pdf('VENN_H3K79me2.pdf')
dba.plotVenn(test, test$masks$ES  &
test$masks$H3K79,label1='mES_m',label2='mES_s')
dba.plotVenn(test, test$masks$CMN &
test$masks$H3K79,label1='CMp_m',label2='CMp_s')
dba.plotVenn(test, test$masks$CMA &
test$masks$H3K79,label1='CMa_m',label2='CMa_s')
dev.off()
test = dba.peakset(test, test$masks$ES  & test$masks$H3K27,
minOverlap=2, sampID="mES_H3K27me3")
test = dba.peakset(test, test$masks$CMN & test$masks$H3K27,
minOverlap=2, sampID="CMp_H3K27me3")
test = dba.peakset(test, test$masks$CMA & test$masks$H3K27,
minOverlap=2, sampID="CMa_H3K27me3")
pdf('Overlap_H3K27me3.pdf')
dba.plotVenn(test, test$masks$H3K27 &
test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
dev.off()
test = dba.peakset(test, test$masks$ES  & test$masks$H3K9,
minOverlap=2, sampID="mES_H3K9me3")
test = dba.peakset(test, test$masks$CMN & test$masks$H3K9,
minOverlap=2, sampID="CMp_H3K9me3")
test = dba.peakset(test, test$masks$CMA & test$masks$H3K9,
minOverlap=2, sampID="CMa_H3K9me3")
pdf('Overlap_H3K9me3.pdf')
dba.plotVenn(test, test$masks$H3K9 &
test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
dev.off()
test = dba.peakset(test, test$masks$ES  & test$masks$H3K79,
minOverlap=2, sampID="mES_H3K79me2")
test = dba.peakset(test, test$masks$CMN & test$masks$H3K79,
minOverlap=2, sampID="CMp_H3K79me2")
test = dba.peakset(test, test$masks$CMA & test$masks$H3K79,
minOverlap=2, sampID="CMa_H3K79me2")
pdf('Overlap_H3K79me2.pdf')
dba.plotVenn(test, test$masks$H3K79 &
test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
dev.off()
test = dba.peakset(test, test$masks$ES  & test$masks$H3K4,
minOverlap=2, sampID="mES_H3K4me3")
test = dba.peakset(test, test$masks$CMN & test$masks$H3K4,
minOverlap=2, sampID="CMp_H3K4me3")
test = dba.peakset(test, test$masks$CMA & test$masks$H3K4,
minOverlap=2, sampID="CMa_H3K4me3")
pdf('Overlap_H3K4me3.pdf')
dba.plotVenn(test, test$masks$H3K4 &
test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
dev.off()

consdata = dba(test, mask = test$masks$Consensus)
> consdata
12 Samples, 9754 sites in matrix (16143 total):
             ID Tissue Factor    Condition Replicate Intervals
1  mES_H3K27me3     ES  H3K27 mES_H3K27me3       1-2      1333
2  CMp_H3K27me3    CMN  H3K27 CMp_H3K27me3       1-2       911
3  CMa_H3K27me3    CMA  H3K27 CMa_H3K27me3       1-2       316
4   mES_H3K9me3     ES   H3K9  mES_H3K9me3       1-2        89
5   CMp_H3K9me3    CMN   H3K9  CMp_H3K9me3       1-2        86
6   CMa_H3K9me3    CMA   H3K9  CMa_H3K9me3       1-2       323
7  mES_H3K79me2     ES  H3K79 mES_H3K79me2       1-2      8590
8  CMp_H3K79me2    CMN  H3K79 CMp_H3K79me2       1-2      4641
9  CMa_H3K79me2    CMA  H3K79 CMa_H3K79me2       1-2      1084
10  mES_H3K4me3     ES   H3K4  mES_H3K4me3       1-2     10018
11  CMp_H3K4me3    CMN   H3K4  CMp_H3K4me3       1-2      8748
12  CMa_H3K4me3    CMA   H3K4  CMa_H3K4me3       1-2      2998

consdata = dba.count(consdata, minOverlap=1)

After I call dba.count I am prompted out R:

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::compare
Abort

Where am I wrong?
Cheers,
Paolo






2012/8/8 Rory Stark <Rory.Stark at cancer.org.uk>:
> Hello Paolo-
>
> To examine the overlaps between peak callers, you need to be working with
> the peak (occupancy) data BEFORE you settle on a global peakset used for
> counting and subsequent analysis. In your example below, you have this in
> "test" after calling "dba" but before calling "dba.count".
>
> So after
>
>> test=dba(sampleSheet="database.csv")
>
> you can check clustering based on occupancy/peak calling:
>
>> dba,plotHeatmap(test)
>> dba.plotPCA(test)
>
>
> You can also plot Venn diagrams of each of the peak caller overlaps -- eg
> for the H3K27me3 mark, you could see the three sample overlaps as follows:
>
>> par(mfrow=c(1,3))
>> dba.plotVenn(test, test$masks$ES  & test$masks$H3K27)
>> dba.plotVenn(test, test$masks$CMN & test$masks$H3K27)
>
>> dba.plotVenn(test, test$masks$CMA & test$masks$H3K27)
>
>
> Note that specifying minOverlap=2 to dba.count doesn't limit the overlap
> to replicate peak callers. Rather, it creates a single set of peaks that
> are identified at least twice amongst all the peaksets: either by both
> peak callers for a single sample, or by the same peak caller in different
> samples,  by different peak callers in different samples. If you want to
> use only peaks called by both peak callers for each sample,you can add a
> separate consensus peakset for eac condition, eg:
>
>> test = dba.peakset(test, test$masks$ES  & test$masks$H3K27,
>>minOvrlap=2, ID="mES_H3K27me3")
>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K27,
>>minOverlap=2, ID="CMp_H3K27e3")
>
>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K27,
>>minOverlp=2, ID="CMa_H3K27me3")
>
>
> repeated 9 more times for a total of 12 pairs. You could if you like see
> the peak overlaps of the three H3K27me3 samples at this point:
>
>> dba.plotVenn(test, testmasks$H3K27 & test$masks$Consensus)
>
> Once you have added all 12 consensus peaksets, create a new DBA object
> containing only the consensus peaksets:
>
>> consdata = ba(test, mask = test$masks$Consensus)
>
> Then, if you want to use all the peaks identified by both peak callers for
> all the samples:
>
>> consdata = dba.count(consdata, minOverlap=1)
>
> And continue with the differential analysis as you had done.
>
> Cheers-
> Rory
>
>>
>>Dear All,
>>I am using package DiffBind,
>>Due to the fact that I miss real biological or technical replicates, I
>>am assesing the overlapping rates between different peak callers.
>>Here is how I set up my database to read in:
>>
>>3 Cell lines, 4 Factor, 2 Replicates, for a total of 12 Conditions
>>
>>> rm(list=ls())
>>> setwd("/home/pkunderf/output/DiffBind/ES-CMN-CMA/")
>>> test=dba(sampleSheet="database.csv")
>>> test
>>24 Samples, 19380 sites in matrix (40809 total):
>>                  ID Tissue Factor    Condition Replicate Intervals
>>1  mES_H3K27me3_m     ES  H3K27 mES_H3K27me3         1      1834
>>2  CMp_H3K27me3_m    CMN  H3K27 CMp_H3K27me3         1      2450
>>3  CMa_H3K27me3_m    CMA  H3K27 CMa_H3K27me3         1       990
>>4  mES_H3K27me3_s     ES  H3K27 mES_H3K27me3         2      2188
>>5  CMp_H3K27me3_s    CN  H3K27 CMp_H3K27me3         2      3388
>>6  CMa_H3K27me3_s    CMA  H3K27 CMa_H3K27me3         2      5946
>>7   mES_H3K4me3_m     ES   H3K4  mES_H3K4me3         1     10243
>>8   CMp_H3K4me3_m    CMN   H3K4  CMp_H3K4me3         1     12476
>>9   CMa_H3K4me3_m    CMA   H3K4  CMa_H3K4me3         1      5632
>>10   mES_H3K4m3_s     ES   H3K4  mES_H3K4me3         2     14917
>>11  CMp_H3K4me3_s    CMN   H3K4  CMp_H3K4me3         2     10646
>>12  CMa_H3K4me3_s    CMA   H3K4  CMa_H3K4me3         2      5985
>>13  mES_H3K9me3_m     ES   H3K9  mES_H3K9me3         1       110
>>14  CMp_H3K9me3_m    CMN   H3K9  CMp_H3K9me3         1       484
>>15  CMa_H3K9me3_m    CMA   H3K9  CMa_H3K9me3         1       938
>>16  mES_H3K9me3_s     ES   H3K9  mES_H3K9me3         2       569
>>17  CMp_H3K9me3_s    CMN   H3K9  CMp_H3K9me3         2       808
>>18  CMa_H3K9me3_s    CMA   H3K9  CMa_H3K9me3         2      3747
>>19 mES_H3K79me2_m     ES  H3K79 mES_H3K79me2         1     21318
>>20 CMp_H3K79me2_m    CMN  H3K79 CMp_H3K79me2         1     13210
>>21 CMa_H3K79me2_m    CMA  H3K79 CMa_H3K79me2         1      2988
>>22 mES_H3K79me2_s     ES  H3K79 mES_H3K79me2         2     22164
>>23 CMp_H3K79me2_s    CMN  H3K79 CMp_H3K79me2         2     13429
>>24 CMa_H3K79me2_s    CMA  H3K79 CMa_H3K79me2         2      6063
>>
>>
>>> test=dba.count(test)
>>
>>> test=dba.contrast(test, categories=DBA_CONDITION,minMembers=2)
>>#minMemebers was se to 2 due to
>>the fact I have 2 replicates for each condition
>>#....78 Contrasts:
>>
>>> test = dba.analyze(test)
>>
>>#for each condition generate a report
>>>test.DB1=dba.report(test,contrast=1)
>>
>>
>>I would like now to assess overlapping rates between replicates, and
>>generate a venn diagram.
>>but it seems like replicates are merged together and I cannot treat
>>them as single samples.
>>I noticed this behaviour also when i try to generate both a PCA plot
>>and a heatmap plot, replicates are merged, so
>>what's the utility to display both in the PCA or heatmap plot?
>>would it be simple to display the original replicates to assess how do
>>they cluster?
>>
>>>pdf('PCA_plot_DBA_CONDITION.pdf')
>>>dba.plotPCA(test,attributes=DBA_CONDITION,
>>>vColors=c("grey0","grey23","grey47","antiquewhite4","antiquewhite3","anti
>>>quewhite1","dodgerblue4","dodgerblue","darkslategray1","brown4","brown","
>>>coral"))
>>>dev.off()
>>
>>>pdf('HeatMap.pdf')
>>>corvals=dba.plotHeatmap(test)
>>>dev.off()
>>
>>>pdf('HeatMap_RPKM_FOLD.pdf')
>>>corvals=dba.plotHeatmap(test,score=DBA_SCORE_RPKM_FOLD)
>>>dev.off()
>>
>>> sessionInfo()
>>R version 2.15.1 (2012-06-22)
>>Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>>locale:
>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C                 LC_NAME=C
>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>[11] C_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>>attached base packages:
>>[1] parallel  stats     graphics  grDevices utils     datasets  methods
>>[8] base
>>
>>other attached packages:
>>[1] DiffBind_1.2.0      Biobase_2.16.0      GenomicRanges_1.8.9
>>[4] IRanges_1.14.4      BiocGenerics_0.2.0
>>
>>loaded via a namespace (and not attached):
>> [1] amap_0.8-7         edgeR_2.6.10       gdata_2.11.0
>>gplots_2.11.0
>> [5] gtools_2.7.0       limma_3.12.1       RColorBrewer_1.0-5
>>stats4_2.15.1
>> [9] tools_2.15.1       zlibbioc_1.2.0
>>
>>
>>
>>Any sugestion is appreciated,
>>cheers,
>>paolo
>>
>>



More information about the Bioconductor mailing list