[BioC] Diffbind: Binding Affinity Heatmap

Rory Stark Rory.Stark at cruk.cam.ac.uk
Thu Aug 15 14:56:12 CEST 2013

Hi Giuseppe-

Two compare different peak callers on the same replicate, you can get the
clustering/correlation at the peak level but it doesn't make sense at the
count level, as all the peaks are merged into a single consensus set at
that point.

You did this correctly in the first case by including a line for each peak
caller with the same read (bam) files. At that point you can get a
correlation heatmap, PCA plot, etc, as well as look at overlaps (e.g. by
using dba.plotVenn and/or dba.overlap).

One you create a binding matrix, as it done when you run dba.count, you
are using a single "consensus" set of peaks for all the samples, and
getting the number of reads in these peaks for each sample. So it no
longer makes sense to have a different sets of counts for each original
peakset. This is a result of the peaks being "merged" (by default, all the
peaks that appear in at least two peaksets are merged into a single set of
peaks for the rest of the analysis).

If try what you suggest, and use symbolic links, you should get exactly
the same result for each virtual replicate -- that is, the three entries
should have correlation values of 1.0, as the same reads are being counted
within the same (global, merged) consensus peakset.


On 15/08/2013 11:57, "Giuseppe Gallone" <giuseppe.gallone at dpag.ox.ac.uk>

>hm looks like I found the reason. DiffBind wants the bam files named by
>replicate, even if they're the same file. 3 symbolic links per bam did
>it. Still I wonder if this kind of analysis has any sense in your opinion.
>On 08/15/13 11:48, Giuseppe Gallone wrote:
>> Hi Rory
>> thanks for your explanation, it makes sense. I have another question if
>> you don't mind. I have a trio of samples. I don't have replicates so I
>> wanted to check the agreement of 3 different peak callers, using each
>> peak caller as a replicate. so my sampleSheet looks a bit like this
>> The samples are loaded ok with dba, and I do see some clustering by
>> replicate via a simple plot. However, I have problems with the dba.count
>> call. Basically, the correlation plot produced by dba.count only shows
>> three datapoints, corresponding to the 1st replicate (=peak caller n.1)
>> for each of the three samples. Why is that? Am I doing something wrong.
>> The only difference with your example is that, in your examples, the
>> files for the aligned reads are different - one for each replicate. Am I
>> "cheating" in using the same reads.bam file for each "replicate"?
>> Giuseppe
>> On 08/14/13 14:38, Rory Stark wrote:
>>> Hi Giuseppe-
>>> The standard heatmap plots the read densities of the differentially
>>> sites. The x-axis clusters the samples, and the y-axis clusters the
>>> based on the score for each site in each sample. The major clusters
>>> should
>>> show similar patterns of binding levels.
>>> In the example plot you sent, there are roughly three main clusters of
>>> differentially bound sites. The bottom cluster has higher binding
>>> rates in
>>> the first (red) sample group (group one gain/group two loss). The
>>> cluster includes sites with higher binding rates in the second sample
>>> group (group one loss/group two gain). The top cluster includes sites
>>> with
>>> substantial binding in both groups, but that nonetheless exhibit a
>>> significant change in binding intensity at these sites; it looks in
>>> general like these go from moderate binding in the first group to very
>>> strong binding in the second group (especially in the sample cluster on
>>> the far right).
>>> You can get a bit more contrast in these plots by using the "maxval"
>>> parameter to clip the highly-boud sites (the long tail to the right in
>>> the
>>> Color Key). For example, in this case setting maxval=6 could probably
>>> give
>>> a clearer picture of what patterns are driving the clustering of
>>> sites.
>>> Cheers-
>>> Rory
>>> On 14/08/2013 11:37, "Giuseppe Gallone"
>>><giuseppe.gallone at dpag.ox.ac.uk>
>>> wrote:
>>>> Hi Rory
>>>> I have a further question about DiffBind. Could you tell me something
>>>> more about the clustering visualisation obtained with
>>>> dba.plotHeatmap(....correlations=FALSE)? I've carried out a
>>>> analysis on my samples and I observe some interesting clustering using
>>>> both EDGER and DESEQ. I then plotted the heatmap using
>>>> correlation=FALSE.
>>>> I understand that the clustering obtained with dba.visualise is
>>>> reproduced  on the x axis (columns are grouped by clustering).
>>>> What is shown instead on the y axis? Are these the individual
>>>> differentially bound sites across the genome? What is the clustering
>>>> described on the left?
>>>> Thanks once again.
>>>> Best
>>>> Giuseppe
>>>> On 07/23/13 18:22, Rory Stark wrote:
>>>>> Hi Giuseppe-
>>>>> I'm glad to sorted the column thing out, that was what I suspected.
>>>>> There shouldn't be much problem doing the analysis without a control
>>>>> track, particularly if the samples come from the same tissue. The
>>>>> role of the control tracks is for peak calling. The reason the
>>>>> track is less important for differential analysis is that youy are
>>>>> looking
>>>>> at the relative differences in read density at the same genomic
>>>>> intervals
>>>>> across samples, and not comparing read densities across intervals.
>>>>> So if
>>>>> the control track were similar at that location for all samples, it
>>>>> will
>>>>> not affect the differential analysis. The main issue would be if
>>>>> were something like big copy number differences between samples. Then
>>>>> you
>>>>> could get sites that show as differentially bound when the real
>>>>> difference
>>>>> was the copy number. But the difference would be real regardless.
>>>>> Regarding sequencing depth, this should be taken care of by the
>>>>> normalisation step. It takes the library size (either full library
>>>>> size,
>>>>> which is the total number of reads, or the default effective library
>>>>> size,
>>>>> the number of reads within peaks for each sample) and adjusts the
>>>>> counts. You can can an idea of how this is working by using the
>>>>> dba.plotBox (with bAll=TRUE) comparing bNormalized=TRUE and
>>>>> bNormalized=FALSE to see if things even out. Also, after counting,
>>>>> can
>>>>> look at the clustering (dba.plotPCA and dba.plotHeatmap) to see if
>>>>> samples
>>>>> are grouping by sequencing depth -- try doing the same plots with
>>>>> different score, eg score=DBA_SCORE_READS, score=DBA_SCORE_RPKM, and
>>>>> see which gives to the best clustering.
>>>>> Hope this helps!
>>>>> Cheers-
>>>>> Rory
>Dr Giuseppe Gallone
>MRC career development fellow
>MRC Functional Genomics Unit - DPAG
>University of Oxford, UK

More information about the Bioconductor mailing list