[BioC] Diffbind: Binding Affinity Heatmap

Thu Aug 15 14:56:12 CEST 2013

Hi Giuseppe-

Two compare different peak callers on the same replicate, you can get the
clustering/correlation at the peak level but it doesn't make sense at the
count level, as all the peaks are merged into a single consensus set at
that point.

You did this correctly in the first case by including a line for each peak
caller with the same read (bam) files. At that point you can get a
correlation heatmap, PCA plot, etc, as well as look at overlaps (e.g. by
using dba.plotVenn and/or dba.overlap).

One you create a binding matrix, as it done when you run dba.count, you
are using a single "consensus" set of peaks for all the samples, and
getting the number of reads in these peaks for each sample. So it no
longer makes sense to have a different sets of counts for each original
peakset. This is a result of the peaks being "merged" (by default, all the
peaks that appear in at least two peaksets are merged into a single set of
peaks for the rest of the analysis).

If try what you suggest, and use symbolic links, you should get exactly
the same result for each virtual replicate -- that is, the three entries
should have correlation values of 1.0, as the same reads are being counted
within the same (global, merged) consensus peakset.

Cheers-
Rory

On 15/08/2013 11:57, "Giuseppe Gallone" <giuseppe.gallone at dpag.ox.ac.uk>
wrote:

>hm looks like I found the reason. DiffBind wants the bam files named by
>replicate, even if they're the same file. 3 symbolic links per bam did
>it. Still I wonder if this kind of analysis has any sense in your opinion.
>
>Thanks
>Giuseppe
>
>On 08/15/13 11:48, Giuseppe Gallone wrote:
>> Hi Rory
>>
>>
>> thanks for your explanation, it makes sense. I have another question if
>> you don't mind. I have a trio of samples. I don't have replicates so I
>> wanted to check the agreement of 3 different peak callers, using each
>> peak caller as a replicate. so my sampleSheet looks a bit like this
>>
>>
>> 
>>SampleID,Tissue,Factor,Condition,Treatment,Replicate,bamReads,bamControl,
>>Peaks,PeakFormat,ScoreCol,LowerBetter
>>
>> 
>>ID1.1,ID1,TF,mother,stimulated,1,ID1_reads.bam,,PeakCaller1/ID1_peaks.bed
>>,raw,5,F
>>
>> 
>>ID1.2,ID1,TF,mother,stimulated,2,ID1_reads.bam,,PeakCaller2/ID1_peaks.bed
>>,raw,5,F
>>
>> 
>>ID1.3,ID1,TF,mother,stimulated,3,ID1_reads.bam,,PeakCaller3/ID1_peaks.bed
>>,raw,5,F
>>
>> 
>>ID2.1,ID2,TF,father,stimulated,1,ID2_reads.bam,,PeakCaller1/ID2_peaks.bed
>>,raw,5,F
>>
>> 
>>ID2.2,ID2,TF,father,stimulated,2,ID2_reads.bam,,PeakCaller2/ID2_peaks.bed
>>,raw,5,F
>>
>> 
>>ID2.3,ID2,TF,father,stimulated,3,ID2_reads.bam,,PeakCaller3/ID2_peaks.bed
>>,raw,5,F
>>
>> 
>>ID3.1,ID3,TF,child,stimulated,1,ID3_reads.bam,,PeakCaller1/ID3_peaks.bed,
>>raw,5,F
>>
>> 
>>ID3.2,ID3,TF,child,stimulated,2,ID3_reads.bam,,PeakCaller2/ID3_peaks.bed,
>>raw,5,F
>>
>> 
>>ID3.3,ID3,TF,child,stimulated,3,ID3_reads.bam,,PeakCaller3/ID3_peaks.bed,
>>raw,5,F
>>
>>
>>
>> The samples are loaded ok with dba, and I do see some clustering by
>> replicate via a simple plot. However, I have problems with the dba.count
>> call. Basically, the correlation plot produced by dba.count only shows
>> three datapoints, corresponding to the 1st replicate (=peak caller n.1)
>> for each of the three samples. Why is that? Am I doing something wrong.
>> The only difference with your example is that, in your examples, the
>> files for the aligned reads are different - one for each replicate. Am I
>> "cheating" in using the same reads.bam file for each "replicate"?
>>Thanks!
>>
>> Giuseppe
>>
>>
>>
>> On 08/14/13 14:38, Rory Stark wrote:
>>> Hi Giuseppe-
>>>
>>> The standard heatmap plots the read densities of the differentially
>>>bound
>>> sites. The x-axis clusters the samples, and the y-axis clusters the
>>>sites
>>> based on the score for each site in each sample. The major clusters
>>> should
>>> show similar patterns of binding levels.
>>>
>>> In the example plot you sent, there are roughly three main clusters of
>>> differentially bound sites. The bottom cluster has higher binding
>>> rates in
>>> the first (red) sample group (group one gain/group two loss). The
>>>middle
>>> cluster includes sites with higher binding rates in the second sample
>>> group (group one loss/group two gain). The top cluster includes sites
>>> with
>>> substantial binding in both groups, but that nonetheless exhibit a
>>> significant change in binding intensity at these sites; it looks in
>>> general like these go from moderate binding in the first group to very
>>> strong binding in the second group (especially in the sample cluster on
>>> the far right).
>>>
>>> You can get a bit more contrast in these plots by using the "maxval"
>>> parameter to clip the highly-boud sites (the long tail to the right in
>>> the
>>> Color Key). For example, in this case setting maxval=6 could probably
>>> give
>>> a clearer picture of what patterns are driving the clustering of
>>>binding
>>> sites.
>>>
>>> Cheers-
>>> Rory
>>>
>>> On 14/08/2013 11:37, "Giuseppe Gallone"
>>><giuseppe.gallone at dpag.ox.ac.uk>
>>> wrote:
>>>
>>>> Hi Rory
>>>>
>>>> I have a further question about DiffBind. Could you tell me something
>>>> more about the clustering visualisation obtained with
>>>> dba.plotHeatmap(....correlations=FALSE)? I've carried out a
>>>>differential
>>>> analysis on my samples and I observe some interesting clustering using
>>>> both EDGER and DESEQ. I then plotted the heatmap using
>>>> correlation=FALSE.
>>>>
>>>> I understand that the clustering obtained with dba.visualise is
>>>> reproduced  on the x axis (columns are grouped by clustering).
>>>>
>>>> What is shown instead on the y axis? Are these the individual
>>>> differentially bound sites across the genome? What is the clustering
>>>> described on the left?
>>>>
>>>> Thanks once again.
>>>>
>>>> Best
>>>> Giuseppe
>>>>
>>>> On 07/23/13 18:22, Rory Stark wrote:
>>>>> Hi Giuseppe-
>>>>>
>>>>> I'm glad to sorted the column thing out, that was what I suspected.
>>>>>
>>>>> There shouldn't be much problem doing the analysis without a control
>>>>> track, particularly if the samples come from the same tissue. The
>>>>>main
>>>>> role of the control tracks is for peak calling. The reason the
>>>>>control
>>>>> track is less important for differential analysis is that youy are
>>>>> looking
>>>>> at the relative differences in read density at the same genomic
>>>>> intervals
>>>>> across samples, and not comparing read densities across intervals.
>>>>> So if
>>>>> the control track were similar at that location for all samples, it
>>>>> will
>>>>> not affect the differential analysis. The main issue would be if
>>>>>there
>>>>> were something like big copy number differences between samples. Then
>>>>> you
>>>>> could get sites that show as differentially bound when the real
>>>>> difference
>>>>> was the copy number. But the difference would be real regardless.
>>>>>
>>>>> Regarding sequencing depth, this should be taken care of by the
>>>>> normalisation step. It takes the library size (either full library
>>>>> size,
>>>>> which is the total number of reads, or the default effective library
>>>>> size,
>>>>> the number of reads within peaks for each sample) and adjusts the
>>>>>read
>>>>> counts. You can can an idea of how this is working by using the
>>>>> dba.plotBox (with bAll=TRUE) comparing bNormalized=TRUE and
>>>>> bNormalized=FALSE to see if things even out. Also, after counting,
>>>>>you
>>>>> can
>>>>> look at the clustering (dba.plotPCA and dba.plotHeatmap) to see if
>>>>> samples
>>>>> are grouping by sequencing depth -- try doing the same plots with
>>>>> different score, eg score=DBA_SCORE_READS, score=DBA_SCORE_RPKM, and
>>>>> score=DBA_SCORE_TMM_READS_EFFECTIVE or
>>>>> score=DBA_SCORE_TMM_READS_FULL to
>>>>> see which gives to the best clustering.
>>>>>
>>>>> Hope this helps!
>>>>>
>>>>> Cheers-
>>>>> Rory
>>>
>>
>
>-- 
>Dr Giuseppe Gallone
>MRC career development fellow
>MRC Functional Genomics Unit - DPAG
>University of Oxford, UK