[BioC] DiffBind questions

Thu Aug 29 18:55:07 CEST 2013

In answer to your first question, for the initial occupancy (peak caller)
matrix, DiffBind has always used the scores form the peak caller,
normalized to be between 0 and 1,  with a score of 1 for peaks not called
for a sample. These scores are used for the correlations. Often the
clustering at this level is driven by overall signal levels: if many of
the samples have few peaks, and some have many peaks (or vice-versa), the
ones with a many/few peaks can form their own cluster even if they don;t
really have that much in common.

As for your analysis, one observation I can make is that overall there is
lower correlation between the samples than most of the more successful
analyses I have done. This makes sense as you are comparing some quite
different transcription factors (HNF6, FOXA1, CTCF, etc) and histone marks
(H3K27ac, H34me2, H3K36me3 etc.). It is not surprising that different
normalization scores are producing different results in this case!

While I have also found that DBA_SCORE_RPKM_FOLD frequently gives a
clustering I find intuitive, the default score,
DBA_SCORE_TMM_MINUS_EFFECTIVE, only really works if a key assumption is
met: namely that the samples have roughly similar overall levels of
binding activity. In many cases, this is not the case -- especially when
you are comparing so many different factors/marks. If for example you save
a high variable in the SN scores after counting, that can be an indication
of this. I am planning on changing the default score to be
DBA_SCORE_TMM_MINUS_FULL, which normalizes to the full depth of sequencing
for each library, rather than the number of reads that overlap the
consensus peaks. I suggest you try that score and see how it compares to
RPKM_FOLD.

Also, from the labels I think perhaps you have combines replicates of
specific factors/marks? You may want to break them out so you have
replicates of each ChIP, if only as a control to see that the replicates
cluster together (with high correlation values).

Cheers-
Rory

From:  Huayun Hou <huayunhou at gmail.com>
Date:  Wed, 28 Aug 2013 16:46:38 -0400
To:  Rory Stark <rory.stark at cruk.cam.ac.uk>
Subject:  Re: DiffBind questions

Hi Dr. Stark,

I've been using DiffBind for my analysis and I have a question about the
occupancy analysis in DiffBind. After reading in the dba object from
sample sheet, a correlation heatmap can be generated base on peaks set
data given. Command as follows:

>Data <- dba(sampleSheet="sampleSheet.csv")

>plot(Data)

To my understanding, in the previous version, only the existence of peaks
are taken into account, which means essentially the heatmap was made by
pair-wise correlation of vectors consist of 0 and 1. But in the new
version (1.6.2) I found that the vectors get from peak sets contains also
a number (rather than 0 and 1). I might lost when reading the manual but i
was wondering if peak score given in the peak sets are also taken into
account when the correlation is made?

I was trying to understand the global binding correlations between all the
factors I'm interested in so I did the correlation heatmap base on the
read count. After dba.count, similar correlation heatmaps are made, but by
using the following commands, that is using different scores I got very
different correlations.

dba.plotHeatmap(data_count, ColAttributes=NULL, correlations=T,
socre=RPKM_FOLD)
dba.plotHeatmap(data_count, ColAttributes=NULL, correlations=T)
                        (default socre method)

I attached the heatmap generated in each step. Files labelled with "TMM"
and "RPKMfold" correspond to default score method and RPKM_FOLD. The one
uses RPKM_FOLD makes more sense to me. The mouse liver chip-seq data was
taken from Fuare et. al (2012) and also generated from Odam lab.

I wonder if you could given me any suggestions or commands on why the
results look different and which one makes more sense in terms of my
analysis. 

Thanks a lot and please let me know if anything is not clear.

Best Regards,
Huayun Hou

2013/5/4 Rory Stark <Rory.Stark at cruk.cam.ac.uk>

Hello-

1. There are two ways to use less memory in dba.count. The best way is to
update to the latest version of DiffBind in Bioconductor 2.12. dba.count
now has a parameter to do the counting with less memory (but a bit
slower): bLowMem = TRUE.
The other option is to run dba.count serially so it uses all available
memory for each file. You can set bParallel=FALSE to do this -- it is a
lot slower though! You can also try to lower the number of cores being
used when it runs in parallel, e.g.

> DBA$config$cores = 2

2. To get the numbers of peaks in each overlap, you need to set
bCorOnly=FALSE. If there is a place in the documentation where I missed
this, could you point it out to me so I can fix it?

3. In the case of plotting peaks (without counts), DiffBind first merges
all the overlapping peaks to form a master peak list. To construct the
vector for each peakset, it assigns a value of -1 for each peak that was
not called in that peakset, and the
 score specified by the peak caller (normalized to be between 0 and 1) for
peaks that are present. (If more than one peak got merged together for the
peakset, it uses the highest score). As you say, it then computes the
correlation between each pair of vectors
 for the heatmap. 

Once you have run dba.count, it uses scores based on the counts. Counts
are all set to a minimum of 1 (in the case where there are no reads or
more reads in the control than the experiment). If there are negative
correlation values, this is not because
 there are negative counts, but because the peak vectors are negatively
correlated. If you have peaks with low values for all the samples, you can
filter them out using the filter option in dba.count (you can do this
without re-counting by setting peaks=NULL).
 The latest version has more advanced filtering options than the version
you are using.

The issue with the merge procedure making the peaks very wide, especially
in the case where you are looking at a lot of different marks
simultaneously, is a legitimate one. You may want to compare the
distribution of peak widths before and after the merge..
 We have run successful analyses looking at very wide regions of
enrichment (e.g. 500kb windows in Chandra et. al. Molecular Cell 2012). If
you really want to look at different patterns of marks within a region
(e.g. a promoter), you may want to take a look
 at the MMDiff package -- you can use your existing DiffBind objects in
MMDiff.

4. Currently DiffBind will merge together any peaks that overlap by at
least one base, and you can not change it. Actually, internally, we do
have a parameter controlling how many bases the overlap should be (which
can also be used to merge peaks that
 are close but not actually overlapping), but we have not exposed it in
the interface. We are looking at features that enable you to have better
control the merging function. I could probably expose the "gap" parameter
pretty easily in the development version
 and make that available to you quickly if you are interested.

Do say hello to Mike for me!

Cheers-
Rory
________________________________________
From: Huayun Hou [huayunhou at gmail.com]
Sent: 03 May 2013 14:00
To: rory.stark at cancer.org.uk
Subject: DiffBind questions

Dear Dr. Stark,

I'm a master student in Dr. Michael Wilson's lab. We've been using
DiffBind to analyze our ChIP-seq data. Thanks for this very handy tools.
But I have a few questions regarding some of the functions.
1. Memory allocation error when using dba.count()
I have a dataset of 22 samples, each around 10-15 million reads. When I
use dba.count() I always get error: cannot allocate vector of size ...
I'm using R on cluster, sessionInfo()
sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] DiffBind_1.2.4       Biobase_2.16.0       GenomicRanges_1.8.13
[4] IRanges_1.14.4       BiocGenerics_0.2.0

loaded via a namespace (and not attached):
[1] amap_0.8-7         edgeR_2.6.12       gdata_2.12.0       gplots_2.11.0

[5] gtools_2.7.0       limma_3.12.3       RColorBrewer_1.0-5 stats4_2.15.0

[9] zlibbioc_1.2.0 

I've asked for the newest version of DiffBind but I wonder if you could
give any suggestions on working with large dataset with DiffBind. Do I
need do some settin g in R or I just need more local memory, for example,
run on a super computer.

2. dba.overlap() doesn't give peak numbers.
I did:  dba.overlap(test,c(1,2),mode=DBA_OLAP_ALL)

the result:  A         B     onlyA     onlyB     inAll       Cor   Overlap
1.0000000 2.0000000 0.0000000 0.0000000 0.0000000 0.9999698 0.0000000
It doesn't give the number of peaks as described in the manual.

3. How is the correlation calculated?
When plotting the correlation heatmap using
dba(sampleSheet)
plot()
My understanding is the package merges all the peak regions from all the
samples to generate a master peakset, and ask if each peak region in each
sample is in that master peakset or not. Then calculate the correlation
between 2 vectors of 0 and
 1 and plot the heatmap with this correlation matrix.
One concern for me is I'm comparing different factors which may or may not
have different binding profiles. Will the merging step generate too broad
peak regions which just cover everything? Or if the master peaks region
has too many peaks and
 for all the samples there are too many 0s in their peaks vector. Will
they be correlated because of regions where there are no peaks instead of
where there are peaks?
Also, is the dba.count() and dba.analyze() using the same approach? Then
there may be some regions for some samples with a negative RPKM-cRPKM
because those regions are not called "peaks" originally. Then the result
correlation matrix may have
 some negative correlation values in it. I wonder if that comparison is
fair in my case (I pooled all biological replicates before loading
datasets to DiffBind, so I'm comparing biding between different vectors
without replicates) ?

4. Can the overlap threshold be changed?
Currently I think the package uses minoverlap 1bp as the threshold for
called 2 peaks overlap. Can I change this threshold for my own convenience
?

Thanks again for the great tool! I'm looking forward for your reply.

Best Regards,
Huayun Hou

-- 
Huayun Hou
Department of Molecular Genetics
University of Toronto

B.Sc. Life Science and Psychology
Peking University

NOTICE AND DISCLAIMER
This e-mail (including any attachments) is intended for the above-named
person(s). If you are not the intended recipient, notify the sender
immediately, delete this email from your system and do not disclose or use
for any purpose.

We may monitor all incoming and outgoing emails in line with current
legislation. We have taken steps to ensure that this email and attachments
are free from any virus, but it remains your responsibility to ensure that
viruses do not adversely affect you.

Cancer Research UK
Registered charity in England and Wales (1089464), Scotland (SC041666) and
the Isle of Man (1103)
A company limited by guarantee. Registered company in England and Wales
(4325234) and the Isle of Man (5713F).
Registered Office Address: Angel Building, 407 St John Street, London EC1V
4AD.

-- 
Huayun Hou
Department of Molecular Genetics
University of Toronto

B.Sc. Life Science and Psychology
Peking University