[BioC] DiffBind (Please add me to the Dropbox containing the vignette data)

Fri Feb 14 18:28:25 CET 2014

Dear Rory, 

Thanks a lot for the elaborate and clear explanation! 
It helps a lot for understanding your pipeline. 

Our experimental layout is as follows: 
Clinical epigenetic data were collected from 5 human patients, individually for each patient, before and after treatment.

The summary table for DiffBind (following the example in your tutorial) should look like this:

ID       Tissue   Factor   Condition       Replicate
aaa-p    NA       H3K9ac   Pre-treatment   1
bbb-p    NA       H3K9ac   Pre-treatment   2
ccc-p    NA       H3K9ac   Pre-treatment   3
ddd-p    NA       H3K9ac   Pre-treatment   4
eee-p    NA       H3K9ac   Pre-treatment   5
aaa-t    NA       H3K9ac   Treatment       1
bbb-t    NA       H3K9ac   Treatment       2
ccc-t    NA       H3K9ac   Treatment       3
ddd-t    NA       H3K9ac   Treatment       4
eee-t    NA       H3K9ac   Treatment       5

So for each individual patient (for example, patient "aaa") we have data from before and after treatment (aaa-p and aaa-t).
There are no repeated measurements for any single patient.
However, since we determined the status of each factor (for example, H3K9ac) across multiple patients, we should be able to treat each of the five patients as a replicate.

My question, on which I would be very glad to get your input, is this:
From what I understand from your tutorial and emails, running DiffBind on this dataset would basically treat it as two batches (two groups).
Is there currently a way to perform a paired analysis that first compares aaa-p with aaa-t, then bbb-p with bbb-t, and so on, and only then computes the statistics? Something like a paired t-test as opposed to a two-group t-test? If there is currently a way to do this with DiffBind, I would be glad to learn it from you.

In addition, I would like your opinion on what value should be set for the minOverlap parameter. Would you recommend setting it to 1 to allow exploration of peaks that are condition-specific? On the other hand, would these "singletons" ever be reported as significant, given that they are differentially deposited in only one sample out of 10?
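
To make the paired versus two-group distinction concrete, here is a minimal Python sketch (plain Python, not DiffBind code). The values are hypothetical log2 normalized counts for a single peak, and the function names paired_t and welch_t are made up for illustration. It only shows why pairing matters when patients differ a lot from one another but shift consistently after treatment; it says nothing about how DiffBind itself would model the pairing.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(before, after):
    """t statistic computed on the per-patient differences (after - before)."""
    diffs = [a - b for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def welch_t(group1, group2):
    """Two-group (Welch) t statistic; the patient pairing is ignored."""
    se = sqrt(variance(group1) / len(group1) + variance(group2) / len(group2))
    return (mean(group2) - mean(group1)) / se


# Hypothetical log2 normalized read counts for one peak in five patients
# (same patient order in both lists). These numbers are invented.
pre = [5.0, 7.0, 9.0, 6.0, 8.0]
post = [5.9, 8.1, 10.0, 7.2, 8.8]

t_paired = paired_t(pre, post)
t_welch = welch_t(pre, post)
print(f"paired t = {t_paired:.2f}, two-group t = {t_welch:.2f}")
```

With these made-up numbers the paired statistic is far larger than the two-group one, because the patient-to-patient spread cancels out in the per-patient differences.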

Finally, when I read your first two lines again:
"When counting reads overlapping an interval, DiffBind sets the value to a
minimum of one to eliminate any issues created by having zero values."

I am still confused. In the scenario of only two samples, each from a different condition, suppose an enriched peak is detected in only one condition (where it has, for example, 456 tags) but is NOT detected in the other condition and has zero tags there (not even one tag). Would this stand-alone peak be ignored by DiffBind (because no statistic can be derived for it)?

Is there currently a way to obtain from DiffBind a list of all the condition-specific peaks that have not even one tag in the other condition? (In our hands there are sometimes quite a few cases like this, and we would not like to ignore them.)
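
For reference, the arithmetic implied by the quoted sentence can be sketched in a few lines of Python. This is a simplified illustration, not DiffBind's code: the helper name floored_log2_fold is hypothetical, and normalization (which the thread says is applied before the log2 difference) is deliberately left out.

```python
from math import log2


def floored_log2_fold(count_a, count_b, floor=1):
    """log2 fold change after raising each raw count to at least `floor`."""
    a = max(count_a, floor)
    b = max(count_b, floor)
    return log2(a) - log2(b)


# The example from the question: 456 tags in one condition, none in the other.
fold = floored_log2_fold(456, 0)
print(f"log2 fold change = {fold:.2f}")
```

Because the zero is raised to one, log2 of the second count is 0 and the result equals log2(456), so a fold change is still defined for such a peak.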

I'll be really glad to get your professional input on these crucial issues.
We are trying to decide whether to use DiffBind for our project, and these aspects will weigh on that decision.

Thank you very much, 
Roy

--
Roy Blum, Ph.D.
Senior Research Scientist
Cancer Institute, Smilow Research Building,
New York University School of Medicine,
12th Floor, Room 1206
552 First Ave.
New York, NY, 10016
Mob:   +1 (646)-716-2875
Lab:    +1 (212)-263-2327
http://blumroy.googlepages.com

________________________________________
From: Rory Stark [Rory.Stark at cruk.cam.ac.uk]
Sent: Friday, February 14, 2014 11:32 AM
To: Blum, Roy
Cc: Gordon Brown; bioconductor at r-project.org
Subject: Re: DiffBind (Please add me to the Dropbox containing the vignette data)

Hi Roy-

When counting reads overlapping an interval, DiffBind sets the value to a
minimum of one to eliminate any issues created by having zero values.

The minOverlap parameter in dba.count includes all peaks that occur in at
least this many peaksets, regardless of whether they are in replicates or
different conditions. So in the case where there is only one sample
for each condition, minOverlap=2 would eliminate peaks that appear in only
one condition. But if you had two replicates of each condition,
minOverlap=2 would include peaks identified in only one condition so long
as they were identified in both replicates.
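
The counting rule described above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: peaks are assumed to be already merged into shared intervals with labels, and the names consensus, single and replicated are hypothetical, not part of DiffBind.

```python
def consensus(peak_calls, min_overlap):
    """Return the peaks called in at least `min_overlap` peaksets."""
    return sorted(
        peak for peak, samples in peak_calls.items()
        if len(samples) >= min_overlap
    )


# One sample per condition: peakA was called only in the "pre" sample.
single = {"peakA": {"pre"}, "peakB": {"pre", "post"}}

# Two replicates per condition: peakA was called in both "pre" replicates.
replicated = {
    "peakA": {"pre_1", "pre_2"},
    "peakB": {"pre_1", "post_1"},
    "peakC": {"post_2"},
}

print(consensus(single, 2))       # condition-exclusive peakA is dropped
print(consensus(single, 1))       # every called peak is kept
print(consensus(replicated, 2))   # peakA survives: two replicates agree
```

The rule counts peaksets only; it does not look at which condition a peakset belongs to, which is why peakA is kept in the replicated case.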

Currently DiffBind merges peaks that overlap by at least 1bp. The ability
to control that (e.g. 50%) has been a requested feature in the past --
actually internally, the overlapping code does handle different
overlapping percentages (including negative values for peaks near to each
other but not actually overlapping). We will consider adding this feature
in a future release.
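
As a rough picture of what merging at 1bp means, here is a short Python sketch. It is not DiffBind's internal overlap code; the function name merge_peaks is hypothetical, and the coordinate convention (inclusive start and end on a single chromosome) is an assumption made for the example.

```python
def merge_peaks(peaks):
    """Merge (start, end) intervals that share at least one base.

    Coordinates are assumed inclusive, so (100, 200) and (200, 300)
    share base 200 and are merged.
    """
    merged = []
    for start, end in sorted(peaks):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous merged peak by >= 1 bp: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


merged = merge_peaks([(301, 400), (100, 200), (200, 300)])
print(merged)
```

A stricter rule such as a 50% reciprocal overlap would replace the single comparison in the if statement; adjacent peaks like (200, 300) and (301, 400) stay separate under the 1bp rule.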

Cheers-
Rory

On 13/02/2014 22:08, "Blum, Roy" <Roy.Blum at nyumc.org> wrote:

>Dear Rory,
>
>Thanks a lot for your clarifying response!
>It helps a lot for understanding your pipeline.
>
>If I understand correctly - since dba.report calculates fold changes by
>computing log2 normalized counts in the first condition minus the log2
>normalized counts in the second condition (across each of the peaks
>presented by the two conditions - in case that minOverlap was set as
>"=1") - then even in the case of 'condition-exclusive' peaks (with zero
>tags in the peak location) we would still get a fold-change value, simply
>since we'll have a log2-normalized value minus zero, which would be equal
>to the log2 normalized value. Am I correct on this? This aspect wasn't
>very clear..
>
>In addition, if I understand correctly, using minOverlap=2 (for an
>analysis that employs one sample per condition, across two conditions)
>would tell DiffBind to ignore all the condition-exclusive peaks and to
>perform calculations only on the overlapping peaks. Am I correct on
>this?
>
>Finally, how does DiffBind define overlapping peaks? Is there a way to
>redefine this criterion? (for example based on overlap of 1bp vs. overlap
>of 50% of each peak span, etc.)
>
>Thanks a lot!!
>Roy
>
>
>________________________________________
>From: Rory Stark [Rory.Stark at cruk.cam.ac.uk]
>Sent: Thursday, February 13, 2014 3:26 PM
>To: Blum, Roy
>Cc: Gordon Brown; bioconductor at r-project.org
>Subject: Re: Please add me to the Dropbox containing the vignette data
>
>Hi Roy-
>
>First, I am obliged to discourage you from doing this type of analysis
>without replicates, for two reasons: 1) it is not good science, as
>biological and experimental variability is high in these types of
>experiments, and your samples may not be representative; and 2) because
>the statistical techniques that DiffBind relies on (embodied in the edgeR,
>DESeq, and DESeq2 packages) require replication to properly calculate
>confidence statistics.
>
>Technically, DiffBind will handle this comparison. You may want to do some
>simpler overlaps (dba.plotVenn, dba.overlap) to detect regions identified
>as enriched in only one condition. If you want to compute fold changes
>based on read counts, you can call dba.count with minOverlap=1, which will
>include all the called peaks including those that do not overlap. Then set
>up a contrast using dba.contrast with one condition as group1 and the
>other as group2 (you will be warned again about the lack of replication).
>You can call dba.analyze (again, the underlying method is likely to issue
>a warning relating to the lack of replication) to do the comparison, then
>call dba.report with th=1 to get all the fold changes, computed as the
>log2 normalized counts in the first condition minus the log2 normalized
>counts in the second condition for each interval. This report will also
>include confidence statistics that you probably shouldn't take very
>seriously for the reasons described above.
>
>Cheers-
>Rory
>
>On 13/02/2014 19:16, "Blum, Roy" <Roy.Blum at nyumc.org> wrote:
>
>>Dear Gord and Rory,
>>
>>I am exploring your DiffBind software and would like to inquire regarding
>>the following -
>>
>>I would refer to a very simple scenario in which DiffBind is loaded with
>>data of one histone mark tested across two conditions - before and after
>>treatment (no replicates for any of the conditions).
>>
>>Would it be still possible to draw the basic analysis presented in the
>>tutorial?
>>
>>In general - would condition-specific peaks (that do not overlap with a
>>corresponding peak in the other condition) still be considered as part of
>>the statistical analysis performed by DiffBind? Or is the statistical
>>analysis limited to the 'shared' peaks, reporting affinity changes only
>>within peaks that are shared between the two conditions?
>>Is there a way that DiffBind can report on all the condition-exclusive
>>peaks (ones that are deposited only in one condition but have zero
>>deposition in the other)? How would the fold change be calculated in
>>such cases?
>>
>>
>>Thanks a lot!
>>Roy
>>
>>________________________________________
>>From: Blum, Roy
>>Sent: Thursday, February 13, 2014 10:01 AM
>>To: Gordon Brown
>>Subject: RE: Please add me to the Dropbox containing the vignette data
>>
>>Hi Gord,
>>
>>Thanks for your reply and for the wonderful DiffBind tool!
>>
>>I've got the link for the data files from Rory by now.
>>Btw, this is the link:
>>https://www.dropbox.com/s/bqxnqhvr7sol1za/DiffBindVignette.zip
>>in case that someone inquires for it in the future.
>>
>>Best wishes!
>>Roy
>>
>>
>>________________________________________
>>From: Gordon Brown [Gordon.Brown at cruk.cam.ac.uk]
>>Sent: Thursday, February 13, 2014 9:24 AM
>>To: Blum, Roy
>>Subject: Re: Please add me to the Dropbox containing the vignette data
>>
>>Hi, Roy,
>>
>>Sorry for the slow response.  As far as I know, the data should be
>>publicly visible, so I suspect the error was just a transient error.  Can
>>you re-try?  (Or maybe Rory has already responded, in which case ignore
>>this...).
>>
>>Cheers,
>>
>> - Gord
>>
>>
>>On 2014-02-10 18:11, "Blum, Roy" <Roy.Blum at nyumc.org> wrote:
>>
>>>Dear Gordon,
>>>
>>>
>>>I am currently interested in learning how to use your DiffBind software.
>>>
>>>
>>>Would you kindly add me to the Dropbox containing the vignette data?
>>>
>>>
>>>My attempt to execute the command line:
>>>source(file.path(system.file("extra", package="DiffBind"), "tamoxifen_GEO.R"))
>>>failed ....
>>>
>>>Here's the output which was plotted on my R screen:
>>>Thanks a lot in advance!  (Rory Stark seems to be away..)
>>>
>>>
>>>Roy Blum
>>>
>>>The email address which I use for my Dropbox activity is:
>>>blumroy at gmail.com (please add this email address as well!, Thanks!)
>>>
>>>
>>>
>>>
>>>
>>>> source(file.path(system.file("extra", package="DiffBind"), "tamoxifen_GEO.R"))
>>>Loading required package: Biobase
>>>Welcome to Bioconductor
>>>
>>>
>>>    Vignettes contain introductory material; view with
>>>    'browseVignettes()'. To cite Bioconductor, see
>>>    'citation("Biobase")', and for packages 'citation("pkgname")'.
>>>
>>>
>>>
>>>
>>>Attaching package: 'Biobase'
>>>
>>>
>>>The following object is masked _by_ '.GlobalEnv':
>>>
>>>
>>>    exprs
>>>
>>>
>>>Setting options('download.file.method.GEOquery'='auto')
>>>[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798430/suppl/"
>>>trying URL
>>>'ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798430/suppl//GSM798430_SLX-2645.443.s_5_SLX-2577.443.s_8_peaks.txt.gz'
>>>ftp data connection made, file length 889489 bytes
>>>opened URL
>>>downloaded 868 Kb
>>>
>>>
>>>[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798431/suppl/"
>>>trying URL
>>>'ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798431/suppl//GSM798431_SLX-2576.443.s_7_SLX-2577.443.s_8_peaks.txt.gz'
>>>ftp data connection made, file length 863440 bytes
>>>opened URL
>>>downloaded 843 Kb
>>>
>>>
>>>[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798443/suppl/"
>>>No supplemental files found
>>>[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798440/suppl/"
>>>No supplemental files found
>>>[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798423/suppl/"
>>>trying URL
>>>'ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798423/suppl//GSM798423_SLX-2640.438.s_1_SLX-2574.433.s_2_peaks.txt.gz'
>>>ftp data connection made, file length 1566858 bytes
>>>opened URL
>>>downloaded 1.5 Mb
>>>
>>>
>>>[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798424/suppl/"
>>>trying URL
>>>'ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798424/suppl//GSM798424_SLX-2773.448.s_1_SLX-2574.433.s_2_peaks.txt.gz'
>>>ftp data connection made, file length 1047867 bytes
>>>opened URL
>>>downloaded 1023 Kb
>>>
>>>
>>>[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798425/suppl/"
>>>trying URL
>>>'ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798425/suppl//GSM798425_SLX-2943.469.s_2_SLX-2574.433.s_2_peaks.txt.gz'
>>>ftp data connection made, file length 1436673 bytes
>>>opened URL
>>>downloaded 1.4 Mb
>>>
>>>
>>>[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798428/suppl/"
>>>trying URL
>>>'ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798428/suppl//GSM798428_SLX-2775.448.s_3_T47D_Input_peaks.txt.gz'
>>>ftp data connection made, file length 621444 bytes
>>>opened URL
>>>downloaded 606 Kb
>>>
>>>
>>>[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798429/suppl/"
>>>trying URL
>>>'ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798429/suppl//GSM798429_SLX-2867.466.s_6_T47D_Input_peaks.txt.gz'
>>>ftp data connection made, file length 508000 bytes
>>>opened URL
>>>downloaded 496 Kb
>>>
>>>
>>>[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798442/suppl/"
>>>No supplemental files found
>>>[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798432/suppl/"
>>>trying URL
>>>'ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798432/suppl//GSM798432_SLX-3229.521.s_5_SLX-1651.307.s_1_peaks.txt.gz'
>>>ftp data connection made, file length 1099858 bytes
>>>opened URL
>>>downloaded 1.0 Mb
>>>
>>>
>>>[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798433/suppl/"
>>>trying URL
>>>'ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798433/suppl//GSM798433_SLX-3230.526.s_4_SLX-3231.526.s_5_peaks.txt.gz'
>>>Error in download.file(file.path(url, i), destfile = file.path(storedir, :
>>>  cannot open URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM798nnn/GSM798433/suppl//GSM798433_SLX-3230.526.s_4_SLX-3231.526.s_5_peaks.txt.gz'
>>>
>>>
>>>
>>>
>>>
>>>________________________________________
>>>From: Rory Stark [Rory.Stark at cruk.cam.ac.uk]
>>>Sent: Monday, February 10, 2014 11:39 AM
>>>To: Blum, Roy
>>>Subject: Automatic reply: Please add me to the Dropbox containing the
>>>vignette data
>>>
>>>
>>>I am out of the office until 3 January. If it is urgent, please contact
>>>Matt Eldridge.
>>>
>>>
>>>
>>>
>>>
>>>------------------------------------------------------------
>>>This email message, including any attachments, is for the sole use of
>>>the
>>>intended recipient(s) and may contain information that is proprietary,
>>>confidential, and exempt from disclosure under applicable law. Any
>>>unauthorized review, use, disclosure, or distribution
>>> is prohibited. If you have received this email in error please notify
>>>the sender by return email and delete the original message. Please note,
>>>the recipient should check this email and any attachments for the
>>>presence of viruses. The organization accepts no
>>> liability for any damage caused by any virus transmitted by this email.
>>>=================================
>>>
>>>
>>
>>
>>
>
>
>
