[BioC] DiffBind - sample sheet for multiple replicates and peak

Rory Stark Rory.Stark at cancer.org.uk
Fri Sep 14 13:06:09 CEST 2012


Hello António-

Regarding your sample sheet, you should be able to use the same sample ID and the same read files for each sample, and specify the peak caller as MACS and QuEST as in your initial sample sheet. This should load two sets of peaks for each sample, differing only in peak caller name (as DiffBind doesn't recognize the MACS or QuEST strings, it will default to raw, which is the format you are using). Does this work for you? I've attached an example sample sheet using the sample tamoxifen data included with DiffBind.
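
For readers without the attachment, here is a minimal sketch of how a sheet with this layout could be assembled in base R. All paths and the "treat" sample names are placeholders, and the caller column name PeakCaller is an assumption (the sheet in the original question used Peak.caller); check ?dba for the column names your DiffBind version expects.

```r
# One entry per biological sample (hypothetical names and conditions).
samples <- data.frame(
  SampleID  = c("control1", "control2", "treat1", "treat2"),
  Tissue    = "Hela",
  Factor    = "TF",
  Condition = c("wt", "wt", "treated", "treated"),
  Replicate = c(1, 2, 1, 2),
  stringsAsFactors = FALSE
)
callers <- c("MACS", "QuEST")

# Repeat each sample once per peak caller: same ID, same read files.
sheet <- samples[rep(seq_len(nrow(samples)), each = length(callers)), ]
sheet$PeakCaller <- rep(callers, times = nrow(samples))  # assumed column name

# Placeholder paths: reads are shared per sample, peak files differ per caller.
sheet$bamReads   <- file.path("reads", paste0(sheet$SampleID, ".bam"))
sheet$bamControl <- file.path("reads", paste0(sheet$SampleID, "_input.bam"))
sheet$Peaks      <- file.path("peaks",
                              paste0(sheet$SampleID, "_", sheet$PeakCaller, ".bed"))

# Written to a temporary directory here; use your project directory in practice.
sheet_file <- file.path(tempdir(), "samplesheet.csv")
write.csv(sheet, sheet_file, row.names = FALSE, quote = FALSE)
```

The resulting file could then be passed to dba(sampleSheet=...) as in the original question.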

The next issue is deriving "a set of common peaks for the peak callers". You can do this using all the peaks from all the callers for all the replicates by invoking dba.count directly (by default, any peak that is identified at least twice is included; these could come from two peak callers for a single replicate, or from single peak callers for different replicates). You may want to perform an intermediate step of deriving consensus peaksets for each replicate. In the development version of DiffBind, this is easily done with the call:

data = dba.peakset(data, consensus = -DBA_CALLER)

This will add a consensus peakset for each sample (consisting of the peaks identified by both peak callers for that replicate). I have added a feature to dba.count to make it easy to use only these peaksets in deriving the master consensus peakset used for counting:

data = dba.count(data, peaks=data$masks$Consensus)

But this won't be in the build for a few more days. In the meantime you have to retrieve the peakset and pass it to dba.count:

# make a new DBA object with only the consensus peaksets
consensus = dba(data, data$masks$Consensus)
# retrieve the master consensus peakset
conspeaks = dba.peakset(consensus, bRetrieve=TRUE)
# pass in the master consensus peakset for counting
data = dba.count(data, peaks=conspeaks)
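
To make the "identified at least twice" rule mentioned earlier concrete, here is a toy sketch in base R. It is not DiffBind's implementation, whose merging and overlap rules may differ in detail; the coordinates are made up, a single chromosome is assumed, and the name min_overlap is illustrative only.

```r
# Toy peaksets: two callers x two replicates (made-up coordinates).
peaksets <- list(
  rep1_MACS  = data.frame(start = c(100, 500),  end = c(200, 600)),
  rep1_QuEST = data.frame(start = c(150, 900),  end = c(250, 950)),
  rep2_MACS  = data.frame(start = c(520, 2000), end = c(580, 2100)),
  rep2_QuEST = data.frame(start = 3000,         end = 3100)
)

# Stack all intervals, remembering which peakset each one came from.
tagged <- Map(function(p, id) data.frame(p, set = id, stringsAsFactors = FALSE),
              peaksets, names(peaksets))
all_peaks <- do.call(rbind, tagged)
all_peaks <- all_peaks[order(all_peaks$start), ]

# A new merged region starts when an interval begins after
# every interval seen so far has ended.
running_end <- cummax(all_peaks$end)
new_region  <- c(TRUE, all_peaks$start[-1] > running_end[-nrow(all_peaks)])
region_id   <- cumsum(new_region)

# Summarise each merged region and count the distinct peaksets supporting it.
merged <- data.frame(
  start  = as.vector(tapply(all_peaks$start, region_id, min)),
  end    = as.vector(tapply(all_peaks$end, region_id, max)),
  n_sets = as.vector(tapply(all_peaks$set, region_id,
                            function(s) length(unique(s))))
)

# Keep regions seen in at least two peaksets.
min_overlap <- 2
consensus_toy <- merged[merged$n_sets >= min_overlap, ]
```

In this toy data two regions survive: one supported by both callers in replicate 1, and one supported by MACS in both replicates. These are the two situations described above. Deriving per-replicate consensus peaksets first would keep only the former kind.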

As Gord said in another response, the reads must be in BED or BAM format (possibly zipped); you can easily convert SAM files to BAM files.
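
One way to do the SAM to BAM conversion from within R is sketched below, assuming the Bioconductor package Rsamtools is installed (the samtools command-line tool is an equally valid route). The file names are hypothetical, based on the sample IDs shown in the quoted message; files that do not exist are skipped.

```r
# Hypothetical SAM files, one per sample; replace with your own paths.
sam_files <- c("wt1.sam", "wt2.sam", "treat1.sam", "treat2.sam")

# asBam() appends ".bam" to the destination itself, so pass only the prefix.
bam_prefix <- sub("\\.sam$", "", sam_files)

if (requireNamespace("Rsamtools", quietly = TRUE)) {
  for (i in seq_along(sam_files)) {
    if (file.exists(sam_files[i])) {
      # Creates a BAM file and, by default, its index.
      Rsamtools::asBam(sam_files[i], bam_prefix[i], overwrite = TRUE)
    }
  }
}

# These are the paths that would then go in the bamReads column.
bam_files <- paste0(bam_prefix, ".bam")
```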

Hope this helps!

Cheers-
Rory


On 14/09/2012 11:00, "bioconductor-request at r-project.org" <bioconductor-request at r-project.org> wrote:


------------------------------

Message: 10
Date: Thu, 13 Sep 2012 18:06:05 +0200
From: António Miguel de Jesus Domingues <amjdomingues at gmail.com>
To: bioconductor at r-project.org
Subject: [BioC] DiffBind - sample sheet for multiple replicates and
        peak callers
Message-ID:
        <CAPaCvoBu89nEKmnu-2joJcZybqWdHd=ut=b5O4hL10k+axXgNQ at mail.gmail.com>
Content-Type: text/plain

Hi all,

I am trying to use DiffBind to compare peaks called in control vs
condition. I have 2 replicates for each and I've also called peaks using 2
different peak callers (to wit, MACS and QuEST). I've also prepared a sample
data sheet that looks like this:

SampleID  Tissue  Factor  Condition  Replicate  Peak.caller  bamReads  bamControl  Peaks
control   Hela    TF      wt         1          MACS         path      path        path
control   Hela    TF      wt         1          QuEST        path      path        path
control2  Hela    TF      wt         2          MACS         path      path        path
control2  Hela    TF      wt         2          QuEST        path      path        path
(and the same for the conditions)

My plan was to load all the data and then use DiffBind to select a set of
common peaks for the peak callers before proceeding with the analysis.
However, when I load the data (data = dba(sampleSheet="samplesheet.csv"))
the peaks for each caller are not recognized as a different variable. How
can I do that, and is this silly?

I could also derive a set of common peaks independently but it would be
neat to do it all with the same package and that seems to be possible but I
could not find how to do it in the documentation.

Thanks,
António


--
--
António Miguel de Jesus Domingues, PhD
Neugebauer group
Max Planck Institute of Molecular Cell Biology and Genetics, Dresden
Pfotenhauerstrasse 108
01307 Dresden
Germany

e-mail: domingue at mpi-cbg.de
tel. +49 351 210 2481
The Unbearable Lightness of Molecular Biology




------------------------------

------------------------------

Message: 20
Date: Fri, 14 Sep 2012 11:23:26 +0200
From: Paolo Kunderfranco <paolo.kunderfranco at gmail.com>
To: bioconductor at r-project.org
Subject: Re: [BioC] DiffBind - sample sheet for multiple replicates
        and peak
Message-ID:
        <CAGxWFc-bYWWW9mJMR7nAdcofNLJ=ZyM3sssbc2mYNyDhE5+haQ at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Dear Antonio

I think that this problem was resolved in previous messages; have a look at:

https://stat.ethz.ch/pipermail/bioconductor/2012-August/047351.html?

For the complete code you could browse the August mailing list archive.

SampleID should be unique for each sample, and moreover each bam file
should also be unique, so you should make a copy of all your bamReads and
bamControl files.
In my example, I compared 2 different peak callers in 3 cell lines:


SampleID,Tissue,Factor,Condition,Replicate,bamReads,bamControl,Peaks
mES_H3K27me3_m,ES,H3K27,mES_H3K27me3,1,reads/H3K27me3/ES_H3K27me3_m.bed,reads/H3K27me3/ES_input_m.bed,peaks/H3K27me3_ES_M.bed
CMp_H3K27me3_m,CMN,H3K27,CMp_H3K27me3,1,reads/H3K27me3/CMN_H3K27me3_m.bed,reads/H3K27me3/CMN_input_m.bed,peaks/H3K27me3_CMN_M.bed
CMa_H3K27me3_m,CMA,H3K27,CMa_H3K27me3,1,reads/H3K27me3/CMA_H3K27me3_m.bed,reads/H3K27me3/CMA_input_m.bed,peaks/H3K27me3_CMA_M.bed
mES_H3K27me3_s,ES,H3K27,mES_H3K27me3,2,reads/H3K27me3/ES_H3K27me3_s.bed,reads/H3K27me3/ES_input_s.bed,peaks/H3K27me3_ES_S.bed
CMp_H3K27me3_s,CMN,H3K27,CMp_H3K27me3,2,reads/H3K27me3/CMN_H3K27me3_s.bed,reads/H3K27me3/CMN_input_s.bed,peaks/H3K27me3_CMN_S.bed
CMa_H3K27me3_s,CMA,H3K27,CMa_H3K27me3,2,reads/H3K27me3/CMA_H3K27me3_s.bed,reads/H3K27me3/CMA_input_s.bed,peaks/H3K27me3_CMA_S.bed


Like this it should work,

Cheers,
Paolo



------------------------------

Message: 21
Date: Fri, 14 Sep 2012 11:58:09 +0200
From: António Miguel de Jesus Domingues <amjdomingues at gmail.com>
To: bioconductor at r-project.org
Subject: [BioC] DiffBind - error in dba.count
Message-ID:
        <CAPaCvoC+9hGAXZaTgJ7J3GUKTEmaummn6SjhoXM1B4YGwVR4RQ at mail.gmail.com>
Content-Type: text/plain

Hi again,

I am trying DiffBind and loaded my data that looks like this:

H3K4m3
4 Samples, 13203 sites in matrix (13792 total):
         ID Tissue  Factor  Condition Peak.caller Replicate Intervals
1  wt1   Hela H3K4me3  control1         raw         1     14111
2  wt2   Hela H3K4me3  control2         raw         2     13771
3 treat1   Hela H3K4me3 condition1         raw         1     14865
4 treat2   Hela H3K4me3 condition2         raw         2     13393

But I ran into problems trying to calculate the affinity scores with
dba.count:

H3K4m3 = dba.count(H3K4m3)
Error in cond$counts : $ operator is invalid for atomic vectors
In addition: Warning message:
In mclapply(arglist, fn, ..., mc.preschedule = FALSE) :
  6 function calls resulted in an error

The peaks are in bed files (chr, start, end, score) and the reads are in
SAM format.

Can anyone help me with this?

Cheers.
António

sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] DiffBind_1.0.9 Biobase_2.14.0

loaded via a namespace (and not attached):
[1] IRanges_1.12.6     RColorBrewer_1.0-5 amap_0.8-7         edgeR_2.4.6
[5] gdata_2.11.0       gplots_2.11.0      gtools_2.7.0       limma_3.10.3
[9] zlibbioc_1.0.1





------------------------------

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor


End of Bioconductor Digest, Vol 115, Issue 14
*********************************************

