[BioC] diffbind, paired end reads

Gordon Brown Gordon.Brown at cruk.cam.ac.uk
Fri Jun 21 14:45:01 CEST 2013


Hi,

Not sure what you mean by "mapping settings".  Do you mean filtering by
mapping quality?  If so, well, I've never investigated it with any care,
but we certainly do see dramatic changes in peak calling, hence also in
downstream analysis, based on inclusion or exclusion of multiply-mapped
reads.  These days I filter quite aggressively prior to peak calling:

1) Remove reads with mapping quality less than 5

2) Remove reads that overlap with ENCODE's "black listed" regions
http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability


3) Remove reads from regions that show abnormally high signal in the
input/control

Since MACS in particular depends heavily on an accurate estimate of the
insert size, taking out all these noise regions first seems to give more
reliable, consistent results, and for our work it's better to miss a few
regions in exchange for overall consistency.

But if you meant something else by "mapping settings", then ignore all
that!

Cheers,

 - Gord


On 2013-06-21 09:18, "Kasper Daniel Hansen" <kasperdanielhansen at gmail.com>
wrote:

>Both, thanks a lot for your help.  I understand the problems inherent in
>dealing with this, I was just asking how it is currently handled.  I
>suggest expanding the relevant section in the vignette, giving some more
>detail, but perhaps I am the
> only one wondering about this.
>
>
>I ended up doing a fair amount by hand myself, so I was in control.  I
>was actually surprised to learn that the mapping settings had a dramatic
>effect on the results (going from 1500 DE peaks to 5000 DE peaks) and to
>some extent the overlapping
> calculation as well.  Is this something you have investigated?
>
>
>Best,
>Kasper
>
>
>
>On Tue, Jun 18, 2013 at 10:51 AM, Gordon Brown
><Gordon.Brown at cruk.cam.ac.uk> wrote:
>
>Hi, Rory, Kasper,
>
>DiffBind's original counting code treats paired-end data the same as
>single-end, i.e. each read is considered separately, with no attention
>paid to whether it's part of a pair.  As long as all the libraries are the
>same (all S.E. or all P.E.) it shouldn't affect the outcome; counts will
>just be double for P.E. cases (disregarding improperly paired reads).
>
>To fix it properly, since we accept bed as well as bam, we'd have to test
>for properly paired reads ourselves, possibly doing it just that bit
>differently from other tools.  Personally, I'd rather not open that box...
>
>Cheers,
>
> - Gord
>
>
>On 2013-06-17 23:55, "Rory Stark" <Rory.Stark at cruk.cam.ac.uk> wrote:
>
>>Hi Kasper-
>>
>>From 1.6 on, if the bLowMem parameter is set to TRUE in dba.count,
>>DiffBind will use summarizeOverlaps. Changing the config value
>>DBA$config$singleEnd to FALSE allows it to use paired-end BAM files. The
>>documentation for summarizeOverlaps in the GenomicRanges
>> package explains how it handles paired-end data.
>>
>>Gord, how does the default count code handle paired-end data (in pre-1.6
>>DiffBind, or when bLowMem=FALSE)?
>>
>>Cheers-
>>Rory
>>
>
>>________________________________________
>>From: Kasper Daniel Hansen [kasperdanielhansen at gmail.com]
>>Sent: 17 June 2013 21:51
>>To: Rory Stark
>>Subject: diffbind, paired end reads
>>
>>
>>Hi Rory
>>
>>
>>Just skimming DiffBind.  I was surprised there is not a longer discussion
>>of dba.count() in the vignette regarding supported input formats.  I also
>>read (skimmed) the man pages.  It is not clear to me if DiffBind supports
>>paired end reads in the counting
>> step (and also how it deals with for example a paired end read where
>>only one mate aligns).
>>
>>
>>Hope all is well in Cambridge.
>>
>>
>>Best,
>>Kasper
>
>
>
>
>
>
>
>



More information about the Bioconductor mailing list