[BioC] nearly exact overlaps between GAlignmentPairs and GenomicRanges

Michael Lawrence michafla at gene.com
Fri Aug 22 06:48:33 CEST 2014


The summarizeOverlaps function supports passing a function as the "mode"
argument, which needs to take the two sets of ranges as input and return a
vector of counts. So the problem reduces to counting the "within" overlaps
between a GRanges and a GAlignmentPairs, where the range of interest should
only be slightly larger than the paired read alignment.

The closest built-in overlap mode to this is probably type="equal" and
maxgap=2L. But it's not quite right, because that mode is symmetric in its
tolerance, i.e., the read could be slightly larger than the range of
interest. The other problem is that for some reason GAlignmentPairs does
not yet support type="equal", but that is actually solvable by reducing the
GAlignmentPairs to a GRanges via granges(galp). We can do that in this
case, since the ranges of interest are gapless ranges.

But to get exactly what you want, we need to use type="within" with
findOverlaps() and then filter the hits for cases where the overhang is
below some tolerance (3bp):

reads <- granges(galp) # simplifies things a bit
hits <- findOverlaps(reads, ranges, type="within")
overhang <- width(ranges[subjectHits(hits)]) - width(reads[queryHits(hits)])
hits <- hits[overhang < 3L]

Now we need the count vector:

counts <- countSubjectHits(hits)

Then it's just a simple exercise to insert that code into a function that
conforms to the expectations of the "mode" parameter.

Hope that helps,
Michael

On Thu, Aug 21, 2014 at 12:59 PM, Georg Otto <georg.otto at imm.ox.ac.uk>
wrote:

>
> Dear all,
>
> I have a problem finding nerly exact overlaps between read pairs and
> genomic regions. I try this by using GAlignmentPairs and GenomicRanges:
>
> > library(Rsamtools) ex1_file <- system.file("extdata", "ex1.bam",
> > package="Rsamtools") galp <- readGAlignmentPairs(ex1_file,
> use.names=TRUE)
> > galp[3:4]
>
> GAlignmentPairs with 2 alignment pairs and 0 metadata columns:
>                        seqnames strand : ranges -- ranges
>                           <Rle> <Rle> : <IRanges> -- <IRanges>
> EAS54_65:3:321:311:983 seq1 + : [51, 85] -- [228, 262]
>    B7_591:5:42:540:501 seq1 + : [60, 95] -- [224, 259]
> ---
> seqlengths:
>  seq1 seq2 1575 1584
>
>
> > ranges <- GRanges(c("seq1", "seq1"), IRanges(c(50, 57), end = c(263,
> 260)))
> > ranges
> GRanges with 2 ranges and 0 metadata columns:
>       seqnames    ranges strand
>          <Rle> <IRanges>  <Rle>
>   [1]     seq1 [50, 263]      *
>   [2]     seq1 [57, 260]      *
>   ---
>   seqlengths:
>    seq1
>      NA
> > summary <-summarizeOverlaps(ranges, galp[3:4], mode =
> "IntersectionStrict",
> +                             inter.feature = FALSE, fragments = FALSE)
> Warning message:
> In if (all(strand(reads) == "*")) ignore.strand <- TRUE :
>   the condition has length > 1 and only the first element will be used
> > assays(summary)$counts
>      reads
> [1,]     2
> [2,]     1
>
>
> I am not entirely sure what the warning means, but my point is a
> different one: I use the "IntersectStrict" mode because I only want to
> count read
> pairs that are completely within the interval. This is what I
> get, but I want to use a more strict selection for "almost equal",
> i.e. intervals that are the same size or only slightly larger (say 3 bp)
> than the range of the read pairs. So ranges[1] should count galp[3], but
> not galp[4]
>
> Any hint will be appreciated....
>
> Best wishes,
>
> Georg
>
> > sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] Rsamtools_1.14.3     Biostrings_2.30.1    GenomicRanges_1.14.4
> [4] XVector_0.2.0        IRanges_1.20.7       BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-6   compiler_3.0.1 stats4_3.0.1   tools_3.0.1
> zlibbioc_1.8.0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

	[[alternative HTML version deleted]]



More information about the Bioconductor mailing list