[BioC] nearly exact overlaps between GAlignmentPairs and GenomicRanges

Valerie Obenchain vobencha at fhcrc.org
Fri Aug 22 18:06:29 CEST 2014


Hi Georg,

I noticed your R and Bioconductor packages are out of date - you may 
want to update.

Once you update you'll see that summarizeOverlaps() has moved from 
GenomicRanges to GenomicAlignments package. I've fixed the 
strand-related warning you see below in GenomicAlignments release 
(1.0.6) and devel (1.1.26).

Let us know if you have trouble wrapping Michael's suggestion into a 
mode function.

Valerie



On 08/21/2014 09:48 PM, Michael Lawrence wrote:
> The summarizeOverlaps function supports passing a function as the "mode"
> argument, which needs to take the two sets of ranges as input and return a
> vector of counts. So the problem reduces to counting the "within" overlaps
> between a GRanges and a GAlignmentPairs, where the range of interest should
> only be slightly larger than the paired read alignment.
>
> The closest built-in overlap mode to this is probably type="equal" and
> maxgap=2L. But it's not quite right, because that mode is symmetric in its
> tolerance, i.e., the read could be slightly larger than the range of
> interest. The other problem is that for some reason GAlignmentPairs does
> not yet support type="equal", but that is actually solvable by reducing the
> GAlignmentPairs to a GRanges via granges(galp). We can do that in this
> case, since the ranges of interest are gapless ranges.
>
> But to get exactly what you want, we need to use type="within" with
> findOverlaps() and then filter the hits for cases where the overhang is
> below some tolerance (3bp):
>
> reads <- granges(galp) # simplifies things a bit
> hits <- findOverlaps(reads, ranges, type="within")
> overhang <- width(ranges[subjectHits(hits)]) - width(reads[queryHits(hits)])
> hits <- hits[overhang < 3L]
>
> Now we need the count vector:
>
> counts <- countSubjectHits(hits)
>
> Then it's just a simple exercise to insert that code into a function that
> conforms to the expectations of the "mode" parameter.
>
> Hope that helps,
> Michael
>
> On Thu, Aug 21, 2014 at 12:59 PM, Georg Otto <georg.otto at imm.ox.ac.uk>
> wrote:
>
>>
>> Dear all,
>>
>> I have a problem finding nerly exact overlaps between read pairs and
>> genomic regions. I try this by using GAlignmentPairs and GenomicRanges:
>>
>>> library(Rsamtools) ex1_file <- system.file("extdata", "ex1.bam",
>>> package="Rsamtools") galp <- readGAlignmentPairs(ex1_file,
>> use.names=TRUE)
>>> galp[3:4]
>>
>> GAlignmentPairs with 2 alignment pairs and 0 metadata columns:
>>                         seqnames strand : ranges -- ranges
>>                            <Rle> <Rle> : <IRanges> -- <IRanges>
>> EAS54_65:3:321:311:983 seq1 + : [51, 85] -- [228, 262]
>>     B7_591:5:42:540:501 seq1 + : [60, 95] -- [224, 259]
>> ---
>> seqlengths:
>>   seq1 seq2 1575 1584
>>
>>
>>> ranges <- GRanges(c("seq1", "seq1"), IRanges(c(50, 57), end = c(263,
>> 260)))
>>> ranges
>> GRanges with 2 ranges and 0 metadata columns:
>>        seqnames    ranges strand
>>           <Rle> <IRanges>  <Rle>
>>    [1]     seq1 [50, 263]      *
>>    [2]     seq1 [57, 260]      *
>>    ---
>>    seqlengths:
>>     seq1
>>       NA
>>> summary <-summarizeOverlaps(ranges, galp[3:4], mode =
>> "IntersectionStrict",
>> +                             inter.feature = FALSE, fragments = FALSE)
>> Warning message:
>> In if (all(strand(reads) == "*")) ignore.strand <- TRUE :
>>    the condition has length > 1 and only the first element will be used
>>> assays(summary)$counts
>>       reads
>> [1,]     2
>> [2,]     1
>>
>>
>> I am not entirely sure what the warning means, but my point is a
>> different one: I use the "IntersectStrict" mode because I only want to
>> count read
>> pairs that are completely within the interval. This is what I
>> get, but I want to use a more strict selection for "almost equal",
>> i.e. intervals that are the same size or only slightly larger (say 3 bp)
>> than the range of the read pairs. So ranges[1] should count galp[3], but
>> not galp[4]
>>
>> Any hint will be appreciated....
>>
>> Best wishes,
>>
>> Georg
>>
>>> sessionInfo()
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>   [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>>   [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>>   [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>>   [7] LC_PAPER=C                 LC_NAME=C
>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>> [8] base
>>
>> other attached packages:
>> [1] Rsamtools_1.14.3     Biostrings_2.30.1    GenomicRanges_1.14.4
>> [4] XVector_0.2.0        IRanges_1.20.7       BiocGenerics_0.8.0
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-6   compiler_3.0.1 stats4_3.0.1   tools_3.0.1
>> zlibbioc_1.8.0
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>


-- 
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fhcrc.org
Phone: (206) 667-3158



More information about the Bioconductor mailing list