[BioC] nearly exact overlaps between GAlignmentPairs and GenomicRanges

Georg Otto georg.otto at imm.ox.ac.uk
Thu Aug 21 21:59:45 CEST 2014


Dear all,

I have a problem finding nerly exact overlaps between read pairs and
genomic regions. I try this by using GAlignmentPairs and GenomicRanges:

> library(Rsamtools) ex1_file <- system.file("extdata", "ex1.bam",
> package="Rsamtools") galp <- readGAlignmentPairs(ex1_file, use.names=TRUE)
> galp[3:4]

GAlignmentPairs with 2 alignment pairs and 0 metadata columns:
                       seqnames strand : ranges -- ranges
                          <Rle> <Rle> : <IRanges> -- <IRanges>
EAS54_65:3:321:311:983 seq1 + : [51, 85] -- [228, 262]
   B7_591:5:42:540:501 seq1 + : [60, 95] -- [224, 259]
---
seqlengths:
 seq1 seq2 1575 1584


> ranges <- GRanges(c("seq1", "seq1"), IRanges(c(50, 57), end = c(263, 260)))
> ranges
GRanges with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     seq1 [50, 263]      *
  [2]     seq1 [57, 260]      *
  ---
  seqlengths:
   seq1
     NA
> summary <-summarizeOverlaps(ranges, galp[3:4], mode = "IntersectionStrict",
+                             inter.feature = FALSE, fragments = FALSE)
Warning message:
In if (all(strand(reads) == "*")) ignore.strand <- TRUE :
  the condition has length > 1 and only the first element will be used
> assays(summary)$counts
     reads
[1,]     2
[2,]     1


I am not entirely sure what the warning means, but my point is a
different one: I use the "IntersectStrict" mode because I only want to count read
pairs that are completely within the interval. This is what I
get, but I want to use a more strict selection for "almost equal",
i.e. intervals that are the same size or only slightly larger (say 3 bp)
than the range of the read pairs. So ranges[1] should count galp[3], but
not galp[4]

Any hint will be appreciated....

Best wishes,

Georg

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] Rsamtools_1.14.3     Biostrings_2.30.1    GenomicRanges_1.14.4
[4] XVector_0.2.0        IRanges_1.20.7       BiocGenerics_0.8.0  

loaded via a namespace (and not attached):
[1] bitops_1.0-6   compiler_3.0.1 stats4_3.0.1   tools_3.0.1    zlibbioc_1.8.0



More information about the Bioconductor mailing list