[BioC] Most efficient way to compute width of overlap of multiple features

Charles Berry ccberry at ucsd.edu
Tue Jan 7 18:30:48 CET 2014


Vince S. Buffalo <vsbuffalo at ...> writes:

> 
> Hi Bioconductor folks,
> 
> I'm trying to create some GRanges summaries, but I think I may be missing
> an obvious solution. I have fixed-width windows as a GRanges object, and
> for each window/row I'd like to add a metadata column that indicates how
> many base pairs of this window overlap features in another GRange object.
> I'll need to add a few columns for different features in different GRanges
> objects.
> 
> I've tried using the approach of findOverlaps, followed by ranges() to
> extract range widths. This creates an error: "'query' must be a Ranges of
> length equal to number of queries". I saw in the source
> that pintersect(query[queryHits(x)], subject[subjectHits(x)]) works too
> (and does without error). This returns the overlapping ranges, but it'd
> take a load of data munging to get it into the format I'd like — it seems
> like I may be overlooking an easier solution.
> 
> thanks,
> Vince
> 
>

pintersect() seems like the right approach.

Use coverage() to normalize the subject GRanges, then tapply to add the 
bases covered:

> set.seed(12345)
> gr <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE),
+ IRanges(start=sample(100000,100),width=1000))
> gr2 <- GRanges(seqnames=sample(c("a","b"),100,repl=TRUE),
+ IRanges(start=sample(100000,100),width=1000))
> cover2.gr <- subset( as( coverage(gr2)>0, "GRanges"), score)
> seqinfo(cover2.gr) <- seqinfo(gr2)
> hits <- findOverlaps(gr,cover2.gr)
> gr.over <- pintersect(gr[queryHits(hits)],cover2.gr[subjectHits(hits)])
> gr.counts <- tapply(gr.over,queryHits(hits),FUN=function(x) sum(width(x)))
> gr$overlap<- 0
> gr$overlap[as.numeric(names(gr.counts))]<- unname(gr.counts)
> gr
GRanges with 100 ranges and 1 metadata column:
        seqnames         ranges strand   |   overlap
           <Rle>      <IRanges>  <Rle>   | <numeric>
    [1]        b [29447, 30446]      *   |         0
    [2]        b [61725, 62724]      *   |       601
    [3]        b [97426, 98425]      *   |         0
    [4]        b [61820, 62819]      *   |       696
    [5]        a [52135, 53134]      *   |       441
    ...      ...            ...    ... ...       ...
   [96]        b [  693,  1692]      *   |         0
   [97]        b [27406, 28405]      *   |         0
   [98]        b [79728, 80727]      *   |       755
   [99]        a [24344, 25343]      *   |       353
  [100]        a [45782, 46781]      *   |         0
  ---
  seqlengths:
    a  b
   NA NA
> length(findOverlaps(subset(gr,overlap==0),gr2))==0 # check
[1] TRUE
> length(findOverlaps(subset(gr,overlap!=0),gr2))>0 # check
[1] TRUE
> 

HTH,

Chuck



More information about the Bioconductor mailing list