[BioC] overlap regions between two GRanges (or GRangesList)

Fri Mar 14 23:48:43 CET 2014

On 03/14/2014 03:31 PM, Zhao, Shanrong [JRDUS] wrote:
> Thank you. subsetByOverlaps is not what I want.  I just want to keep the
> ranges in 'g' that overlaps with g2, but only keep *"overlapping"
> regions* (instead of the original ranges in 'g1').

Thanks for clarifying. But that still doesn't explain why you want
to *completely* get rid of the 1st range in 'g' (the range on chr1).

Note that in the general case, when you replace the original range with
the overlapping region, you might need more than 1 range for that.
That's because the original range in 'g' can overlap with more than 1
range in 'g2'. Assuming any given range in 'g' overlaps with at most
1 range in 'g2':

   > ov <- findOverlaps(g, g2)
   > pintersect(g[queryHits(ov)], g2[subjectHits(ov)])
   GRanges with 3 ranges and 0 metadata columns:
         seqnames    ranges strand
            <Rle> <IRanges>  <Rle>
     [1]     chr1   [1,  8]      -
     [2]     chr2   [2, 15]      +
     [3]     chr2   [3, 20]      +
     ---
     seqlengths:
      chr1 chr2 chr3
        NA   NA   NA

Now you would need to explain why you don't want to see the range
on chr1 in that result...

Thanks,
H.

>
> Thanks again,
>
> Shanrong
>
> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Friday, March 14, 2014 3:19 PM
> To: Zhao, Shanrong [JRDUS]
> Cc: bioconductor at r-project.org
> Subject: Re: overlap regions between two GRanges (or GRangesList)
>
> Hi Zhao,
>
> I'm going to try to rephrase what you want to do: seems like you want to
> keep the ranges in 'g' that have an overlap with any of the ranges in 'g2':
>
>     > g
>
>     GRanges with 4 ranges and 0 metadata columns:
>
>           seqnames    ranges strand
>
>              <Rle> <IRanges>  <Rle>
>
>       [1]     chr1  [ 1, 12]      -
>
>       [2]     chr2  [ 2, 15]      +
>
>       [3]     chr2  [ 3, 20]      +
>
>       [4]     chr3  [10, 14]      -
>
>       ---
>
>       seqlengths:
>
>        chr1 chr2 chr3
>
>          NA   NA   NA
>
>     > g2
>
>     GRanges with 2 ranges and 0 metadata columns:
>
>           seqnames    ranges strand
>
>              <Rle> <IRanges>  <Rle>
>
>       [1]     chr1   [1,  8]      -
>
>       [2]     chr2   [2, 30]      +
>
>       ---
>
>       seqlengths:
>
>        chr1 chr2
>
>          NA   NA
>
>     > subsetByOverlaps(g, g2)
>
>     GRanges with 3 ranges and 0 metadata columns:
>
>           seqnames    ranges strand
>
>              <Rle> <IRanges>  <Rle>
>
>       [1]     chr1   [1, 12]      -
>
>       [2]     chr2   [2, 15]      +
>
>       [3]     chr2   [3, 20]      +
>
>       ---
>
>       seqlengths:
>
>        chr1 chr2 chr3
>
>          NA   NA   NA
>
> However, in the desired output you show below, you omitted the 1st range
> (the range on chr1). So it's not clear to me what you are really after.
> Did you omit it because it's not *within* the overlapping range in 'g2'?
> If that's the case, then:
>
>     > subsetByOverlaps(g, g2, type="within")
>
>     GRanges with 2 ranges and 0 metadata columns:
>
>           seqnames    ranges strand
>
>              <Rle> <IRanges>  <Rle>
>
>       [1]     chr2   [2, 15]      +
>
>       [2]     chr2   [3, 20]      +
>
>       ---
>
>       seqlengths:
>
>        chr1 chr2 chr3
>
>          NA   NA   NA
>
> HTH,
>
> H.
>
> On 03/13/2014 10:12 PM, Zhao, Shanrong [JRDUS] wrote:
>
>  > Seems intersect can do what I want, but not quietly exact.
>
>  >
>
>  > I would like to output two records (for b,c) below instead of unite
>
>  > them into  a single record [2, 20] when call *intersect(g,g2)**.*
>
>  >
>
>  >                  [2, 15]
>
>  >
>
>  >                  [3, 20]
>
>  >
>
>  > Thanks,
>
>  >
>
>  > Shanrong
>
>  >
>
>  > P.S
>
>  >
>
>  >> g
>
>  >
>
>  > GRanges with 4 ranges and 2 metadata columns:
>
>  >
>
>  >      seqnames    ranges strand |     score                GC
>
>  >
>
>  >         <Rle> <IRanges>  <Rle> | <integer>         <numeric>
>
>  >
>
>  >    a     chr1  [ 1, 12]      - |         1                 1
>
>  >
>
>  > b     chr2  [ 2, 15]      + |         2 0.888888888888889
>
>  >
>
>  >    c     chr2  [ 3, 20]      + |         3 0.777777777777778
>
>  >
>
>  >    j     chr3  [10, 14]      - |        10                 0
>
>  >
>
>  >    ---
>
>  >
>
>  >    seqlengths:
>
>  >
>
>  >     chr1 chr2 chr3
>
>  >
>
>  >       NA   NA   NA
>
>  >
>
>  >> g2
>
>  >
>
>  > GRanges with 2 ranges and 2 metadata columns:
>
>  >
>
>  >      seqnames    ranges strand |     score                GC
>
>  >
>
>  >         <Rle> <IRanges>  <Rle> | <integer>         <numeric>
>
>  >
>
>  >    a     chr1   [1,  8]      - |         1                 1
>
>  >
>
>  > *  b     chr2   [2, 30]      + |         2 0.888888888888889*
>
>  >
>
>  >    ---
>
>  >
>
>  >    seqlengths:
>
>  >
>
>  >     chr1 chr2 chr3
>
>  >
>
>  >       NA   NA   NA
>
>  >
>
>  >> intersect(g,g2)
>
>  >
>
>  > GRanges with 2 ranges and 0 metadata columns:
>
>  >
>
>  >        seqnames    ranges strand
>
>  >
>
>  >           <Rle> <IRanges>  <Rle>
>
>  >
>
>  >    [1]     chr1   [1,  8]      -
>
>  >
>
>  > [2]     chr2   [*2, 20*]      +
>
>  >
>
>  >    ---
>
>  >
>
>  >    seqlengths:
>
>  >
>
>  >     chr1 chr2 chr3
>
>  >
>
>  >       NA   NA   NA
>
>  >
>
>  > *From:*Zhao, Shanrong [JRDUS]
>
>  > *Sent:* Thursday, March 13, 2014 9:40 PM
>
>  > *To:* 'hpages at fhcrc.org'
>
>  > *Subject:* overlap regions between two GRanges (or GRangesList)
>
>  >
>
>  > Dear Dr. Pages,
>
>  >
>
>  > I am exploring Bioconductor packages to analyze DNA methylation data.
>
>  > One question I don’t know how to solve it. I have two GRanges (OR
>
>  > GRangesList),  Now I want to identify the common (*overlapping)
>
>  > regions*, not just overlapping or not--- *gc <- overlapRegions(ga,gb)*
>
>  >
>
>  > The other question I have: I am interested in all *cytosines* in
>
>  > promoters regions, I have already had promoter in GRange object. What is
>
>  > the most efficient way to identify the total number of Cs?   I plan to
>
>  > extract all DNA sequences corresponding to promoters, and then call
>
>  > letterFrequency (by set letters=”C”).
>
>  >
>
>  > Best regards,
>
>  >
>
>  > Shanrong
>
>  >
>
> --
>
> Hervé Pagès
>
> Program in Computational Biology
>
> Division of Public Health Sciences
>
> Fred Hutchinson Cancer Research Center
>
> 1100 Fairview Ave. N, M1-B514
>
> P.O. Box 19024
>
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>
> Phone:  (206) 667-5791
>
> Fax:    (206) 667-1319
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319