[BioC] overlap regions between two GRanges (or GRangesList)

Hervé Pagès hpages at fhcrc.org
Sat Mar 15 00:48:22 CET 2014


Please use the "Reply All" button so this discussion remains on the
mailing list.

On 03/14/2014 04:04 PM, Zhao, Shanrong [JRDUS] wrote:
> The range in chr1 should be kept.
>> intersect(g,g2)
> GRanges with 2 ranges and 0 metadata columns:
>        seqnames    ranges strand
>           <Rle> <IRanges>  <Rle>
>    [1]     chr1   [1,  8]      -
>    [2]     chr2   [2, 20]      +
>    ---
>    seqlengths:
>     chr1 chr2 chr3
>       NA   NA   NA
>
> My real question boils down to "chr2 [2,20]". Instead of reducing them, I would rather output "chr2 [2,15]" and "chr2 [3,20]".

The solution I sent you earlier (findOverlaps + pintersect) keeps
the range on chr1 and outputs "chr2 [2,15]" and "chr2 [3,20]".

>
> You bring up a very good point:  the original range might  overlap with more than 1 range.  This question seems more complicated what I thought. But for my practical situation, I cannot find a better solution (better than intersect function can offer)

Hard to provide more suggestion from here without knowing
why the findOverlaps + pintersect solution doesn't work
for you.

I'll respond to your other off-list question about counting the
total number of Cs in your promoters in a separate email.

Cheers,
H.


>
> Thanks,
> Shanrong
>
> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Friday, March 14, 2014 3:49 PM
> To: Zhao, Shanrong [JRDUS]
> Cc: bioconductor at r-project.org
> Subject: Re: overlap regions between two GRanges (or GRangesList)
>
> On 03/14/2014 03:31 PM, Zhao, Shanrong [JRDUS] wrote:
>> Thank you. subsetByOverlaps is not what I want.  I just want to keep
>> the ranges in 'g' that overlaps with g2, but only keep *"overlapping"
>> regions* (instead of the original ranges in 'g1').
>
> Thanks for clarifying. But that still doesn't explain why you want to *completely* get rid of the 1st range in 'g' (the range on chr1).
>
> Note that in the general case, when you replace the original range with the overlapping region, you might need more than 1 range for that.
> That's because the original range in 'g' can overlap with more than 1 range in 'g2'. Assuming any given range in 'g' overlaps with at most
> 1 range in 'g2':
>
>     > ov <- findOverlaps(g, g2)
>     > pintersect(g[queryHits(ov)], g2[subjectHits(ov)])
>     GRanges with 3 ranges and 0 metadata columns:
>           seqnames    ranges strand
>              <Rle> <IRanges>  <Rle>
>       [1]     chr1   [1,  8]      -
>       [2]     chr2   [2, 15]      +
>       [3]     chr2   [3, 20]      +
>       ---
>       seqlengths:
>        chr1 chr2 chr3
>          NA   NA   NA
>
> Now you would need to explain why you don't want to see the range on chr1 in that result...
>
> Thanks,
> H.
>
>>
>> Thanks again,
>>
>> Shanrong
>>
>> -----Original Message-----
>> From: Hervé Pagès [mailto:hpages at fhcrc.org]
>> Sent: Friday, March 14, 2014 3:19 PM
>> To: Zhao, Shanrong [JRDUS]
>> Cc: bioconductor at r-project.org
>> Subject: Re: overlap regions between two GRanges (or GRangesList)
>>
>> Hi Zhao,
>>
>> I'm going to try to rephrase what you want to do: seems like you want
>> to keep the ranges in 'g' that have an overlap with any of the ranges in 'g2':
>>
>>      > g
>>
>>      GRanges with 4 ranges and 0 metadata columns:
>>
>>            seqnames    ranges strand
>>
>>               <Rle> <IRanges>  <Rle>
>>
>>        [1]     chr1  [ 1, 12]      -
>>
>>        [2]     chr2  [ 2, 15]      +
>>
>>        [3]     chr2  [ 3, 20]      +
>>
>>        [4]     chr3  [10, 14]      -
>>
>>        ---
>>
>>        seqlengths:
>>
>>         chr1 chr2 chr3
>>
>>           NA   NA   NA
>>
>>      > g2
>>
>>      GRanges with 2 ranges and 0 metadata columns:
>>
>>            seqnames    ranges strand
>>
>>               <Rle> <IRanges>  <Rle>
>>
>>        [1]     chr1   [1,  8]      -
>>
>>        [2]     chr2   [2, 30]      +
>>
>>        ---
>>
>>        seqlengths:
>>
>>         chr1 chr2
>>
>>           NA   NA
>>
>>      > subsetByOverlaps(g, g2)
>>
>>      GRanges with 3 ranges and 0 metadata columns:
>>
>>            seqnames    ranges strand
>>
>>               <Rle> <IRanges>  <Rle>
>>
>>        [1]     chr1   [1, 12]      -
>>
>>        [2]     chr2   [2, 15]      +
>>
>>        [3]     chr2   [3, 20]      +
>>
>>        ---
>>
>>        seqlengths:
>>
>>         chr1 chr2 chr3
>>
>>           NA   NA   NA
>>
>> However, in the desired output you show below, you omitted the 1st
>> range (the range on chr1). So it's not clear to me what you are really after.
>> Did you omit it because it's not *within* the overlapping range in 'g2'?
>> If that's the case, then:
>>
>>      > subsetByOverlaps(g, g2, type="within")
>>
>>      GRanges with 2 ranges and 0 metadata columns:
>>
>>            seqnames    ranges strand
>>
>>               <Rle> <IRanges>  <Rle>
>>
>>        [1]     chr2   [2, 15]      +
>>
>>        [2]     chr2   [3, 20]      +
>>
>>        ---
>>
>>        seqlengths:
>>
>>         chr1 chr2 chr3
>>
>>           NA   NA   NA
>>
>> HTH,
>>
>> H.
>>
>> On 03/13/2014 10:12 PM, Zhao, Shanrong [JRDUS] wrote:
>>
>>   > Seems intersect can do what I want, but not quietly exact.
>>
>>   >
>>
>>   > I would like to output two records (for b,c) below instead of unite
>>
>>   > them into  a single record [2, 20] when call *intersect(g,g2)**.*
>>
>>   >
>>
>>   >                  [2, 15]
>>
>>   >
>>
>>   >                  [3, 20]
>>
>>   >
>>
>>   > Thanks,
>>
>>   >
>>
>>   > Shanrong
>>
>>   >
>>
>>   > P.S
>>
>>   >
>>
>>   >> g
>>
>>   >
>>
>>   > GRanges with 4 ranges and 2 metadata columns:
>>
>>   >
>>
>>   >      seqnames    ranges strand |     score                GC
>>
>>   >
>>
>>   >         <Rle> <IRanges>  <Rle> | <integer>         <numeric>
>>
>>   >
>>
>>   >    a     chr1  [ 1, 12]      - |         1                 1
>>
>>   >
>>
>>   > b     chr2  [ 2, 15]      + |         2 0.888888888888889
>>
>>   >
>>
>>   >    c     chr2  [ 3, 20]      + |         3 0.777777777777778
>>
>>   >
>>
>>   >    j     chr3  [10, 14]      - |        10                 0
>>
>>   >
>>
>>   >    ---
>>
>>   >
>>
>>   >    seqlengths:
>>
>>   >
>>
>>   >     chr1 chr2 chr3
>>
>>   >
>>
>>   >       NA   NA   NA
>>
>>   >
>>
>>   >> g2
>>
>>   >
>>
>>   > GRanges with 2 ranges and 2 metadata columns:
>>
>>   >
>>
>>   >      seqnames    ranges strand |     score                GC
>>
>>   >
>>
>>   >         <Rle> <IRanges>  <Rle> | <integer>         <numeric>
>>
>>   >
>>
>>   >    a     chr1   [1,  8]      - |         1                 1
>>
>>   >
>>
>>   > *  b     chr2   [2, 30]      + |         2 0.888888888888889*
>>
>>   >
>>
>>   >    ---
>>
>>   >
>>
>>   >    seqlengths:
>>
>>   >
>>
>>   >     chr1 chr2 chr3
>>
>>   >
>>
>>   >       NA   NA   NA
>>
>>   >
>>
>>   >> intersect(g,g2)
>>
>>   >
>>
>>   > GRanges with 2 ranges and 0 metadata columns:
>>
>>   >
>>
>>   >        seqnames    ranges strand
>>
>>   >
>>
>>   >           <Rle> <IRanges>  <Rle>
>>
>>   >
>>
>>   >    [1]     chr1   [1,  8]      -
>>
>>   >
>>
>>   > [2]     chr2   [*2, 20*]      +
>>
>>   >
>>
>>   >    ---
>>
>>   >
>>
>>   >    seqlengths:
>>
>>   >
>>
>>   >     chr1 chr2 chr3
>>
>>   >
>>
>>   >       NA   NA   NA
>>
>>   >
>>
>>   > *From:*Zhao, Shanrong [JRDUS]
>>
>>   > *Sent:* Thursday, March 13, 2014 9:40 PM
>>
>>   > *To:* 'hpages at fhcrc.org'
>>
>>   > *Subject:* overlap regions between two GRanges (or GRangesList)
>>
>>   >
>>
>>   > Dear Dr. Pages,
>>
>>   >
>>
>>   > I am exploring Bioconductor packages to analyze DNA methylation data.
>>
>>   > One question I don't know how to solve it. I have two GRanges (OR
>>
>>   > GRangesList),  Now I want to identify the common (*overlapping)
>>
>>   > regions*, not just overlapping or not--- *gc <-
>> overlapRegions(ga,gb)*
>>
>>   >
>>
>>   > The other question I have: I am interested in all *cytosines* in
>>
>>   > promoters regions, I have already had promoter in GRange object.
>> What is
>>
>>   > the most efficient way to identify the total number of Cs?   I plan to
>>
>>   > extract all DNA sequences corresponding to promoters, and then call
>>
>>   > letterFrequency (by set letters="C").
>>
>>   >
>>
>>   > Best regards,
>>
>>   >
>>
>>   > Shanrong
>>
>>   >
>>
>> --
>>
>> Hervé Pagès
>>
>> Program in Computational Biology
>>
>> Division of Public Health Sciences
>>
>> Fred Hutchinson Cancer Research Center
>>
>> 1100 Fairview Ave. N, M1-B514
>>
>> P.O. Box 19024
>>
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>
>> Phone:  (206) 667-5791
>>
>> Fax:    (206) 667-1319
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list