[BioC] [GenomicRanges] subsetByOverlaps to keep info from both GRanges objects?

Tue Aug 20 18:57:38 CEST 2013

Hi Enrico,

Here is a toy example of two GRanges, one with snp locations and the 
other with dnase.

gr1 <- GRanges("chr1", IRanges(1:5, width=5,
                names=paste0("rsid:", letters[1:5])), score=1:5)
 > gr1
GRanges with 5 ranges and 1 metadata column:
          seqnames    ranges strand |     score
             <Rle> <IRanges>  <Rle> | <integer>
   rsid:a     chr1    [1, 5]      * |         1
   rsid:b     chr1    [2, 6]      * |         2
   rsid:c     chr1    [3, 7]      * |         3
   rsid:d     chr1    [4, 8]      * |         4
   rsid:e     chr1    [5, 9]      * |         5
   ---
   seqlengths:
    chr1
      NA

gr2 <- GRanges("chr1", IRanges(8:12, width=5,
                names=paste0("dnase:", letters[1:5])), score=10:14)
 > gr2
GRanges with 5 ranges and 1 metadata column:
           seqnames    ranges strand |     score
              <Rle> <IRanges>  <Rle> | <integer>
   dnase:a     chr1  [ 8, 12]      * |        10
   dnase:b     chr1  [ 9, 13]      * |        11
   dnase:c     chr1  [10, 14]      * |        12
   dnase:d     chr1  [11, 15]      * |        13
   dnase:e     chr1  [12, 16]      * |        14
   ---
   seqlengths:
    chr1
      NA

ranges <- subsetByOverlaps(gr2, gr1)
 > ranges
GRanges with 2 ranges and 1 metadata column:
           seqnames    ranges strand |     score
              <Rle> <IRanges>  <Rle> | <integer>
   dnase:a     chr1   [8, 12]      * |        10
   dnase:b     chr1   [9, 13]      * |        11
   ---
   seqlengths:
    chr1
      NA

The function called by subsetByOverlaps is findOverlaps (described on 
same man page as ?subsetByOverlaps). The output of findOverlaps is a 
'Hits' object indicating which of the query and subject overlap.

hits <- findOverlaps(gr2, gr1)
 > hits
Hits of length 3
queryLength: 5
subjectLength: 5
   queryHits subjectHits
    <integer>   <integer>
  1         1           4
  2         1           5
  3         2           5

We want meta information from gr1 (snps). In the call to findOverlaps 
gr1 was the subject so we use the 'subjectHits' indices to subset the 
snp GRanges.

idx <- unique(subjectHits(hits))
values <- DataFrame(rsid=names(gr1)[idx], snpscore=gr1$score[idx])

Add the metadata to the dnase ranges.

mcols(ranges) <- c(mcols(ranges), values)
 > ranges
GRanges with 2 ranges and 3 metadata columns:
           seqnames    ranges strand |     score        rsid  snpscore
              <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
   dnase:a     chr1   [8, 12]      * |        10      rsid:d         4
   dnase:b     chr1   [9, 13]      * |        11      rsid:e         5
   ---
   seqlengths:
    chr1
      NA

Valerie

On 08/20/2013 03:51 AM, Enrico Ferrero wrote:
> Hi,
>
> I have two GRanges objects, the first one is a list of SNPs, the
> second one are DNase hypersensitivity sites:
>
> ##########
> library(GenomicRanges)
> ...
>
>> snp
> GRanges with 192 ranges and 1 metadata column:
>               seqnames                 ranges strand   |     score
>                  <Rle>              <IRanges>  <Rle>   | <integer>
>      rs000001     chr1 [ 37967779,  37967780]      +   |         0
>      rs000002     chr1 [165967416, 165967417]      -   |         0
>      rs000003     chr1 [218860069, 218860070]      -   |         0
>     rs000004     chr1 [ 17306673,  17306674]      -   |         0
>     rs000005     chr1 [ 41293414,  41293415]      +   |         0
>           ...      ...                    ...    ... ...       ...
>     rs000188     chr8 [ 97522507,  97522508]      -   |         0
>     rs000189     chr8 [ 15532582,  15532583]      +   |         0
>     rs000190     chr8 [ 72270031,  72270032]      +   |         0
>    rs000191     chr9 [126511086, 126511087]      +   |         0
>    rs000192     chr9 [ 98231008,  98231009]      +   |         0
>    ---
>    seqlengths:
>      chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 ... chr21
> chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9
>        NA    NA    NA    NA    NA    NA    NA    NA    NA ...    NA
> NA    NA    NA    NA    NA    NA    NA    NA
>
>> dnase
> GRanges with 145038 ranges and 1 metadata column:
>             seqnames                 ranges strand   |     score
>                <Rle>              <IRanges>  <Rle>   | <integer>
>         [1]     chr1       [ 10120,  10270]      *   |         0
>         [2]     chr1       [237700, 237850]      *   |         0
>         [3]     chr1       [521440, 521590]      *   |         0
>         [4]     chr1       [565560, 565710]      *   |         0
>         [5]     chr1       [565860, 566010]      *   |         0
>         ...      ...                    ...    ... ...       ...
>    [145034]     chrX [154543640, 154543790]      *   |         0
>    [145035]     chrX [154560420, 154560570]      *   |         0
>    [145036]     chrX [154563960, 154564110]      *   |         0
>    [145037]     chrX [154842100, 154842250]      *   |         0
>    [145038]     chrX [154862200, 154862350]      *   |         0
>
> ---
> seqlengths:
>    chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
> chr2 chr20 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrX
>   chrY
>      NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
> NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>   NA
> ##########
>
> I can use subsetByOverlaps() in both directions to compute the overlap
> between them and return a GRanges object:
>
> ##########
>> subsetByOverlaps(dnase, snp)
> GRanges with 5 ranges and 1 metadata column:
>        seqnames                 ranges strand |     score
>           <Rle>              <IRanges>  <Rle> | <integer>
>    [1]     chr1 [ 17306560,  17306710]      * |         0
>    [2]     chr2 [169869820, 169869970]      * |         0
>    [3]     chr4 [145506440, 145506590]      * |         0
>    [4]     chr5 [ 15014080,  15014230]      * |         0
>    [5]     chr5 [ 15117400,  15117550]      * |         0
>    ---
>    seqlengths:
>      chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
> chr2 chr20 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrX
>   chrY
>        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>    NA
>
>> subsetByOverlaps(snp, dnase)
> GRanges with 6 ranges and 1 metadata column:
>               seqnames                 ranges strand |     score
>                  <Rle>              <IRanges>  <Rle> | <integer>
>     rs2235746     chr1 [ 17306671,  17306672]      - |         0
>     rs4157777     chr2 [169869904, 169869904]      - |         0
>     rs6858330     chr4 [145506558, 145506559]      + |         0
>    rs13146741     chr4 [145506453, 145506454]      + |         0
>       rs32847     chr5 [ 15117438,  15117439]      + |         0
>     rs7341842     chr5 [ 15014184,  15014185]      + |         0
>    ---
>    seqlengths:
>      chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
> chr2 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9
>        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
> ##########
>
> The first GRanges objects stores the DNase genomic locations
> overlapping with the SNPs, while the second one contains the SNPs IDs
> (as GRanges names) and genomic locations overlapping with the DNase
> dataset.
>
> Now, what I actually need is a GRanges object that stores the SNPs IDs
> and the DNase genomic locations. Is this possible?
>
> Thank you.
> Best,
>
>