[BioC] [GenomicRanges] subsetByOverlaps to keep info from both GRanges objects?

Wed Aug 21 13:10:33 CEST 2013

Hi Valerie,

Thanks very much for your reply.
You propose an interesting approach, which unfortunately doesn't work
for my specific case.

In short, the problem is that multiple SNPs (single nucleotides) can
map  to a single DNase HS site (often very large regions).
I have edited your example a bit so that it's easier to reproduce the issue:

##########
> gr1 <- GRanges("chr1", IRanges(c(5, 10, 15), c(5, 10, 15), names=paste0("rsid:", letters[1:3])), score=rep(0,3))
> gr1
GRanges with 3 ranges and 1 metadata column:
         seqnames    ranges strand |     score
            <Rle> <IRanges>  <Rle> | <numeric>
  rsid:a     chr1  [ 5,  5]      * |         0
  rsid:b     chr1  [10, 10]      * |         0
  rsid:c     chr1  [15, 15]      * |         0
  ---
  seqlengths:
   chr1
     NA

> gr2 <- GRanges("chr1", IRanges(c(4,6,45), c(8,20,50), names=paste0("dnase:", letters[1:3])), score=rep(0,3))
> gr2
GRanges with 3 ranges and 1 metadata column:
          seqnames    ranges strand |     score
             <Rle> <IRanges>  <Rle> | <numeric>
  dnase:a     chr1  [ 4,  8]      * |         0
  dnase:b     chr1  [ 6, 20]      * |         0
  dnase:c     chr1  [45, 50]      * |         0
  ---
  seqlengths:
   chr1
     NA

> ranges <- subsetByOverlaps(gr2,gr1)
> ranges
GRanges with 2 ranges and 1 metadata column:
          seqnames    ranges strand |     score
             <Rle> <IRanges>  <Rle> | <numeric>
  dnase:a     chr1   [4,  8]      * |         0
  dnase:b     chr1   [6, 20]      * |         0
  ---
  seqlengths:
   chr1
     NA

> revRanges <- subsetByOverlaps(gr1,gr2)
> revRanges
GRanges with 3 ranges and 1 metadata column:
         seqnames    ranges strand |     score
            <Rle> <IRanges>  <Rle> | <numeric>
  rsid:a     chr1  [ 5,  5]      * |         0
  rsid:b     chr1  [10, 10]      * |         0
  rsid:c     chr1  [15, 15]      * |         0
  ---
  seqlengths:
   chr1
     NA

> hits <- findOverlaps(gr2, gr1)
> idx <- unique(subjectHits(hits))
> values <- DataFrame(rsid=names(gr1)[idx])
> mcols(ranges) <- c(mcols(ranges), values)
> ranges
GRanges with 2 ranges and 2 metadata columns:
Error in (function (..., row.names = NULL, check.rows = FALSE,
check.names = TRUE,  :
  arguments imply differing number of rows: 2, 3
##########

As you can see, rsid:a is in the dnase:a region, but both rsid:b and
rsid:c hit the dnase:b region. As a result, when you try to add the
SNPs IDs from gr1 as metacolumns of ranges, you create a GRanges
object with different number of rows.

Is there any workaround for this?

At the moment I'm creating two GRanges objects for each comparison,
one keeping the SNPs IDs and coordinates, and the other storing the
DNase regions hit by the SNPs, but this is less than ideal and
produces a lot of unnecessary output (R objects and exported BED
files). Ideally, I'd like to have a GRanges object where each row is a
DNase HS site hit by a specific SNP.

Thank you.
Best,

On 20 August 2013 17:57, Valerie Obenchain <vobencha at fhcrc.org> wrote:
> Hi Enrico,
>
> Here is a toy example of two GRanges, one with snp locations and the other
> with dnase.
>
> gr1 <- GRanges("chr1", IRanges(1:5, width=5,
>                names=paste0("rsid:", letters[1:5])), score=1:5)
>> gr1
>
> GRanges with 5 ranges and 1 metadata column:
>          seqnames    ranges strand |     score
>             <Rle> <IRanges>  <Rle> | <integer>
>   rsid:a     chr1    [1, 5]      * |         1
>   rsid:b     chr1    [2, 6]      * |         2
>   rsid:c     chr1    [3, 7]      * |         3
>   rsid:d     chr1    [4, 8]      * |         4
>   rsid:e     chr1    [5, 9]      * |         5
>   ---
>   seqlengths:
>    chr1
>      NA
>
> gr2 <- GRanges("chr1", IRanges(8:12, width=5,
>                names=paste0("dnase:", letters[1:5])), score=10:14)
>> gr2
>
> GRanges with 5 ranges and 1 metadata column:
>           seqnames    ranges strand |     score
>              <Rle> <IRanges>  <Rle> | <integer>
>   dnase:a     chr1  [ 8, 12]      * |        10
>   dnase:b     chr1  [ 9, 13]      * |        11
>   dnase:c     chr1  [10, 14]      * |        12
>   dnase:d     chr1  [11, 15]      * |        13
>   dnase:e     chr1  [12, 16]      * |        14
>   ---
>   seqlengths:
>    chr1
>      NA
>
>
> ranges <- subsetByOverlaps(gr2, gr1)
>> ranges
> GRanges with 2 ranges and 1 metadata column:
>
>           seqnames    ranges strand |     score
>              <Rle> <IRanges>  <Rle> | <integer>
>   dnase:a     chr1   [8, 12]      * |        10
>   dnase:b     chr1   [9, 13]      * |        11
>   ---
>   seqlengths:
>    chr1
>      NA
>
> The function called by subsetByOverlaps is findOverlaps (described on same
> man page as ?subsetByOverlaps). The output of findOverlaps is a 'Hits'
> object indicating which of the query and subject overlap.
>
> hits <- findOverlaps(gr2, gr1)
>> hits
> Hits of length 3
> queryLength: 5
> subjectLength: 5
>   queryHits subjectHits
>    <integer>   <integer>
>  1         1           4
>  2         1           5
>  3         2           5
>
> We want meta information from gr1 (snps). In the call to findOverlaps gr1
> was the subject so we use the 'subjectHits' indices to subset the snp
> GRanges.
>
> idx <- unique(subjectHits(hits))
> values <- DataFrame(rsid=names(gr1)[idx], snpscore=gr1$score[idx])
>
> Add the metadata to the dnase ranges.
>
> mcols(ranges) <- c(mcols(ranges), values)
>> ranges
> GRanges with 2 ranges and 3 metadata columns:
>           seqnames    ranges strand |     score        rsid  snpscore
>              <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
>   dnase:a     chr1   [8, 12]      * |        10      rsid:d         4
>   dnase:b     chr1   [9, 13]      * |        11      rsid:e         5
>   ---
>   seqlengths:
>    chr1
>      NA
>
>
> Valerie
>
>
> On 08/20/2013 03:51 AM, Enrico Ferrero wrote:
>>
>> Hi,
>>
>> I have two GRanges objects, the first one is a list of SNPs, the
>> second one are DNase hypersensitivity sites:
>>
>> ##########
>> library(GenomicRanges)
>> ...
>>
>>> snp
>>
>> GRanges with 192 ranges and 1 metadata column:
>>               seqnames                 ranges strand   |     score
>>                  <Rle>              <IRanges>  <Rle>   | <integer>
>>      rs000001     chr1 [ 37967779,  37967780]      +   |         0
>>      rs000002     chr1 [165967416, 165967417]      -   |         0
>>      rs000003     chr1 [218860069, 218860070]      -   |         0
>>     rs000004     chr1 [ 17306673,  17306674]      -   |         0
>>     rs000005     chr1 [ 41293414,  41293415]      +   |         0
>>           ...      ...                    ...    ... ...       ...
>>     rs000188     chr8 [ 97522507,  97522508]      -   |         0
>>     rs000189     chr8 [ 15532582,  15532583]      +   |         0
>>     rs000190     chr8 [ 72270031,  72270032]      +   |         0
>>    rs000191     chr9 [126511086, 126511087]      +   |         0
>>    rs000192     chr9 [ 98231008,  98231009]      +   |         0
>>    ---
>>    seqlengths:
>>      chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 ... chr21
>> chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9
>>        NA    NA    NA    NA    NA    NA    NA    NA    NA ...    NA
>> NA    NA    NA    NA    NA    NA    NA    NA
>>
>>> dnase
>>
>> GRanges with 145038 ranges and 1 metadata column:
>>             seqnames                 ranges strand   |     score
>>                <Rle>              <IRanges>  <Rle>   | <integer>
>>         [1]     chr1       [ 10120,  10270]      *   |         0
>>         [2]     chr1       [237700, 237850]      *   |         0
>>         [3]     chr1       [521440, 521590]      *   |         0
>>         [4]     chr1       [565560, 565710]      *   |         0
>>         [5]     chr1       [565860, 566010]      *   |         0
>>         ...      ...                    ...    ... ...       ...
>>    [145034]     chrX [154543640, 154543790]      *   |         0
>>    [145035]     chrX [154560420, 154560570]      *   |         0
>>    [145036]     chrX [154563960, 154564110]      *   |         0
>>    [145037]     chrX [154842100, 154842250]      *   |         0
>>    [145038]     chrX [154862200, 154862350]      *   |         0
>>
>> ---
>> seqlengths:
>>    chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>> chr2 chr20 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrX
>>   chrY
>>      NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>> NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>   NA
>> ##########
>>
>> I can use subsetByOverlaps() in both directions to compute the overlap
>> between them and return a GRanges object:
>>
>> ##########
>>>
>>> subsetByOverlaps(dnase, snp)
>>
>> GRanges with 5 ranges and 1 metadata column:
>>        seqnames                 ranges strand |     score
>>           <Rle>              <IRanges>  <Rle> | <integer>
>>    [1]     chr1 [ 17306560,  17306710]      * |         0
>>    [2]     chr2 [169869820, 169869970]      * |         0
>>    [3]     chr4 [145506440, 145506590]      * |         0
>>    [4]     chr5 [ 15014080,  15014230]      * |         0
>>    [5]     chr5 [ 15117400,  15117550]      * |         0
>>    ---
>>    seqlengths:
>>      chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>> chr2 chr20 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrX
>>   chrY
>>        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>    NA
>>
>>> subsetByOverlaps(snp, dnase)
>>
>> GRanges with 6 ranges and 1 metadata column:
>>               seqnames                 ranges strand |     score
>>                  <Rle>              <IRanges>  <Rle> | <integer>
>>     rs2235746     chr1 [ 17306671,  17306672]      - |         0
>>     rs4157777     chr2 [169869904, 169869904]      - |         0
>>     rs6858330     chr4 [145506558, 145506559]      + |         0
>>    rs13146741     chr4 [145506453, 145506454]      + |         0
>>       rs32847     chr5 [ 15117438,  15117439]      + |         0
>>     rs7341842     chr5 [ 15014184,  15014185]      + |         0
>>    ---
>>    seqlengths:
>>      chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
>> chr2 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9
>>        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>>   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>> ##########
>>
>> The first GRanges objects stores the DNase genomic locations
>> overlapping with the SNPs, while the second one contains the SNPs IDs
>> (as GRanges names) and genomic locations overlapping with the DNase
>> dataset.
>>
>> Now, what I actually need is a GRanges object that stores the SNPs IDs
>> and the DNase genomic locations. Is this possible?
>>
>> Thank you.
>> Best,
>>
>>
>

-- 
Enrico Ferrero
Department of Genetics
Cambridge Systems Biology Centre
University of Cambridge