[BioC] Semantics of GenomicRanges gaps()

Hervé Pagès hpages at fhcrc.org
Mon Jun 3 08:36:00 CEST 2013


Hi Alex,

On 05/31/2013 01:46 AM, Alex Gutteridge wrote:
> I just have a quick question/comment about the behaviour of the gaps
> function in the GenomicRanges package, particularly with how it handles
> * strand ranges. The current behaviour is as below:
>
>> range = GRanges(seqnames="1",IRanges(start=300,end=700), strand="*",
>> seqinfo=Seqinfo("1",seqlengths=1000))
>> range
> GRanges with 1 range and 0 metadata columns:
>        seqnames     ranges strand
>           <Rle>  <IRanges>  <Rle>
>    [1]        1 [300, 700]      *
>    ---
>    seqlengths:
>        1
>     1000
>> gaps(range)
> GRanges with 4 ranges and 0 metadata columns:
>        seqnames      ranges strand
>           <Rle>   <IRanges>  <Rle>
>    [1]        1 [  1, 1000]      +
>    [2]        1 [  1, 1000]      -
>    [3]        1 [  1,  299]      *
>    [4]        1 [701, 1000]      *
>    ---
>    seqlengths:
>        1
>     1000
>
> My interpretation of "*" as a strand identifier is that it means the
> range covers both + and - strands and so intuitively I was expecting the
> 'gaps' to only cover the 3rd and 4th ranges returned above (not the
> full-length + and - strand ranges).

It's even worse than that. If there is no range at all on a chromosome,
gaps(gr) will return 3 ranges covering the full chromosome:

   > gr <- GRanges("chr1",
                   IRanges(start=300,end=700),
                   seqlengths=c(chr1=1000,chr2=1000))

   > gr
   GRanges with 1 range and 0 metadata columns:
         seqnames     ranges strand
            <Rle>  <IRanges>  <Rle>
     [1]     chr1 [300, 700]      *
     ---
     seqlengths:
      chr1  chr2
      1000 1000

   > gaps(gr)
   GRanges with 7 ranges and 0 metadata columns:
         seqnames      ranges strand
            <Rle>   <IRanges>  <Rle>
     [1]     chr1 [  1, 1000]      +
     [2]     chr1 [  1, 1000]      -
     [3]     chr1 [  1,  299]      *
     [4]     chr1 [701, 1000]      *
     [5]     chr2 [  1, 1000]      +
     [6]     chr2 [  1, 1000]      -
     [7]     chr2 [  1, 1000]      *
     ---
     seqlengths:
      chr1  chr2
      1000 1000

> The semantics here implies to me
> that the * strand is being thought of as a kind of imaginary third
> strand and hence doesn't overlap with the + or - strands. This is
> contrary to the semantics of findOverlaps which does appear to consider
> a * strand range to overlap with + or - strand ranges:
>
>> findOverlaps(range,gaps(range))
> Hits of length 2
> queryLength: 1
> subjectLength: 4
>    queryHits subjectHits
>     <integer>   <integer>
>   1         1           1
>   2         1           2
>
> To summarise I guess I was expecting findOverlaps(range,gaps(range)) to
> never return any overlaps under any circumstance (that I can think of!).

Let's call this property 2. This sounds indeed like a good property
that it would be nice to have. But an even more important property is
property 1: gaps() must be its own reverse operation i.e.
'gaps(gaps(gr))' must always return 'gr'.

The current behavior of gaps() guarantees property 1. I'm not against
changing gaps() behavior to also have property 2, as long as property
1 is preserved. Suggestions are welcome.

Thanks,
H.

> Do people agree the behaviour of gaps() is not quite intuitive in this
> case?
>
>> sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] GenomicRanges_1.10.7 IRanges_1.16.6       BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] parallel_2.15.2 stats4_2.15.2
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list