[BioC] GenomicRanges::nearest (especially follow)

Valerie Obenchain vobencha at fhcrc.org
Tue Oct 16 20:25:19 CEST 2012


On 10/16/2012 09:25 AM, Harris A. Jaffee wrote:
> Fantastic, thank you.  I'm not sure I knew about that page!
>
> For example, ?precede gives me the IRanges nearest-methods page,
> and only if I explicitly request ?'nearest-methods' can I choose
> the GenomicRanges page.  That brings up another issue: ?distance
> again sends me to the IRanges nearest-methods help, and that only
> documents distanceToNearest, which is not enough.  The spec for
> distance actually lives only on
> ?'precede,GenomicRanges,GenomicRanges-method', as far as I see.

I can update the IRanges man page to include distance().
> Finally, and admittedly a total edge case, but zero-width ranges
> appear to be mishandled, or not handled, by the IRanges distance
> code.  I think they should be infinitely far from anything, or at
> least NA, and there should not be any nearest or distanceToNearest
> value for a zero-width query.
cc'ing Herve and Michael for history.

In general I agree that zero-width ranges should not have a distance 
from any other range. I favor returning NA instead of inf since the 
range (albeit zero-width) does have a location.

One concern I have is that we use zero-width ranges to represent 
insertions. In that context we may want to compute the distance to the 
nearest insertion. I do not have a concrete use case - just thinking of 
what might come. I'd be interested in opinions on this.

Valerie


>> x
> IRanges of length 1
>      start end width
> [1]     2   1     0
>
>> distance(x, x)
> [1] 1
>
>> y
> IRanges of length 1
>      start end width
> [1]     3   2     0
>
>> distance(x, y)
> [1] 2
>
>> z
> IRanges of length 1
>      start end width
> [1]     3   3     1
>
>> distance(x, z)
> [1] 2
>
> On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote:
>
>> A few follow up items from an off-list conversation with Harris.
>>
>> 1) I reversed the use of 5' and3' in my response below. Sorry about that.
>>
>> 2) The primary landing page for ?precede is in IRanges. I've expanded the \seealso section of this page to point the user to the man page in GenomicRanges.
>>
>> 3) The precede/follow man page in GenomicRanges now  specifically states that 5' to 3' is the relevant orientation for precede/follow. For those interested in taking a look, the man page has several aliases,
>>
>> ?'nearest-methods'
>> ?'precede,GenomicRanges,GenomicRanges-method'
>> ?'follow,GenomicRanges,GenomicRanges-method'
>>
>> Changes can be found in IRanges 1.17.3  and GenomicRanges 1.11.4.
>>
>> Thanks Harris.
>>
>> Valerie
>>
>> On 10/12/2012 04:17 PM, Valerie Obenchain wrote:
>>> Hi Harris,
>>>
>>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote:
>>>> Apologies in advance if I just don't understand the ignore.strand switch, or
>>>> perhaps GRanges objects.  Also, I have not even tried to understand
>>>>
>>>>     GenomicRanges:::.GenomicRanges_findPrecedeFollow
>>>>
>>>> I have the impression that a query with strand entirely "*" essentially implies
>>>> ignore.strand, but this case here is handled correctly only if ignore.strand is
>>>> TRUE (consistent with distanceToNearest but not distance, which seems correct):
>>>>
>>>>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", strand="*")
>>>>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), seqnames="chr1", strand=c("+","-"))
>>>>> distance(x, Y[1])
>>>> [1] 1
>>>>> distance(x, Y[2])
>>>> [1] 2
>>>>
>>>>> nearest(x, Y, ignore.strand=TRUE)    # correct
>>>> [1] 1
>>>>> nearest(ranges(x), ranges(Y))           # also correct
>>>> [1] 1
>>>>
>>>> However,
>>>>
>>>>> nearest(x, Y)
>>>> [1] 2
>>> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and 1.10.2.
>>>
>>> The problem was in .nearest() where I computed the preceding and following distance of the ranges. I was missing abs() so the distance for the second range was -2 instead of 2. When checking to see which range was closest
>>>
>>>> 1<  -2
>>> [1] FALSE
>>>
>>> so the second range won.
>>>>> distanceToNearest(x, Y)
>>>> DataFrame with 1 row and 3 columns
>>>>   queryHits subjectHits  distance
>>>> <integer>  <integer>  <integer>
>>>> 1         1           2         2
>>>>
>>>>> distanceToNearest(x, Y, ignore.strand=TRUE)
>>>> DataFrame with 1 row and 3 columns
>>>>    queryHits subjectHits  distance
>>>> <integer>  <integer>  <integer>
>>>> 1         1           1         1
>>>>
>>>>
>>>> Finally,
>>>>
>>>>> follow(ranges(x), ranges(Y))
>>>> [1] NA
>>>>
>>>> The issue (along with GenomicRanges:::.nearest) must come down to this:
>>>>
>>>>> follow(x, Y)                # how can this be?
>>>> [1] 2
>>> This behavior is due to the fact that a '*' range can compare itself to either '+' or '-'.  The 'x' is '*" located at position 5.  When comparing it to the '+'  Y[1]
>>> we think of both ranges as '+'. precede() and follow() locations are determined by moving from 3' to 5'. On '+' strand this is from left to right so 5 precedes 6.
>>>
>>>> precede(x, Y[1])
>>> [1] 1
>>>
>>> but does not follow it
>>>
>>>> follow(x, Y[1])
>>> [1] NA
>>>
>>> When comparing the '*' to Y[2] we think of both ranges as '-'. On the '-' moving from  3' to 5' is right to left so the  5 follows 7
>>>
>>>> follow(x, Y[2])
>>> [1] 1
>>>
>>> but does not precede it
>>>
>>>> precede(x, Y[2])
>>> [1] NA
>>>
>>> When we set ignore.strand=TRUE, all ranges are considered '+'.
>>>
>>> Valerie
>>>> whereas
>>>>
>>>>> follow(x, Y, ignore.strand=TRUE)    # correct, I think
>>>> [1] NA
>>>>
>>>>> sessionInfo()
>>>> R Under development (unstable) (2012-10-10 r60908)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>   [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
>>>>   [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
>>>>   [5] LC_MONETARY=en_US.iso885915    LC_MESSAGES=en_US.iso885915
>>>>   [7] LC_PAPER=C                     LC_NAME=C
>>>>   [9] LC_ADDRESS=C                   LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices datasets  utils     methods   base
>>>>
>>>> other attached packages:
>>>> [1] GenomicRanges_1.11.0 IRanges_1.17.0       BiocGenerics_0.5.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] parallel_2.16.0 stats4_2.16.0   tools_2.16.0
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list