[BioC] GenomicRanges::nearest (especially follow)

Cook, Malcolm MEC at stowers.org
Wed Oct 17 17:31:52 CEST 2012


Indeed, for distance to behave as a metric it must be that distance(x,x)
== 0 for all x

Here is some further (consistent) oddness:

> compare(GRanges('x',IRanges(4,3)),GRanges('x',IRanges(4,3)))
[1] -5


I would expect the value to be 0 here.  Comparing x with itself should
yield 0.  After all they are identical!

--Malcolm


On 10/17/12 9:50 AM, "Tim Triche, Jr." <tim.triche at gmail.com> wrote:

>Hmm, that's odd.  And it seems inconsistent with the usual distance() and
>distanceToNearest() methods, which would return 0 in this case, yes?
>
>e.g. if the IRanges were GRanges and sitting on top of each other on the
>same strand, or if the first was distance()'ed against itself (the same
>thing)
>
>That doesn't make sense to me either.  This doesn't seem like an issue of
>theology to me, but rather inconsistency with what the docs say will
>happen.
>
>
>
>On Wed, Oct 17, 2012 at 7:31 AM, Harris A. Jaffee <hj at jhu.edu> wrote:
>
>> As I said, any rule is fine, especially if they can serve a purpose
>> via their "location".  But you also have this, which seems weird:
>>
>>         > distance(IRanges(2,1), IRanges(2,1))
>>         [1] 1
>>
>> To say they are not treated any differently is a big stretch to me,
>> since their defining formula (end = start-1) is already a violation
>> of my senses.  [I wonder if SEW = (start, NA, 0) wouldn't have been
>> more useful.]
>>
>> Just so we don't continue off into useless theology, I should say that
>> I started looking into the distance code with the idea of adding a
>> "signed" option (say, distance(x,y)<0 iff x is downstream from y), and
>> I wanted to understand the landscape first.
>>
>> On Oct 17, 2012, at 12:56 AM, Hervé Pagès wrote:
>>
>> > Hi Harris, Val, Michael,
>> >
>> > Right now, all those distances are equal to 8, which I think is what
>> > we want:
>> >
>> >  distance(IRanges(1,2), IRanges(10,16))
>> >  distance(IRanges(10,16), IRanges(1,2))
>> >  distance(IRanges(1,2), IRanges(10,9))
>> >  distance(IRanges(10,9),IRanges(1,2))
>> >  distance(IRanges(3,2), IRanges(10,16))
>> >  distance(IRanges(10,16), IRanges(3,2))
>> >  distance(IRanges(3,2), IRanges(10,9))
>> >  distance(IRanges(10,9), IRanges(3,2))
>> >
>> > Yes, from a mathematical point of view, the distance between a
>> > non-empty set and an empty set if not defined, but I'm not sure there
>> > would be much to gain in doing this. Some use cases would be needed.
>> > What's nice about the current behavior is that zero-width ranges are
>> > not treated any differently than other ranges.
>> >
>> > H.
>> >
>> >
>> > On 10/16/2012 12:21 PM, Harris A. Jaffee wrote:
>> >> Any rule is fine with me.  They would not occur for us except by
>> mistake.
>> >> I was basically raising an alert that the current situation is not
>>quite
>> >> complete.
>> >>
>> >> On Oct 16, 2012, at 2:53 PM, Michael Lawrence wrote:
>> >>
>> >>> Since these zero-width ranges do have a position, I'm not sure why
>>we
>> cannot calculate their distance. As long as we have an established rule
>> about how we handle them. For example, is it just from the start? That's
>> probably most intuitive.
>> >>>
>> >>> Michael
>> >>>
>> >>> On Tue, Oct 16, 2012 at 11:25 AM, Valerie Obenchain <
>> vobencha at fhcrc.org> wrote:
>> >>> On 10/16/2012 09:25 AM, Harris A. Jaffee wrote:
>> >>> Fantastic, thank you.  I'm not sure I knew about that page!
>> >>>
>> >>> For example, ?precede gives me the IRanges nearest-methods page,
>> >>> and only if I explicitly request ?'nearest-methods' can I choose
>> >>> the GenomicRanges page.  That brings up another issue: ?distance
>> >>> again sends me to the IRanges nearest-methods help, and that only
>> >>> documents distanceToNearest, which is not enough.  The spec for
>> >>> distance actually lives only on
>> >>> ?'precede,GenomicRanges,GenomicRanges-method', as far as I see.
>> >>>
>> >>> I can update the IRanges man page to include distance().
>> >>>
>> >>> Finally, and admittedly a total edge case, but zero-width ranges
>> >>> appear to be mishandled, or not handled, by the IRanges distance
>> >>> code.  I think they should be infinitely far from anything, or at
>> >>> least NA, and there should not be any nearest or distanceToNearest
>> >>> value for a zero-width query.
>> >>> cc'ing Herve and Michael for history.
>> >>>
>> >>> In general I agree that zero-width ranges should not have a distance
>> from any other range. I favor returning NA instead of inf since the
>>range
>> (albeit zero-width) does have a location.
>> >>>
>> >>> One concern I have is that we use zero-width ranges to represent
>> insertions. In that context we may want to compute the distance to the
>> nearest insertion. I do not have a concrete use case - just thinking of
>> what might come. I'd be interested in opinions on this.
>> >>>
>> >>> Valerie
>> >>>
>> >>>
>> >>>
>> >>> x
>> >>> IRanges of length 1
>> >>>      start end width
>> >>> [1]     2   1     0
>> >>>
>> >>> distance(x, x)
>> >>> [1] 1
>> >>>
>> >>> y
>> >>> IRanges of length 1
>> >>>      start end width
>> >>> [1]     3   2     0
>> >>>
>> >>> distance(x, y)
>> >>> [1] 2
>> >>>
>> >>> z
>> >>> IRanges of length 1
>> >>>      start end width
>> >>> [1]     3   3     1
>> >>>
>> >>> distance(x, z)
>> >>> [1] 2
>> >>>
>> >>> On Oct 15, 2012, at 6:27 PM, Valerie Obenchain wrote:
>> >>>
>> >>> A few follow up items from an off-list conversation with Harris.
>> >>>
>> >>> 1) I reversed the use of 5' and3' in my response below. Sorry about
>> that.
>> >>>
>> >>> 2) The primary landing page for ?precede is in IRanges. I've
>>expanded
>> the \seealso section of this page to point the user to the man page in
>> GenomicRanges.
>> >>>
>> >>> 3) The precede/follow man page in GenomicRanges now  specifically
>> states that 5' to 3' is the relevant orientation for precede/follow. For
>> those interested in taking a look, the man page has several aliases,
>> >>>
>> >>> ?'nearest-methods'
>> >>> ?'precede,GenomicRanges,GenomicRanges-method'
>> >>> ?'follow,GenomicRanges,GenomicRanges-method'
>> >>>
>> >>> Changes can be found in IRanges 1.17.3  and GenomicRanges 1.11.4.
>> >>>
>> >>> Thanks Harris.
>> >>>
>> >>> Valerie
>> >>>
>> >>> On 10/12/2012 04:17 PM, Valerie Obenchain wrote:
>> >>> Hi Harris,
>> >>>
>> >>> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote:
>> >>> Apologies in advance if I just don't understand the ignore.strand
>> switch, or
>> >>> perhaps GRanges objects.  Also, I have not even tried to understand
>> >>>
>> >>>     GenomicRanges:::.GenomicRanges_findPrecedeFollow
>> >>>
>> >>> I have the impression that a query with strand entirely "*"
>> essentially implies
>> >>> ignore.strand, but this case here is handled correctly only if
>> ignore.strand is
>> >>> TRUE (consistent with distanceToNearest but not distance, which
>>seems
>> correct):
>> >>>
>> >>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1",
>> strand="*")
>> >>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)),
>>seqnames="chr1",
>> strand=c("+","-"))
>> >>> distance(x, Y[1])
>> >>> [1] 1
>> >>> distance(x, Y[2])
>> >>> [1] 2
>> >>>
>> >>> nearest(x, Y, ignore.strand=TRUE)    # correct
>> >>> [1] 1
>> >>> nearest(ranges(x), ranges(Y))           # also correct
>> >>> [1] 1
>> >>>
>> >>> However,
>> >>>
>> >>> nearest(x, Y)
>> >>> [1] 2
>> >>> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and
>> 1.10.2.
>> >>>
>> >>> The problem was in .nearest() where I computed the preceding and
>> following distance of the ranges. I was missing abs() so the distance
>>for
>> the second range was -2 instead of 2. When checking to see which range
>>was
>> closest
>> >>>
>> >>> 1<  -2
>> >>> [1] FALSE
>> >>>
>> >>> so the second range won.
>> >>> distanceToNearest(x, Y)
>> >>> DataFrame with 1 row and 3 columns
>> >>>   queryHits subjectHits  distance
>> >>> <integer>  <integer>  <integer>
>> >>> 1         1           2         2
>> >>>
>> >>> distanceToNearest(x, Y, ignore.strand=TRUE)
>> >>> DataFrame with 1 row and 3 columns
>> >>>    queryHits subjectHits  distance
>> >>> <integer>  <integer>  <integer>
>> >>> 1         1           1         1
>> >>>
>> >>>
>> >>> Finally,
>> >>>
>> >>> follow(ranges(x), ranges(Y))
>> >>> [1] NA
>> >>>
>> >>> The issue (along with GenomicRanges:::.nearest) must come down to
>>this:
>> >>>
>> >>> follow(x, Y)                # how can this be?
>> >>> [1] 2
>> >>> This behavior is due to the fact that a '*' range can compare itself
>> to either '+' or '-'.  The 'x' is '*" located at position 5.  When
>> comparing it to the '+'  Y[1]
>> >>> we think of both ranges as '+'. precede() and follow() locations are
>> determined by moving from 3' to 5'. On '+' strand this is from left to
>> right so 5 precedes 6.
>> >>>
>> >>> precede(x, Y[1])
>> >>> [1] 1
>> >>>
>> >>> but does not follow it
>> >>>
>> >>> follow(x, Y[1])
>> >>> [1] NA
>> >>>
>> >>> When comparing the '*' to Y[2] we think of both ranges as '-'. On
>>the
>> '-' moving from  3' to 5' is right to left so the  5 follows 7
>> >>>
>> >>> follow(x, Y[2])
>> >>> [1] 1
>> >>>
>> >>> but does not precede it
>> >>>
>> >>> precede(x, Y[2])
>> >>> [1] NA
>> >>>
>> >>> When we set ignore.strand=TRUE, all ranges are considered '+'.
>> >>>
>> >>> Valerie
>> >>> whereas
>> >>>
>> >>> follow(x, Y, ignore.strand=TRUE)    # correct, I think
>> >>> [1] NA
>> >>>
>> >>> sessionInfo()
>> >>> R Under development (unstable) (2012-10-10 r60908)
>> >>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> >>>
>> >>> locale:
>> >>>   [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
>> >>>   [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
>> >>>   [5] LC_MONETARY=en_US.iso885915    LC_MESSAGES=en_US.iso885915
>> >>>   [7] LC_PAPER=C                     LC_NAME=C
>> >>>   [9] LC_ADDRESS=C                   LC_TELEPHONE=C
>> >>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>> >>>
>> >>> attached base packages:
>> >>> [1] stats     graphics  grDevices datasets  utils     methods   base
>> >>>
>> >>> other attached packages:
>> >>> [1] GenomicRanges_1.11.0 IRanges_1.17.0       BiocGenerics_0.5.0
>> >>>
>> >>> loaded via a namespace (and not attached):
>> >>> [1] parallel_2.16.0 stats4_2.16.0   tools_2.16.0
>> >>>
>> >>> _______________________________________________
>> >>> Bioconductor mailing list
>> >>> Bioconductor at r-project.org
>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >>> _______________________________________________
>> >>> Bioconductor mailing list
>> >>> Bioconductor at r-project.org
>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >>>
>> >>>
>> >>
>> >
>> > --
>> > Hervé Pagès
>> >
>> > Program in Computational Biology
>> > Division of Public Health Sciences
>> > Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Ave. N, M1-B514
>> > P.O. Box 19024
>> > Seattle, WA 98109-1024
>> >
>> > E-mail: hpages at fhcrc.org
>> > Phone:  (206) 667-5791
>> > Fax:    (206) 667-1319
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
>
>--
>*A model is a lie that helps you see the truth.*
>*
>*
>Howard 
>Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>
>        [[alternative HTML version deleted]]
>



More information about the Bioconductor mailing list