[BioC] GenomicRanges::nearest (especially follow)

Valerie Obenchain vobencha at fhcrc.org
Tue Oct 16 00:27:09 CEST 2012


A few follow up items from an off-list conversation with Harris.

1) I reversed the use of 5' and3' in my response below. Sorry about that.

2) The primary landing page for ?precede is in IRanges. I've expanded 
the \seealso section of this page to point the user to the man page in 
GenomicRanges.

3) The precede/follow man page in GenomicRanges now  specifically states 
that 5' to 3' is the relevant orientation for precede/follow. For those 
interested in taking a look, the man page has several aliases,

?'nearest-methods'
?'precede,GenomicRanges,GenomicRanges-method'
?'follow,GenomicRanges,GenomicRanges-method'

Changes can be found in IRanges 1.17.3  and GenomicRanges 1.11.4.

Thanks Harris.

Valerie

On 10/12/2012 04:17 PM, Valerie Obenchain wrote:
> Hi Harris,
>
> On 10/11/2012 02:27 PM, Harris A. Jaffee wrote:
>> Apologies in advance if I just don't understand the ignore.strand 
>> switch, or
>> perhaps GRanges objects.  Also, I have not even tried to understand
>>
>>     GenomicRanges:::.GenomicRanges_findPrecedeFollow
>>
>> I have the impression that a query with strand entirely "*" 
>> essentially implies
>> ignore.strand, but this case here is handled correctly only if 
>> ignore.strand is
>> TRUE (consistent with distanceToNearest but not distance, which seems 
>> correct):
>>
>>> x = GRanges(ranges=IRanges(start=5, end=5), seqnames="chr1", 
>>> strand="*")
>>> Y = GRanges(ranges=IRanges(start=c(6,7), end=c(6,7)), 
>>> seqnames="chr1", strand=c("+","-"))
>>> distance(x, Y[1])
>> [1] 1
>>> distance(x, Y[2])
>> [1] 2
>>
>>> nearest(x, Y, ignore.strand=TRUE)    # correct
>> [1] 1
>>> nearest(ranges(x), ranges(Y))           # also correct
>> [1] 1
>>
>> However,
>>
>>> nearest(x, Y)
>> [1] 2
>
> Thanks for reporting this bug. Now fixed in GenomicRanges 1.11.3 and 
> 1.10.2.
>
>  The problem was in .nearest() where I computed the preceding and 
> following distance of the ranges. I was missing abs() so the distance 
> for the second range was -2 instead of 2. When checking to see which 
> range was closest
>
> > 1 < -2
> [1] FALSE
>
> so the second range won.
>>> distanceToNearest(x, Y)
>> DataFrame with 1 row and 3 columns
>>   queryHits subjectHits  distance
>> <integer> <integer> <integer>
>> 1         1           2         2
>>
>>> distanceToNearest(x, Y, ignore.strand=TRUE)
>> DataFrame with 1 row and 3 columns
>>    queryHits subjectHits  distance
>> <integer> <integer> <integer>
>> 1         1           1         1
>>
>>
>> Finally,
>>
>>> follow(ranges(x), ranges(Y))
>> [1] NA
>>
>> The issue (along with GenomicRanges:::.nearest) must come down to this:
>>
>>> follow(x, Y)                # how can this be?
>> [1] 2
>
> This behavior is due to the fact that a '*' range can compare itself 
> to either '+' or '-'.  The 'x' is '*" located at position 5.  When 
> comparing it to the '+'  Y[1]
>  we think of both ranges as '+'. precede() and follow() locations are 
> determined by moving from 3' to 5'. On '+' strand this is from left to 
> right so 5 precedes 6.
>
> > precede(x, Y[1])
> [1] 1
>
> but does not follow it
>
> > follow(x, Y[1])
> [1] NA
>
> When comparing the '*' to Y[2] we think of both ranges as '-'. On the 
> '-' moving from  3' to 5' is right to left so the  5 follows 7
>
> > follow(x, Y[2])
> [1] 1
>
> but does not precede it
>
> > precede(x, Y[2])
> [1] NA
>
> When we set ignore.strand=TRUE, all ranges are considered '+'.
>
> Valerie
>> whereas
>>
>>> follow(x, Y, ignore.strand=TRUE)    # correct, I think
>> [1] NA
>>
>>> sessionInfo()
>> R Under development (unstable) (2012-10-10 r60908)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>   [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
>>   [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
>>   [5] LC_MONETARY=en_US.iso885915    LC_MESSAGES=en_US.iso885915
>>   [7] LC_PAPER=C                     LC_NAME=C
>>   [9] LC_ADDRESS=C                   LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices datasets  utils     methods   base
>>
>> other attached packages:
>> [1] GenomicRanges_1.11.0 IRanges_1.17.0       BiocGenerics_0.5.0
>>
>> loaded via a namespace (and not attached):
>> [1] parallel_2.16.0 stats4_2.16.0   tools_2.16.0
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list