[BioC] nearest() for GRanges

Valerie Obenchain vobencha at fhcrc.org
Mon Jun 25 20:53:12 CEST 2012


This is now fixed in release, BiocC 2.10, GenomicRanges 1.8.7.

Note the behavior of '*' is different from previous behavior (i.e., <= v 
1.8.6). Treatment of '*' ranges was one of the aspects we clarified and 
enforced in the the recent update of precede, follows and nearest.

Previously in release '*' was treated as a '+' range,

g <- GRanges("chr1", IRanges(c(1,5,10), c(2,7,12)), "*")
 > g
GRanges with 3 ranges and 0 elementMetadata cols:
       seqnames    ranges strand
<Rle> <IRanges> <Rle>
   [1]     chr1  [ 1,  2]      *
   [2]     chr1  [ 5,  7]      *
   [3]     chr1  [10, 12]      *
   ---
   seqlengths:
    chr1
      NA
 > precede(g)
[1]  2  3 NA
 > follow(g)
[1] NA  1  2
 > nearest(g)
[1] 2 1 2


The new behavior of '*' (in both release and devel) considers both '+' 
and '-' possibilities. For details see the 'matching by strand' section 
described in precede() on the man page for ?GRanges.

 > precede(g)
[1] 2 1 2
 > follow(g)
[1] 2 1 2
 > nearest(g)
[1] 2 1 2


Valerie

On 06/22/2012 03:25 PM, Cook, Malcolm wrote:
> Great news, Valerie... thanks very much... I will take immediate advantage
> of this... after re-reading your report of 'an overhaul' I would well
> understand if back-porting your fix in dev to release would be onerous to
> impossible.
>
> I hope it goes quickly and smoothly....
>
> Cheers,
>
> Malcolm
>
>
> On 6/22/12 4:00 PM, "Valerie Obenchain"<vobencha at fhcrc.org>  wrote:
>
>> On 06/20/2012 05:20 PM, Cook, Malcolm wrote:
>>> Hi Valerie,
>>>
>>> Very glad you found and fixed the root cause.
>>>
>>> I don't know the overhead it would take for you, but, this being a
>>> regression, might you consider fixing in Bioconductor 2.10 as, say
>>> GenomicRanges_1.8.
>>>
>> Yes, I will fix this in release too. If not today then first thing next
>> week.
>>
>> Valerie
>>> Thanks for your consideration,
>>>
>>> Malcolm
>>>
>>> On 6/20/12 3:13 PM, "Valerie Obenchain"<vobencha at fhcrc.org>   wrote:
>>>
>>>> Hi Oleg, Malcom,
>>>>
>>>> Thanks for the bug report. This is now fixed in devel 1.9.28.  Over the
>>>> past months we've done an overhaul of the precede/follow code in devel.
>>>> The new nearest method is based on the new precede and follow and is
>>>> documented at
>>>>
>>>> ?'nearest,GenomicRanges,GenomicRanges-method'
>>>>
>>>> Let me know if you run into problems.
>>>>
>>>> Valerie
>>>>
>>>>
>>>>
>>>> On 06/18/2012 02:25 PM, Cook, Malcolm wrote:
>>>>> Martin, Oleg, Val, all,
>>>>>
>>>>> I too have script logic that benefitted from and depends upon what the
>>>>> behavior of nearest,GenomicRanges,missing as reported by Oleg.
>>>>>
>>>>> Thanks for the unit tests Martin.
>>>>>
>>>>> If it helps in sleuthing, in my hands, the 3rd test used to pass (if
>>>>> my
>>>>> memory serves), but does not pass now, as the attached transcript
>>>>> shows.
>>>>>
>>>>> Hoping it helps find a speedy resolution to this issue,
>>>>>
>>>>> ~ Malcolm Cook
>>>>>
>>>>>
>>>>>
>>>>>>     r<- IRanges(c(1,5,10), c(2,7,12))
>>>>>>     g<- GRanges("chr1", r, "+")
>>>>>>     checkEquals(precede(r), precede(g))
>>>>> [1] TRUE
>>>>>>      checkEquals(follow(r), follow(g))
>>>>> [1] TRUE
>>>>>>     try(checkEquals(nearest(r), nearest(g)))
>>>>> Error in checkEquals(nearest(r), nearest(g)) :
>>>>>      Mean relative difference: 0.6
>>>>>
>>>>>
>>>>>> sessionInfo()
>>>>> R version 2.15.0 (2012-03-30)
>>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] C
>>>>>
>>>>> attached base packages:
>>>>>     [1] tools     splines   parallel  stats     graphics  grDevices
>>>>> utils
>>>>> datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>>     [1] RUnit_0.4.26          log4r_0.1-4           vwr_0.1
>>>>> RecordLinkage_0.4-1   ffbase_0.5            ff_2.2-7
>>>>> bit_1.1-8             evd_2.2-6             ipred_0.8-13
>>>>> prodlim_1.3.1         KernSmooth_2.23-7     nnet_7.3-1
>>>>> survival_2.36-14      mlbench_2.1-0         MASS_7.3-18
>>>>> ada_2.0-2             rpart_3.1-53          e1071_1.6
>>>>> class_7.3-3           XLConnect_0.1-9       XLConnectJars_0.1-4
>>>>> rJava_0.9-3           latticeExtra_0.6-19   RColorBrewer_1.0-5
>>>>> lattice_0.20-6        doMC_1.2.5            multicore_0.1-7
>>>>> [28] BSgenome_1.24.0       rtracklayer_1.16.1    Rsamtools_1.8.5
>>>>> Biostrings_2.24.1     GenomicFeatures_1.8.1 AnnotationDbi_1.18.1
>>>>> GenomicRanges_1.8.6   IRanges_1.14.3        Biobase_2.16.0
>>>>> BiocGenerics_0.2.0    data.table_1.8.0      compare_0.2-3
>>>>> svUnit_0.7-10         doParallel_1.0.1      iterators_1.0.6
>>>>> foreach_1.4.0         ggplot2_0.9.1         sqldf_0.4-6.4
>>>>> RSQLite.extfuns_0.0.1 RSQLite_0.11.1        chron_2.3-42
>>>>> gsubfn_0.6-3          proto_0.3-9.2         DBI_0.2-5
>>>>> functional_0.1        reshape_0.8.4         plyr_1.7.1
>>>>> [55] stringr_0.6           gtools_2.6.2
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>>     [1] RCurl_1.91-1     XML_3.9-4        biomaRt_2.12.0
>>>>> bitops_1.0-4.1
>>>>> codetools_0.2-8  colorspace_1.1-1 compiler_2.15.0  dichromat_1.2-4
>>>>> digest_0.5.2     grid_2.15.0      labeling_0.1     memoise_0.1
>>>>> munsell_0.3      reshape2_1.2.1   scales_0.2.1     stats4_2.15.0
>>>>> tcltk_2.15.0     zlibbioc_1.2.0
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 6/18/12 2:39 PM, "Martin Morgan"<mtmorgan at fhcrc.org>    wrote:
>>>>>
>>>>>> Hi Oleg --
>>>>>>
>>>>>> On 06/17/2012 11:11 PM, Oleg Mayba wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I just noticed that a piece of logic I was relying on with GRanges
>>>>>>> before
>>>>>>> does not seem to work anymore.  Namely, I expect the behavior of
>>>>>>> nearest()
>>>>>>> with a single GRanges object as an argument to be the same as that
>>>>>>> of
>>>>>>> IRanges (example below), but it's not anymore.  I expect
>>>>>>> nearest(GR1)
>>>>>>> NOT
>>>>>>> to behave trivially but to return the closest range OTHER than the
>>>>>>> range
>>>>>>> itself.  I could swear that was the case before, but isn't any
>>>>>>> longer:
>>>>>> I think you're right that there is an inconsistency here; Val will
>>>>>> likely help clarify in a day or so. My two cents...
>>>>>>
>>>>>> I think, certainly, that GRanges on a single chromosome on the "+"
>>>>>> strand should behave like an IRanges
>>>>>>
>>>>>>      library(GenomicRanges)
>>>>>>      library(RUnit)
>>>>>>
>>>>>>      r<- IRanges(c(1,5,10), c(2,7,12))
>>>>>>      g<- GRanges("chr1", r, "+")
>>>>>>
>>>>>>      ## first two ok, third should work but fails
>>>>>>      checkEquals(precede(r), precede(g))
>>>>>>      checkEquals(follow(r), follow(g))
>>>>>>      try(checkEquals(nearest(r), nearest(g)))
>>>>>>
>>>>>> Also, on the "-" strand I think we're expecting
>>>>>>
>>>>>>      g<- GRanges("chr1", r, "-")
>>>>>>
>>>>>>      ## first two ok, third should work but fails
>>>>>>      checkEquals(follow(r), precede(g))
>>>>>>      checkEquals(precede(r), follow(g))
>>>>>>      try(checkEquals(nearest(r), nearest(g)))
>>>>>>
>>>>>> For "*" (which was your example) I think the situation is (a)
>>>>>> different
>>>>>> in devel than in release; and (b) not so clear. In devel, "*" is
>>>>>> (from
>>>>>> method?"nearest,GenomicRanges,missing") "x on '*' strand can match to
>>>>>> ranges on any of ''+'', ''-'' or ''*''" and in particular I think
>>>>>> these
>>>>>> are always true:
>>>>>>
>>>>>>      checkEquals(precede(g), follow(g))
>>>>>>      checkEquals(nearest(r), follow(g))
>>>>>>
>>>>>> I would also expect
>>>>>>
>>>>>>      try(checkEquals(nearest(g), follow(g)))
>>>>>>
>>>>>> though this seems not to be the case. In 'release', "*" is coereced
>>>>>> and
>>>>>> behaves as if on the "+" strand (I think).
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>>> z=IRanges(start=c(1,5,10), end=c(2,7,12))
>>>>>>>> z
>>>>>>> IRanges of length 3
>>>>>>>         start end width
>>>>>>> [1]     1   2     2
>>>>>>> [2]     5   7     3
>>>>>>> [3]    10  12     3
>>>>>>>> nearest(z)
>>>>>>> [1] 2 1 2
>>>>>>>> z=GRanges(seqnames=rep('chr1',3),ranges=IRanges(start=c(1,5,10),
>>>>>>> end=c(2,7,12)))
>>>>>>>> z
>>>>>>> GRanges with 3 ranges and 0 elementMetadata cols:
>>>>>>>           seqnames    ranges strand
>>>>>>>              <Rle>     <IRanges>      <Rle>
>>>>>>>       [1]     chr1  [ 1,  2]      *
>>>>>>>       [2]     chr1  [ 5,  7]      *
>>>>>>>       [3]     chr1  [10, 12]      *
>>>>>>>       ---
>>>>>>>       seqlengths:
>>>>>>>        chr1
>>>>>>>          NA
>>>>>>>> nearest(z)
>>>>>>> [1] 1 2 3
>>>>>>>> sessionInfo()
>>>>>>> R version 2.15.0 (2012-03-30)
>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>>
>>>>>>> locale:
>>>>>>>      [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>>      [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>>      [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>>>      [7] LC_PAPER=C                 LC_NAME=C
>>>>>>>      [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>>
>>>>>>> attached base packages:
>>>>>>> [1] datasets  utils     grDevices graphics  stats     methods   base
>>>>>>>
>>>>>>> other attached packages:
>>>>>>> [1] GenomicRanges_1.8.6 IRanges_1.14.3      BiocGenerics_0.2.0
>>>>>>>
>>>>>>> loaded via a namespace (and not attached):
>>>>>>> [1] stats4_2.15.0
>>>>>>>
>>>>>>>
>>>>>>> I want the IRanges behavior and not what seems currently to be the
>>>>>>> GRanges
>>>>>>> behavior, since I have some code that depends on it. Is there a
>>>>>>> quick
>>>>>>> way
>>>>>>> to make nearest() do that for me again?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Oleg.
>>>>>>>
>>>>>>> 	[[alternative HTML version deleted]]
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>> -- 
>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N.
>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>
>>>>>> Location: Arnold Building M1 B861
>>>>>> Phone: (206) 667-2793
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list