[BioC] IRanges coverage integer limit?

Nicolas Delhomme delhomme at embl.de
Fri Jul 13 16:29:27 CEST 2012


Thanks! That's awesome!

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------





On Jul 13, 2012, at 3:36 AM, Hervé Pagès wrote:

> Hi Nico,
> 
> On 07/11/2012 02:29 AM, Nicolas Delhomme wrote:
>> Hi Hervé,
>> 
>> On Jul 10, 2012, at 7:44 PM, Hervé Pagès wrote:
>> 
>>> Hi Nico,
>>> 
>>> The overflow issue is addressed in IRanges 1.15.18 (devel).
>> 
>> Thanks!
>> 
>>> 
>>> On 07/04/2012 02:16 AM, Nicolas Delhomme wrote:
>>>> Great, thanks!
>>>> 
>>>> Hervé - how much effort is it to extend it to numeric? I'm willing to do it, I just do not want to start on something where YOU would say it's though ;-)
>>> 
>>> I don't think it would be tough at all. The real question is: do we want
>>> coverage() to always return a numeric-Rle instead of integer-Rle? This
>>> will make the Rle 50% bigger in memory, probably not a big deal. On the
>>> other hand this would allow treating numeric weights really as numerics
>>> instead of truncating them like we do right now:
>>> 
>>>  > coverage(IRanges(1:3, width=10))
>>>  'integer' Rle of length 12 with 5 runs
>>>    Lengths: 1 1 8 1 1
>>>    Values : 1 2 3 2 1
>>>  > coverage(IRanges(1:3, width=10), weight=2.86)
>>>  'integer' Rle of length 12 with 5 runs
>>>    Lengths: 1 1 8 1 1
>>>    Values : 2 4 6 4 2
>>> 
>>> Maybe one option would be to return an integer-Rle when 'weight' is
>>> integer and a numeric-Rle when it's numeric. So by default (i.e. when
>>> no weights are supplied) it would still return an integer-Rle (because
>>> the default for 'weight' is 1L).
>>> But coverage(IRanges(1:3, width=10), weight=2) would return a
>>> numeric-Rle and coverage(IRanges(1:3, width=10), weight=2L)
>>> an integer-Rle.
>>> 
>>> How does that sound?
>> 
>> That sounds really great! I find it actually really intuitive, i.e. that's how I would expect it to behave.
> 
> This is done in IRanges 1.15.20:
> 
> - with integer weights:
> 
>  > coverage(IRanges(1:3, width=10), weight=2L)
>  integer-Rle of length 12 with 5 runs
>    Lengths: 1 1 8 1 1
>    Values : 2 4 6 4 2
> 
> - width numeric weights:
> 
>  > coverage(IRanges(1:3, width=10), weight=2)
>  numeric-Rle of length 12 with 5 runs
>    Lengths: 1 1 8 1 1
>    Values : 2 4 6 4 2
> 
>  > coverage(IRanges(1:3, width=10), weight=2.86)
>  numeric-Rle of length 12 with 5 runs
>    Lengths:    1    1    8    1    1
>    Values : 2.86 5.72 8.58 5.72 2.86
> 
>  > coverage(IRanges(1:3, width=10), weight=1e9)
>  numeric-Rle of length 12 with 5 runs
>    Lengths:     1     1     8     1     1
>    Values : 1e+09 2e+09 3e+09 2e+09 1e+09
> 
> Cheers,
> H.
> 
>> 
>> Let me know if there's something I can do to help with the changes,
>> 
>> Nico
>> 
>>> 
>>> H.
>>> 
>>>> 
>>>> Nico
>>>> 
>>>> ---------------------------------------------------------------
>>>> Nicolas Delhomme
>>>> 
>>>> Genome Biology Computational Support
>>>> 
>>>> European Molecular Biology Laboratory
>>>> 
>>>> Tel: +49 6221 387 8310
>>>> Email: nicolas.delhomme at embl.de
>>>> Meyerhofstrasse 1 - Postfach 10.2209
>>>> 69102 Heidelberg, Germany
>>>> ---------------------------------------------------------------
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Jul 3, 2012, at 8:00 PM, Hervé Pagès wrote:
>>>> 
>>>>> On 07/03/2012 09:40 AM, Nicolas Delhomme wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I've just discovered that the IRanges coverage function would "overflow" without warnings. Below is an example that reproduce it:
>>>>>> 
>>>>>> library(IRanges)
>>>>>> rngs <- IRanges(c(1:100),width=100)
>>>>>> coverage(rngs)
>>>>>> 
>>>>>> 'integer' Rle of length 199 with 199 runs
>>>>>>   Lengths:  1  1  1  1  1  1  1  1  1  1  1 ...  1  1  1  1  1  1  1  1  1  1
>>>>>>   Values :  1  2  3  4  5  6  7  8  9 10 11 ... 10  9  8  7  6  5  4  3  2  1
>>>>>> 
>>>>>> coverage(rngs,weight=1e9)
>>>>>> 
>>>>>> 'integer' Rle of length 200 with 200 runs
>>>>>>   Lengths:           1           1           1 ...           1           1
>>>>>>   Values :  1000000000  2000000000 -1294967296 ...  1000000000           0
>>>>>> 
>>>>>> runValue(coverage(rngs,weight=1e9))
>>>>>>   [1]  1000000000  2000000000 -1294967296  -294967296   705032704  1705032704
>>>>>>   [7] -1589934592  -589934592   410065408  1410065408 -1884901888  -884901888
>>>>>> ...
>>>>>> 
>>>>>> Clearly, the third position that has a coverage of 3 (not weighted) has a 3e9 weighted one which is > 2^31 (signed integer limit on most machine). I'm just surprised that it is silently ignored.
>>>>>> 
>>>>>> For NGS, getting a bp coverage > 2^31 is unlikely, although I've already seen extremely high coverage for Ribosomal-like protein that were only 10 order of magnitude away (~2M X). This limits the ranges of weights that can be used (weight as of now can only be integers), i.e. a weight of 100 would already be borderline.
>>>>>> 
>>>>>> Is there a way around this, coverage being such a very handy function? I understand that weight being integers probably makes computation faster, but what could be the overhead of allowing numeric instead? And I don't mind looking under the hood if that helps.
>>>>> 
>>>>> Thanks Nico for catching this other one. I will keep operations in the
>>>>> int space for now (so an 'integer' Rle is always returned) but will make
>>>>> sure a warning is issued and NAs are returned in case of overflow.
>>>>> 
>>>>> H.
>>>>> 
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Nico
>>>>>> 
>>>>>> sessionInfo()
>>>>>> R version 2.15.1 (2012-06-22)
>>>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>>>> 
>>>>>> locale:
>>>>>> [1] C/UTF-8/C/C/C/C
>>>>>> 
>>>>>> attached base packages:
>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>> 
>>>>>> other attached packages:
>>>>>> [1] IRanges_1.15.17    BiocGenerics_0.3.0
>>>>>> 
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] stats4_2.15.1 tools_2.15.1
>>>>>> 
>>>>>> 
>>>>>> ---------------------------------------------------------------
>>>>>> Nicolas Delhomme
>>>>>> 
>>>>>> Genome Biology Computational Support
>>>>>> 
>>>>>> European Molecular Biology Laboratory
>>>>>> 
>>>>>> Tel: +49 6221 387 8310
>>>>>> Email: nicolas.delhomme at embl.de
>>>>>> Meyerhofstrasse 1 - Postfach 10.2209
>>>>>> 69102 Heidelberg, Germany
>>>>>> 
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Hervé Pagès
>>>>> 
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>> 
>>>>> E-mail: hpages at fhcrc.org
>>>>> Phone:  (206) 667-5791
>>>>> Fax:    (206) 667-1319
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Hervé Pagès
>>> 
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>> 
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>> 
>>> 
>> 
> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
> 
> 



More information about the Bioconductor mailing list