[BioC] IRanges coverage integer limit?

Hervé Pagès hpages at fhcrc.org
Fri Jul 13 03:36:58 CEST 2012


Hi Nico,

On 07/11/2012 02:29 AM, Nicolas Delhomme wrote:
> Hi Hervé,
>
> On Jul 10, 2012, at 7:44 PM, Hervé Pagès wrote:
>
>> Hi Nico,
>>
>> The overflow issue is addressed in IRanges 1.15.18 (devel).
>
> Thanks!
>
>>
>> On 07/04/2012 02:16 AM, Nicolas Delhomme wrote:
>>> Great, thanks!
>>>
>>> Hervé - how much effort is it to extend it to numeric? I'm willing to do it, I just do not want to start on something where YOU would say it's though ;-)
>>
>> I don't think it would be tough at all. The real question is: do we want
>> coverage() to always return a numeric-Rle instead of integer-Rle? This
>> will make the Rle 50% bigger in memory, probably not a big deal. On the
>> other hand this would allow treating numeric weights really as numerics
>> instead of truncating them like we do right now:
>>
>>   > coverage(IRanges(1:3, width=10))
>>   'integer' Rle of length 12 with 5 runs
>>     Lengths: 1 1 8 1 1
>>     Values : 1 2 3 2 1
>>   > coverage(IRanges(1:3, width=10), weight=2.86)
>>   'integer' Rle of length 12 with 5 runs
>>     Lengths: 1 1 8 1 1
>>     Values : 2 4 6 4 2
>>
>> Maybe one option would be to return an integer-Rle when 'weight' is
>> integer and a numeric-Rle when it's numeric. So by default (i.e. when
>> no weights are supplied) it would still return an integer-Rle (because
>> the default for 'weight' is 1L).
>> But coverage(IRanges(1:3, width=10), weight=2) would return a
>> numeric-Rle and coverage(IRanges(1:3, width=10), weight=2L)
>> an integer-Rle.
>>
>> How does that sound?
>
> That sounds really great! I find it actually really intuitive, i.e. that's how I would expect it to behave.

This is done in IRanges 1.15.20:

- with integer weights:

   > coverage(IRanges(1:3, width=10), weight=2L)
   integer-Rle of length 12 with 5 runs
     Lengths: 1 1 8 1 1
     Values : 2 4 6 4 2

- width numeric weights:

   > coverage(IRanges(1:3, width=10), weight=2)
   numeric-Rle of length 12 with 5 runs
     Lengths: 1 1 8 1 1
     Values : 2 4 6 4 2

   > coverage(IRanges(1:3, width=10), weight=2.86)
   numeric-Rle of length 12 with 5 runs
     Lengths:    1    1    8    1    1
     Values : 2.86 5.72 8.58 5.72 2.86

   > coverage(IRanges(1:3, width=10), weight=1e9)
   numeric-Rle of length 12 with 5 runs
     Lengths:     1     1     8     1     1
     Values : 1e+09 2e+09 3e+09 2e+09 1e+09

Cheers,
H.

>
> Let me know if there's something I can do to help with the changes,
>
> Nico
>
>>
>> H.
>>
>>>
>>> Nico
>>>
>>> ---------------------------------------------------------------
>>> Nicolas Delhomme
>>>
>>> Genome Biology Computational Support
>>>
>>> European Molecular Biology Laboratory
>>>
>>> Tel: +49 6221 387 8310
>>> Email: nicolas.delhomme at embl.de
>>> Meyerhofstrasse 1 - Postfach 10.2209
>>> 69102 Heidelberg, Germany
>>> ---------------------------------------------------------------
>>>
>>>
>>>
>>>
>>>
>>> On Jul 3, 2012, at 8:00 PM, Hervé Pagès wrote:
>>>
>>>> On 07/03/2012 09:40 AM, Nicolas Delhomme wrote:
>>>>> Hi,
>>>>>
>>>>> I've just discovered that the IRanges coverage function would "overflow" without warnings. Below is an example that reproduce it:
>>>>>
>>>>> library(IRanges)
>>>>> rngs <- IRanges(c(1:100),width=100)
>>>>> coverage(rngs)
>>>>>
>>>>> 'integer' Rle of length 199 with 199 runs
>>>>>    Lengths:  1  1  1  1  1  1  1  1  1  1  1 ...  1  1  1  1  1  1  1  1  1  1
>>>>>    Values :  1  2  3  4  5  6  7  8  9 10 11 ... 10  9  8  7  6  5  4  3  2  1
>>>>>
>>>>> coverage(rngs,weight=1e9)
>>>>>
>>>>> 'integer' Rle of length 200 with 200 runs
>>>>>    Lengths:           1           1           1 ...           1           1
>>>>>    Values :  1000000000  2000000000 -1294967296 ...  1000000000           0
>>>>>
>>>>> runValue(coverage(rngs,weight=1e9))
>>>>>    [1]  1000000000  2000000000 -1294967296  -294967296   705032704  1705032704
>>>>>    [7] -1589934592  -589934592   410065408  1410065408 -1884901888  -884901888
>>>>> ...
>>>>>
>>>>> Clearly, the third position that has a coverage of 3 (not weighted) has a 3e9 weighted one which is > 2^31 (signed integer limit on most machine). I'm just surprised that it is silently ignored.
>>>>>
>>>>> For NGS, getting a bp coverage > 2^31 is unlikely, although I've already seen extremely high coverage for Ribosomal-like protein that were only 10 order of magnitude away (~2M X). This limits the ranges of weights that can be used (weight as of now can only be integers), i.e. a weight of 100 would already be borderline.
>>>>>
>>>>> Is there a way around this, coverage being such a very handy function? I understand that weight being integers probably makes computation faster, but what could be the overhead of allowing numeric instead? And I don't mind looking under the hood if that helps.
>>>>
>>>> Thanks Nico for catching this other one. I will keep operations in the
>>>> int space for now (so an 'integer' Rle is always returned) but will make
>>>> sure a warning is issued and NAs are returned in case of overflow.
>>>>
>>>> H.
>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Nico
>>>>>
>>>>> sessionInfo()
>>>>> R version 2.15.1 (2012-06-22)
>>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] C/UTF-8/C/C/C/C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] IRanges_1.15.17    BiocGenerics_0.3.0
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] stats4_2.15.1 tools_2.15.1
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------
>>>>> Nicolas Delhomme
>>>>>
>>>>> Genome Biology Computational Support
>>>>>
>>>>> European Molecular Biology Laboratory
>>>>>
>>>>> Tel: +49 6221 387 8310
>>>>> Email: nicolas.delhomme at embl.de
>>>>> Meyerhofstrasse 1 - Postfach 10.2209
>>>>> 69102 Heidelberg, Germany
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>>
>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org
>>>> Phone:  (206) 667-5791
>>>> Fax:    (206) 667-1319
>>>>
>>>>
>>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list