[BioC] Integer overflow when summing an 'integer' Rle

Nicolas Delhomme delhomme at embl.de
Tue Feb 14 17:35:48 CET 2012


Hi Hervé,

Happy New Year! Well, we're already in mid-February, but most of the year is still ahead of us ;-)

On 10 Feb 2012, at 19:30, Hervé Pagès wrote:

> Hi Nico,
> 
> On 02/10/2012 08:04 AM, Nicolas Delhomme wrote:
>> Hi all,
>> 
>> While calculating some statistics for an RNA-seq experiment, I stumbled upon the following problem. Applying the IRanges coverage function to my IRanges, I get back an integer Rle object. However, trying to get the mean or sum of that Rle object results in an integer overflow. The following example illustrates that overflow.
>> 
>> library(IRanges)
>> rC <- Rle(values=as.integer(c(1,(2^31)-1,1)))
>> sum(rC)
>> mean(rC)
>> 
>> Both result in an integer overflow.
>> 
>> [1] NA
>> Warning message:
>> In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
>>   Integer overflow - use sum(as.numeric(.))
>> 
>> One workaround is the following:
>> 
>> # convert to double before multiplying, so the product itself cannot overflow
>> sum(as.numeric(runLength(rC)) * runValue(rC))
> 
> Another solution is to convert the 'integer' Rle into a 'numeric' Rle
> before doing sum(). Unfortunately, since we don't have separate
> classes for those (like for example an IntegerRle and a DoubleRle
> class) it cannot be done using direct coercion i.e. with something
> like:
> 
>  as(rC, "DoubleRle")
> 
> (Maybe we should have individual Rle subclasses for 'integer' Rle,
> 'numeric' Rle, 'logical' Rle, 'character' Rle, 'factor' Rle etc...)
> 

That could be useful. A few times I have had to do quite a lot of conversions to go back and forth between different "kinds" of Rle. Having subclasses would be great.

> So for now, this conversion must be done with:
> 
> > class(runValue(rC)) <- "double"
> > rC
> 'numeric' Rle of length 3 with 3 runs
>  Lengths:          1          1          1
>  Values :          1 2147483647          1
> 
> This works fine with an Rle, but not so much with an RleList where
> one needs to do some ugly contortions in order to succeed.

Well, I ended up doing that in an lapply and it works just fine. It is not the most memory-efficient approach, though.
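
For reference, the per-element conversion looks roughly like the sketch below. The example RleList and the helper name toNumericRle are made up for illustration, and endoapply() is used instead of a plain lapply() only so that the result stays an RleList.

```r
library(IRanges)

# Hypothetical helper: convert the run values of a single Rle to double.
toNumericRle <- function(x) {
    runValue(x) <- as.numeric(runValue(x))
    x
}

# Made-up example data standing in for per-chromosome coverage.
covList <- RleList(chr1 = Rle(c(1L, .Machine$integer.max, 1L)),
                   chr2 = Rle(2L, 5L))

# endoapply() applies the helper to each element and returns an RleList.
covNum <- endoapply(covList, toNumericRle)

# Each element is now a 'numeric' Rle, so its sum is computed in double.
sum(covNum[["chr1"]])
sum(covNum[["chr2"]])
```

Every element is copied once during the conversion, which is where the extra memory goes.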

> 
> As an alternative to having individual Rle subclasses, maybe we could
> have an accessor, e.g. rleValueType(), with a getter and a setter, so
> we could do:
> 
> > rleValueType(rC)
> [1] "integer"
> > rleValueType(rC) <- "double"
> 
> and that would work on Rle and RleList objects.
> 

That would indeed be very useful and probably easier to implement.
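
Just to make the idea concrete, here is a minimal sketch of what such an accessor could look like. rleValueType() and its setter do not exist in IRanges; they are defined here as plain functions purely for illustration, and the sketch ignores 'factor' Rle objects.

```r
library(IRanges)

# Hypothetical getter: report the storage type of the run values.
rleValueType <- function(x) {
    if (is(x, "RleList"))
        return(unlist(lapply(as.list(x), function(r) typeof(runValue(r)))))
    typeof(runValue(x))
}

# Hypothetical setter: convert the run values without expanding the runs.
`rleValueType<-` <- function(x, value) {
    convert <- function(r) {
        runValue(r) <- as.vector(runValue(r), mode = value)
        r
    }
    if (is(x, "RleList")) endoapply(x, convert) else convert(x)
}

rC <- Rle(values = as.integer(c(1, (2^31) - 1, 1)))
rleValueType(rC)               # "integer"
rleValueType(rC) <- "double"
rleValueType(rC)               # "double"
sum(rC)                        # computed in double, so no integer overflow
```

A real implementation would presumably be an S4 generic with methods for Rle and RleList rather than an is() test, but the behaviour would be the same.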

> Anyway, even though I think having an easy/unified way for changing
> the type of the values in Rle/RleList objects is important, maybe
> I'm going slightly off-topic.
> 
> What we should definitely do now is replace this warning:
> 
>  Warning message:
>  In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
>     Integer overflow - use sum(as.numeric(.))
> 
> by a more appropriate one (doing as.numeric() on an Rle is not a good
> idea).
> 

Indeed.
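
To spell out why: as.numeric() on an Rle expands it to a full-length vector, which defeats the purpose of the run-length encoding, whereas only the run values need converting. The sketch below uses plain vectors with made-up values and lengths in place of runValue() and runLength(), and also shows that the conversion has to happen before the multiplication.

```r
# Made-up run values and run lengths, as runValue() and runLength() would return.
values  <- c(3L, 50000L, 2L)
lengths <- c(10L, 100000L, 5L)

# Multiplying in integer arithmetic overflows for the second run
# (50000 * 100000 exceeds .Machine$integer.max) and yields NA.
prodInt <- suppressWarnings(values * lengths)
prodInt

# Converting the values to double first keeps every product exact.
total <- sum(as.numeric(values) * lengths)
total
```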


>> 
>> but IMO it should be handled in the Rle code itself; i.e. an integer Rle can clearly have a sum, a mean, etc. whose calculation involves values outside the integer range.
> 
> I agree for mean() so I'll fix that.
> 
> But for sum()... "calculating values outside the integer range",
> even if the result of this calculation itself is not in the
> integer range? base::sum() will return NA if the result is not in
> the integer range and I think that's the right thing to do.
> I don't like the idea of sum() returning a double when the input
> is integer.
> 

I'm on the same page here. Consistency (especially for R) is crucial. Under these conditions, having a meaningful warning would indeed be the best.
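
For the record, this is the base R behaviour we would be staying consistent with, shown on a plain integer vector with the same values as my Rle example:

```r
x <- c(1L, .Machine$integer.max, 1L)

# sum() of an integer vector stays integer; on overflow it returns NA
# together with a warning (suppressed here).
s <- suppressWarnings(sum(x))
is.na(s)
typeof(s)            # "integer"

# Converting first gives the exact total as a double.
sum(as.numeric(x))

# mean() of an integer vector already returns a double.
m <- mean(x)
typeof(m)            # "double"
```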

Thanks for the detailed answer and for the slightly off-topic "diversion".

Cheers,

Nico

> Cheers,
> H.
> 
>> Is there anything that speaks against having these functions internally convert the integer values to numeric before calculating the sum or mean?
>> 
>> Looking forward to hearing your thoughts on this,
>> 
>> Cheers,
>> 
>> Nico
>> 
>> sessionInfo()
>> R Under development (unstable) (2012-02-07 r58290)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>> 
>> locale:
>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> other attached packages:
>> [1] IRanges_1.13.24    BiocGenerics_0.1.4
>> 
>> loaded via a namespace (and not attached):
>> [1] tools_2.15.0
>> 
>> 
>> 
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>> 
>> Genome Biology Computational Support
>> 
>> European Molecular Biology Laboratory
>> 
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319


