# [Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Martin Maechler maechler at stat.math.ethz.ch
Wed Jun 7 12:54:19 CEST 2017

```
>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Tue, 6 Jun 2017 09:45:44 +0200 writes:

>>>>> Hervé Pagès <hpages at fredhutch.org>
>>>>>     on Fri, 2 Jun 2017 04:05:15 -0700 writes:

>> Hi, I have a long numeric vector 'xx' and I want to use
>> sum() to count the number of elements that satisfy some
>> criteria like non-zero values or values lower than a
>> certain threshold etc...

>> The problem is: sum() returns an NA (with a warning) if
>> the count is greater than 2^31. For example:

>> > xx <- runif(3e9)
>> > sum(xx < 0.9)
>> [1] NA
>> Warning message:
>> In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))

>> This already takes a long time and doing
>> sum(as.numeric(.)) would take even longer and require
>> allocation of 24 GB of memory just to store an
>> intermediate numeric vector made of 0s and 1s. Plus,
>> having to do sum(as.numeric(.)) every time I need to
>> count things is not convenient and is easy to forget.

>> It seems that sum() on a logical vector could be modified
>> to return the count as a double when it cannot be
>> represented as an integer.  Note that length() already
>> does this so that wouldn't create a precedent. Also and
>> FWIW prod() avoids the problem by always returning a
>> double, whatever the type of the input is (except on a
>> complex vector).

>> I can provide a patch if this change sounds reasonable.

> This sounds very reasonable, thank you Hervé, for the
> report, and even more for a (small) patch.

I was made aware of the fact that R treats logical and
integer very often identically in the C code, and in general we
even mention that logicals are treated as 0/1/NA integers in
arithmetic.

For the present case that would mean that we should also
safeguard against *integer* overflow in sum(.), and that is
not something we have done / wanted to do in the past, speed
being one reason.

So this ends up being more delicate than I had thought at first,
because changing  sum(<logical>)  only would mean that

    sum(LOGI)
and
    sum(as.integer(LOGI))

would start to differ for a logical vector LOGI.

So, for now this is something that must be approached carefully,
and the R Core team may want to discuss it "in private" first.

I'm sorry for having raised possibly unrealistic expectations.
Martin

> Martin

>> Cheers, H.

>> --
>> Hervé Pagès

>> Program in Computational Biology Division of Public
>> Health Sciences Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA
>> 98109-1024

>> E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax:
>> (206) 667-1319

>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

```
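The trade-off discussed above can be sketched in C. This is a minimal illustration, not R's actual source: the function names `sum_int32_checked` and `sum_widened` are hypothetical, and the sketch ignores NA handling entirely. It assumes that logical values are stored as 0/1 ints, as the message describes, so the same code path serves logical and integer input. The first function reports failure when the total does not fit in a 32-bit int (the case that currently yields NA with a warning); the second returns the total as a double instead, which is the behaviour Hervé proposes.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum n ints into a 32-bit result.
 * Returns 0 on success, -1 if the running total leaves the int range.
 * A 64-bit accumulator is used so the check itself cannot overflow. */
int sum_int32_checked(const int *x, size_t n, int *out)
{
    assert(out != NULL);
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
        if (acc > INT_MAX || acc < INT_MIN)
            return -1;              /* caller would report NA plus a warning */
    }
    *out = (int)acc;
    return 0;
}

/* Hypothetical alternative: always accumulate in 64 bits and return a double.
 * Assumes n is small enough that the 64-bit total cannot overflow
 * (n below 2^32 is sufficient for 32-bit ints). Doubles represent
 * integers exactly only up to 2^53. */
double sum_widened(const int *x, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return (double)acc;
}
```

With the two-element input `{INT_MAX, 1}`, the checked version signals overflow while the widened version returns 2147483648. Because both functions take plain ints, applying the widened version only to logical input would produce exactly the inconsistency between sum(LOGI) and sum(as.integer(LOGI)) that the message warns about.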