[Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Henrik Bengtsson henrik.bengtsson at gmail.com
Fri Jun 2 22:58:59 CEST 2017


I second this feature request (it's understandable that this and
possibly other parts of the code were left behind / forgotten after the
introduction of long vectors).

I think mean() avoids full copies, so in the meantime, you can work
around this limitation using:

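countTRUE <- function(x, na.rm = FALSE) {
  nx <- length(x)
  ## Short enough that the count always fits in an integer: plain sum() is safe
  if (nx < .Machine$integer.max) return(sum(x, na.rm = na.rm))
  ## Otherwise recover the count from the mean, which is returned as a double.
  ## Caveat: with na.rm = TRUE, mean() averages over the non-NA elements only,
  ## so nx * mean() overcounts if x contains NAs.
  nx * mean(x, na.rm = na.rm)
}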

(not sure if one needs to worry about rounding errors, i.e. cases where
n %% 1 != 0)

x <- rep(TRUE, times = .Machine$integer.max+1)
object.size(x)
## 8589934632 bytes

p <- profmem::profmem( n <- countTRUE(x) )
str(n)
## num 2.15e+09
print(n == .Machine$integer.max + 1)
## [1] TRUE

print(p)
## Rprofmem memory profiling of:
## n <- countTRUE(x)
##
## Memory allocations:
##      bytes calls
## total     0
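
If exactness or NA handling is a concern, another option is to count in
chunks, so that each partial sum() fits in an integer, and to accumulate
the partial counts in a double. The following is only a sketch:
countTRUE_chunked() and its chunk_size argument are hypothetical names,
not part of base R or matrixStats. Unlike the mean() approach it does
allocate, but only one chunk at a time.

```r
## Hypothetical helper (an illustration, not an existing API): count the
## TRUE values of a logical vector chunk by chunk.
countTRUE_chunked <- function(x, na.rm = FALSE, chunk_size = 1e6) {
  stopifnot(is.logical(x),
            chunk_size >= 1, chunk_size <= .Machine$integer.max)
  nx <- length(x)
  total <- 0        # double accumulator; exact for counts up to 2^53
  start <- 1
  while (start <= nx) {
    end <- min(start + chunk_size - 1, nx)
    ## Each chunk has at most chunk_size elements, so this sum() cannot
    ## overflow; it is NA only if the chunk has NAs and na.rm = FALSE.
    total <- total + sum(x[start:end], na.rm = na.rm)
    start <- end + 1
  }
  total
}
```

The result is always a double, e.g. countTRUE_chunked(x, na.rm = TRUE).
Because every partial count is an exact integer, there is no rounding
to worry about, and NAs are dropped per chunk when na.rm = TRUE.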


FYI / related: I've just updated matrixStats::sum2() to support
logicals (develop branch) and I'll also try to update
matrixStats::count() to count beyond .Machine$integer.max.

/Henrik

On Fri, Jun 2, 2017 at 4:05 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
> Hi,
>
> I have a long numeric vector 'xx' and I want to use sum() to count
> the number of elements that satisfy some criteria like non-zero
> values or values lower than a certain threshold etc...
>
> The problem is: sum() returns an NA (with a warning) if the count
> is greater than 2^31. For example:
>
>   > xx <- runif(3e9)
>   > sum(xx < 0.9)
>   [1] NA
>   Warning message:
>   In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))
>
> This already takes a long time and doing sum(as.numeric(.)) would
> take even longer and require allocation of 24 GB of memory just to
> store an intermediate numeric vector made of 0s and 1s. Plus, having
> to do sum(as.numeric(.)) every time I need to count things is not
> convenient and is easy to forget.
>
> It seems that sum() on a logical vector could be modified to return
> the count as a double when it cannot be represented as an integer.
> Note that length() already does this so that wouldn't create a
> precedent. Also and FWIW prod() avoids the problem by always returning
> a double, whatever the type of the input is (except on a complex
> vector).
>
> I can provide a patch if this change sounds reasonable.
>
> Cheers,
> H.
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


