[Rd] Fastest non-overlapping binning mean function out there?

Thu Oct 4 22:45:30 CEST 2012

Thank you all - I appreciate all your suggestions.  My post was mainly
to check if there's already an existing and fast implementation out
there.  I've ended up adopting Martin's first proposal.  Something
like this:

#include <Rdefines.h>
#include <R.h>
#include <R_ext/Error.h>

SEXP binMeans(SEXP y, SEXP x, SEXP bx, SEXP retCount) {
  int ny = Rf_length(y), nx = Rf_length(x), nb = Rf_length(bx)-1;
  double *yp = REAL(y), *xp = REAL(x), *bxp = REAL(bx);
  SEXP ans = PROTECT(NEW_NUMERIC(nb));
  double *ansp = REAL(ans);
  int retcount = LOGICAL(retCount)[0];
  SEXP count = NULL;
  int *countp = NULL;
  int i = 0, j = 0, n = 0, iStart=0;
  double sum = 0.0;

  // Assert same lengths of 'x' and 'y'
  if (ny != nx) {
    error("Argument 'y' and 'x' are of different lengths: %d != %d", ny, nx);
  }

  if (retcount) {
    count = PROTECT(NEW_INTEGER(nb));
    countp = INTEGER(count);
  }

  // Skip to the first bin
  while (xp[iStart] < bxp[0]) {
    ++iStart;
  }

  // For each x...
  for (i = iStart; i < nx; ++i) {
    // Skip to a new bin?
    while (xp[i] >= bxp[j+1]) {
      // Update statistic of current bin?
      if (retcount) { countp[j] = n; }
      ansp[j] = n > 0 ? sum / n : 0;
      sum = 0.0;
      n = 0;
      // ...and move to next
      ++j;
    }
    sum += yp[i];
    n += 1;
  }

  // Update statistic of last bin
  ansp[j] = n > 0 ? sum / n : 0;
  if (retcount) {
    countp[j] = n;
    setAttrib(ans, install("count"), count);
    UNPROTECT(1); // 'count'
  }
  UNPROTECT(1); // 'ans'

  return ans;
} // binMeans()

BTW, "10,000-millions" was supposed to read as a range (10^4 to 10^6)
and not a scalar (10^10), but the misinterpretation got Martin to
point out some of the exciting work extending R core to support "very
large" integers.  And, as Herve pointed out, I forgot to say that the
number of bins could be of that range as well, or maybe an order of
magnitude less (say ~1-100 data points per bin).

/Henrik

On Wed, Oct 3, 2012 at 10:50 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 10/3/2012 6:47 AM, Martin Morgan wrote:
>>
>> On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
>>>
>>> Hi,
>>>
>>> I'm looking for a super-duper fast mean/sum binning implementation
>>> available in R, and before implementing z = binnedMeans(x y) in native
>>> code myself, does any one know of an existing function/package for
>>> this?  I'm sure it already exists.  So, given data (x,y) and B bins
>>> bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the
>>> binned means (or sums) 'z' such that z[1] = mean(x[bx[1] <= x & x <
>>> bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]), .... z[B].  Let's
>>> assume there are no missing values and 'x' and 'bx' is already
>>> ordered.  The length of 'x' is in the order of 10,000-millions.  The
>>> number of elements in each bin vary.
>>
>>
>> since x and bx are ordered (sorting x would be expensive), the C code
>> seems
>> pretty straight-forward and memory-efficient -- create a result vector as
>> long
>> as bx, then iterate through x accumulating n and it's sum until x[i] >=
>> bx[i].
>> (I think R's implementation of mean() actually pays more attention to
>> numerical
>> error, but with this much data...)
>>
>> library(inline)
>> binmean <- cfunction(signature(x="numeric", bx="numeric"),
>> "   int nx = Rf_length(x), nb = Rf_length(bx), i, j, n;
>
>
> I'll take my solution back. The problem specification says that x has
> 10,000-millions of elements, so we need to use R-devel and
>
>     R_xlen_t nx = Rf_xlength(x), nb = Rf_xlength(bx), i, j, n;
>
> but there are two further issues. The first is that on my system
>
> p$ gcc --version
> gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
>
> I have __SIZEOF_SIZE_T__ 8 but
>
> (a) the test in Rinternals.h:52 is of SIZEOF_SIZE_T, which is undefined. I
> end up with typedef int R_xlen_t (e.g., after R CMD SHLIB, instead of using
> the inline package, to avoid that level of uncertainty) and then
>
>
>>      SEXP ans = PROTECT(NEW_NUMERIC(nb));
>>      double sum, *xp = REAL(x), *bxp = REAL(bx), *ansp = REAL(ans);
>>      sum = j = n = 0;
>>      for (i = 0; i < nx; ++i) {
>
>
> (b) because nx is a signed type, and since nx > .Machine$integer.max is
> represented as a negative number, I don't ever iterate this loop. So I'd
> have to be more clever if I wanted this to work on all platforms.
>
> For what it's worth, Herve's solution is also problematic
>
>> xx = findInterval(bx, x)
> Error in findInterval(bx, x) : long vector 'vec' is not supported
>
> A different strategy for the problem at hand would seem to involve iteration
> over sequences of x, collecting sufficient statistics (n, sum) for each
> iteration, and calculating the mean at the end of the day. This might also
> result in better memory use and allow parallel processing.
>
> Martin
>
>
>>          while (xp[i] >= bxp[j]) {
>>               ansp[j++] = n > 0 ? sum / n : 0;
>>               sum = n = 0;
>>          }
>>          n += 1;
>>          sum += xp[i];
>>      }
>>      ansp[j] = n > 0 ? sum / n : 0;
>>      UNPROTECT(1);
>>      return ans;
>> ")
>>
>> with a test case
>>
>> nx <- 4e7
>> nb <- 1e3
>> x <- sort(runif(nx))
>> bx <- do.call(seq, c(as.list(range(x)), length.out=nb))
>>
>> leading to
>>
>>  > bx1 <- c(bx[-1], bx[nb] + 1)
>>  > system.time(res <- binmean(x, bx1))
>>     user  system elapsed
>>    0.052   0.000   0.050
>>
>> Martin
>>
>>>
>>> Thanks,
>>>
>>> Henrik
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>>
>
>
> --
> Dr. Martin Morgan, PhD
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109