[Rd] RFC: tapply(*, ..., init.value = NA)

Martin Maechler maechler at stat.math.ethz.ch
Fri Jan 27 09:34:09 CET 2017


    > On Jan 26, 2017 07:50, "William Dunlap via R-devel" <r-devel at r-project.org>
    > wrote:

    > It would be cool if the default for tapply's init.value could be
    > FUN(X[0]), so it would be 0 for FUN=sum or FUN=length, TRUE for
    > FUN=all, -Inf for FUN=max, etc.  But that would take time and would
    > break code for which FUN did not work on length-0 objects.

    > Bill Dunlap
    > TIBCO Software
    > wdunlap tibco.com

I had the same idea (after my first post), so I agree that would
be nice. One could argue it would take time only if the user is too lazy
to specify the value,  and we could use 
   tryCatch(FUN(X[0]), error = function(e) NA)
to safeguard against those functions that fail for 0 length arg.
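
To make Bill's idea concrete, here is a minimal sketch, not a proposed
patch.  The helper name 'empty_cell_value' is made up for illustration,
and wrapping the call in suppressWarnings() is my assumption, because
functions such as max() warn on zero-length input.

```r
## Hypothetical helper (not part of base R): derive a fill value for
## empty cells by calling FUN on a zero-length slice of X.
empty_cell_value <- function(X, FUN, ...) {
    FUN <- match.fun(FUN)
    ## The 'error' handler must be a function of the condition object.
    tryCatch(suppressWarnings(FUN(X[0L], ...)),
             error = function(e) NA)
}

x <- c(3, 1, 4)
empty_cell_value(x, sum)                # 0
empty_cell_value(x, length)             # 0
empty_cell_value(c(TRUE, FALSE), all)   # TRUE
empty_cell_value(x, max)                # -Inf

## A function that fails on zero-length input falls back to NA:
picky <- function(v) if (v[1L] > 0) 1 else 0
empty_cell_value(x, picky)              # NA
```

The cost is one extra call of FUN per tapply() call, and only when the
user has not supplied a value.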

But I think the main reason for _not_ setting such a default is
back-compatibility.  In my proposal, the new argument would not
be any change by default and so all current uses of tapply()
would remain unchanged.
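
The back-compatibility point can be illustrated without touching
tapply() itself.  The wrapper below is only a sketch under my own
assumptions: 'tapply_init' is a hypothetical name, and it fills the
empty cells after the fact (using table() to find them) instead of
initializing the array as the patch quoted further down does.

```r
## Hypothetical wrapper, not the actual patch: cells without any
## observations get 'init.value'; with the default NA nothing changes.
tapply_init <- function(X, INDEX, FUN, ..., init.value = NA) {
    if (!is.list(INDEX)) INDEX <- list(INDEX)
    res <- tapply(X, INDEX, FUN, ...)
    ## table() counts the observations per factor combination
    empty <- as.vector(table(INDEX) == 0L)
    res[empty] <- init.value
    res
}

## Same data as in the example of the original post
D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)[-c(1,5), ]
DN <- D2; DN[1,"N"] <- NA

r0  <- with(DN, tapply_init(N, list(n, L), sum, init.value = 0))
rNA <- with(DN, tapply_init(N, list(n, L), sum))
## The default reproduces plain tapply():
identical(rNA, with(DN, tapply(N, list(n, L), FUN = sum)))
```

With init.value = 0 only the cell ["1", "A"] stays NA, because that NA
comes from the data and not from a missing factor combination.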

>>>>> Henrik Bengtsson <henrik.bengtsson at gmail.com>
>>>>>     on Thu, 26 Jan 2017 07:57:08 -0800 writes:

    > On a related note, the storage mode should try to match ans[[1]] (or
    > unlist:ed and) when allocating 'ansmat' to avoid coercion and hence a full
    > copy.

Yes, related indeed; and would fall "in line" with Bill's idea.
OTOH, it could be implemented independently,
by something like

   if(missing(init.value))
     init.value <-
       if(length(ans)) as.vector(NA, mode=storage.mode(ans[[1]]))
       else NA
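
A small stand-alone check of what that expression yields may help; this
is only a sketch, with 'typed_na' a made-up helper name and 'ans'
standing for the list of per-cell results inside tapply().

```r
## Hypothetical helper: an NA whose storage mode matches the first
## per-cell result, so that filling the array needs no coercion.
typed_na <- function(ans) {
    if (length(ans)) as.vector(NA, mode = storage.mode(ans[[1L]]))
    else NA
}

typeof(typed_na(list(1.5, 2)))   # "double"
typeof(typed_na(list(1L)))       # "integer"
typeof(typed_na(list("a")))      # "character"
typeof(typed_na(list()))         # "logical"

## An array initialized this way already has the right storage mode:
storage.mode(array(typed_na(list(1.5, 2)), dim = c(2, 2)))  # "double"
```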

.............

A colleague proposed to use the shorter argument name 'default'
instead of 'init.value', which indeed may be more natural and is
still not too often used as a "non-first" argument in  FUN(.).

Thank you for the constructive feedback!
Martin

    > On Thu, Jan 26, 2017 at 2:42 AM, Martin Maechler
    > <maechler at stat.math.ethz.ch> wrote:
    >> Last week, we talked here about "xtabs(), factors and NAs",
    >> -> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html
    >> 
    >> In the meantime, I've spent several hours on the issue
    >> and also committed changes to R-devel "in two iterations".
    >> 
    >> In the case where there is a *left*-hand side part to the xtabs() formula
    >> (see the help page example using 'esoph'),
    >> it uses  tapply(...,  FUN = sum)   and
    >> I now think there is a missing feature in tapply() there, which
    >> I am proposing to change.
    >> 
    >> Look at a small example:
    >> 
    >>> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)[-c(1,5), ]; xtabs(~., D2)
    >> , , N = 3
    >> 
    >>    L
    >> n   A B C D E F
    >>   1 1 2 0 0 0 0
    >>   2 0 0 1 2 0 0
    >>   3 0 0 0 0 2 2
    >> 
    >>> DN <- D2; DN[1,"N"] <- NA; DN
    >>    n L  N
    >> 2  1 A NA
    >> 3  1 B  3
    >> 4  1 B  3
    >> 6  2 C  3
    >> 7  2 D  3
    >> 8  2 D  3
    >> 9  3 E  3
    >> 10 3 E  3
    >> 11 3 F  3
    >> 12 3 F  3
    >>> with(DN, tapply(N, list(n,L), FUN=sum))
    >>    A  B  C  D  E  F
    >> 1 NA  6 NA NA NA NA
    >> 2 NA NA  3  6 NA NA
    >> 3 NA NA NA NA  6  6
    >>> 
    >> 
    >> and as you can see, the resulting matrix has NAs, all the same
    >> NA_real_, but semantically of two different kinds:
    >> 
    >> 1) at ["1", "A"], the  NA  comes from the NA in 'N'
    >> 2) all other NAs come from the fact that there is no such factor combination
    >> *and* from the fact that tapply() uses
    >> 
    >> array(dim = .., dimnames = ...)
    >> 
    >> i.e., initializes the array with NAs  (see definition of 'array').
    >> 
    >> My proposition is the following patch to  tapply(), adding a new
    >> option 'init.value':
    >> 
    >> -----------------------------------------------------------------------------
    >> 
    >> -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
    >> +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
    >>  {
    >>      FUN <- if (!is.null(FUN)) match.fun(FUN)
    >>      if (!is.list(INDEX)) INDEX <- list(INDEX)
    >> @@ -44,7 +44,7 @@
    >>      index <- as.logical(lengths(ans))  # equivalently, lengths(ans) > 0L
    >>      ans <- lapply(X = ans[index], FUN = FUN, ...)
    >>      if (simplify && all(lengths(ans) == 1L)) {
    >> -       ansmat <- array(dim = extent, dimnames = namelist)
    >> +       ansmat <- array(init.value, dim = extent, dimnames = namelist)
    >>         ans <- unlist(ans, recursive = FALSE)
    >>      } else {
    >>         ansmat <- array(vector("list", prod(extent)),
    >> 
    >> -----------------------------------------------------------------------------
    >> 
    >> With that, I can set the initial value to '0' instead of array's
    >> default of NA :
    >> 
    >>> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))
    >>    A B C D E F
    >> 1 NA 6 0 0 0 0
    >> 2  0 0 3 6 0 0
    >> 3  0 0 0 0 6 6
    >>> 
    >> 
    >> which now has 0 counts and NA  as is desirable to be used inside
    >> xtabs().
    >> 
    >> All fine... and would not be worth a posting to R-devel,
    >> except for this:
    >> 
    >> The change will not be 100% back compatible -- by necessity: any new argument for
    >> tapply() will make that argument name not available to be
    >> specified (via '...') for 'FUN'.  The new function would be
    >> 
    >>> str(tapply)
    >> function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
    >> 
    >> where the '...' are passed to FUN(),  and with the new signature,
    >> 'init.value' then won't be passed to FUN  "anymore" (compared to
    >> R <= 3.3.x).
    >> 
    >> For that reason, we could use   'INIT.VALUE' instead (possibly decreasing
    >> the probability the arg name is used in other functions).
    >> 
    >> 
    >> Opinions?
    >> 
    >> Thank you in advance,
    >> Martin
    >> 
    >> ______________________________________________
    >> R-devel at r-project.org mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-devel
