[Rd] A suggestion for an amendment to tapply

Prof Brian Ripley ripley at stats.ox.ac.uk
Tue Nov 6 08:23:56 CET 2007


On Tue, 6 Nov 2007, Bill.Venables at csiro.au wrote:

> Unfortunately I think it would break too much existing code.  tapply()
> is an old function and many people have gotten used to the way it works
> now.

It is also not necessarily desirable: FUN(numeric(0)) might be an error.
For example:

> Z <- data.frame(x=rnorm(10), f=rep(c("a", "b"), each=5))[1:5, ]
> tapply(Z$x, Z$f, sd)

but sd(numeric(0)) is an error.  (Similar things involving var are 'in the 
wild' and so would be broken.)
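
A self-contained version of the example above, as a sketch only. It assumes f is built explicitly with factor(), so that the empty level "b" survives the subsetting whatever the stringsAsFactors default is; the seed and the names res and empty_sd are illustrative. Whether sd(numeric(0)) signals an error may depend on the R version, so the code tests for it rather than assuming it.

```r
# Assumption: factor() is called explicitly so that level "b" is kept
# after subsetting, independent of the stringsAsFactors default.
set.seed(1)
Z <- data.frame(x = rnorm(10),
                f = factor(rep(c("a", "b"), each = 5)))[1:5, ]

table(Z$f)                      # level "b" is still present, with zero rows

# tapply() only calls FUN on the non-empty groups; the cell for the
# empty level "b" is filled with NA.
res <- tapply(Z$x, Z$f, sd)
is.na(res)

# What FUN does with a zero-length vector depends on FUN:
sum(numeric(0))                 # 0
empty_sd <- tryCatch(sd(numeric(0)), error = function(e) e)
inherits(empty_sd, "error")     # TRUE where sd(numeric(0)) is an error
```

Calling FUN on every empty cell would therefore turn a quiet NA into an error for any FUN that rejects empty input.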

> This is not to suggest there could not be another argument added at the
> end to indicate that you want the new behaviour, though.  e.g.
>
> tapply <- function (X, INDEX, FUN=NULL, ..., simplify=TRUE,
> handle.empty.levels = FALSE)
>
> but this raises the question of what sort of time penalty the
> modification might entail.  Probably not much for most situations, I
> suppose.  (I know this argument name looks long, but you do need a
> fairly specific argument name, or it will start to impinge on the ...
> argument.)
>
> Just some thoughts.
>
> Bill Venables.
>
> -----Original Message-----
> From: r-devel-bounces at r-project.org
> [mailto:r-devel-bounces at r-project.org] On Behalf Of Andrew Robinson
> Sent: Tuesday, 6 November 2007 3:10 PM
> To: R-Devel
> Subject: [Rd] A suggestion for an amendment to tapply
>
> Dear R-developers,
>
> when tapply() is invoked on factors that have empty levels, it returns
> NA for those levels.  This behaviour is in accord with the tapply
> documentation, and is
> reasonable in many cases.  However, when FUN is sum, it would also
> seem reasonable to return 0 instead of NA, because "the sum of an
> empty set is zero, by definition."
>
> I'd like to raise a discussion of the possibility of an amendment to
> tapply.
>
> The attached patch changes the function so that it checks if there are
> any empty levels, and if there are, replaces the corresponding NA
> values with the result of applying FUN to the empty set.  E.g., in the
> case of sum, it replaces the NA with 0, whereas with mean, it replaces
> the NA with NA, and issues a warning.
>
> This change has the following advantage: tapply and sum work better
> together.  Arguably, tapply and any other function that has a non-NA
> response to the empty set will also work better together.
> Furthermore, tapply shows a warning if FUN would normally show a
> warning upon being evaluated on an empty set.  That deviates from
> current behaviour, which might be bad, but also provides information
> that might be useful to the user, so that would be good.
>
> The attached script provides the new function in full, and
> demonstrates its application in some simple test cases.
>
> Best wishes,
>
> Andrew
>
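
The opt-in behaviour discussed in the quoted messages can be sketched as a wrapper around the existing tapply(). This is not the attached patch, which is not reproduced here, and it is not part of base R: the name tapply_empty is hypothetical, the argument name handle.empty.levels is taken from the suggestion quoted above, and the sketch assumes FUN returns a single value per group.

```r
# Hypothetical wrapper; a sketch under the assumptions stated above.
tapply_empty <- function(X, INDEX, FUN, ..., handle.empty.levels = FALSE) {
  FUN <- match.fun(FUN)
  res <- tapply(X, INDEX, FUN, ...)
  if (!handle.empty.levels) return(res)    # default: unchanged behaviour

  # length() never returns NA, so an NA count marks an empty cell.
  counts <- tapply(X, INDEX, length)
  empty <- is.na(counts)
  if (any(empty)) {
    # Evaluate FUN once on a zero-length vector of the same type as X.
    # Any error or warning FUN raises here propagates to the caller.
    res[empty] <- FUN(X[0], ...)
  }
  res
}

x <- c(1, 2, 3)
f <- factor(c("a", "a", "c"), levels = c("a", "b", "c"))
old <- tapply_empty(x, f, sum)                              # "b" is NA
new <- tapply_empty(x, f, sum, handle.empty.levels = TRUE)  # "b" is 0
```

Because the new argument comes after ... and defaults to FALSE, existing calls are unaffected, and a FUN that fails on empty input only fails for callers who ask for the new behaviour.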

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


