[Rd] Inconsistent handling of data frames in min(), max(), and mean()

Martin Maechler maechler at stat.math.ethz.ch
Fri Aug 22 10:23:45 CEST 2014

>>>>> Gavin Simpson <ucfagls at gmail.com>
>>>>>     on Thu, 21 Aug 2014 12:32:31 -0600 writes:

    > This inconsistency recently came to my attention:
    >> df <- data.frame(A = 1:10, B = rnorm(10)) 
    >> min(df)
    > [1] -1.768958
    >> max(df)
    > [1] 10
    >> mean(df)
    > [1] NA
    > Warning message:
    > In mean.default(df) : argument is not numeric or logical: returning NA

I would tend to agree (:-) that mean() should rather give an error here
(and read on).

    > I recall the times where `mean(df)` would give
    > `colMeans(df)` and this behaviour was deemed
    > inconsistent. 
    > It seems though that the change has removed one
    > inconsistency and replaced it with another. 

The whole idea of removing the mean method for data frames was
that there are many more summary functions, e.g. median, and it
seems wrong to write a data frame method for each of them; so
why provide one for only *some* of them?
So we *did* keep the  Summary.data.frame  group method,
and that's why min(), max(), sum(),.. work  {though sum() will be
slightly slower than colSums()}.
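
To illustrate the point about the Summary group method, here is a minimal
sketch. It assumes an all-numeric data frame; the seed and column values are
made up for the illustration and are not from the original report.

```r
# Illustrative data only; the seed is arbitrary
set.seed(1)
df <- data.frame(A = 1:10, B = rnorm(10))

# These are members of the Summary group generic, so they
# dispatch on the data frame and operate over all its values
mn <- min(df)
mx <- max(df)
s  <- sum(df)
r  <- range(df)

# The results agree with working on the underlying numeric matrix
all.equal(mx, max(as.matrix(df)))
all.equal(s, sum(colSums(df)))
all.equal(r, c(mn, mx))
```
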

When teaching R, the audience should learn to use  apply() or
similar functions, e.g. from the hadleyverse,
because that is the general approach to dealing with matrix-like
objects, which is indeed how I think users should start thinking
of data frames.
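
A short sketch of that column-wise approach, using only base R. The data
frame here is invented for the example, and sapply() / vapply() are just
two of several reasonable choices.

```r
# Illustrative data frame with two numeric columns
df <- data.frame(A = 1:10, B = (1:10)^2)

# Apply any summary function column by column
col_means   <- sapply(df, mean)
col_medians <- vapply(df, median, numeric(1))

# apply() works on the matrix view of the data frame
col_sds <- apply(as.matrix(df), 2, sd)

col_means     # A = 5.5, B = 38.5
col_medians   # A = 5.5, B = 30.5
```

The same pattern works for median(), sd(), var(), etc., without needing a
data frame method for each of them.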

    > Am I missing good reasons why there couldn't be a
    > `mean.data.frame()` method which worked like `max()` etc
    > when given a data frame?
yes, see above.
[ There's no consistent end after that: Why is median() different, why would
 sd(), var(), ... not work ?]

    >  Namely that they return the
    > required statistic *only* when presented with a data frame
    > of all numeric variables? E.g.

    >> df <- data.frame(A = 1:10, B = rnorm(10),
    >>                  C = factor(rep(c("A","B"), each = 5)))
    >> max(df)
    > Error in FUN(X[[1L]], ...) : only defined on a data frame
    > with all numeric variables

    > I would expect `mean(df)` to fail with the same error as
    > for `max(df)` with the new example `df`, but to
    > return the same as `mean(as.matrix(df))` or
    > `mean(colMeans(df))` if given an entirely numeric data
    > frame:

    >> mean(colMeans(df[, 1:2]))
    > [1] 2.78366
    >> mean(as.matrix(df[, 1:2]))
    > [1] 2.78366
    >> mean(df[,1:2])
    > [1] 2.78366
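
The behaviour being asked for above could be sketched roughly as follows.
The function name mean_numeric_df() is hypothetical and is not a method in
base R; the error text simply mirrors the max() message quoted above, and
the seed is arbitrary.

```r
# Hypothetical function, NOT part of base R: mimics how max() treats
# data frames, i.e. works only when every column is numeric (or logical)
mean_numeric_df <- function(x, ...) {
    ok <- vapply(x, function(col) is.numeric(col) || is.logical(col),
                 logical(1))
    if (!all(ok))
        stop("only defined on a data frame with all numeric variables")
    mean(as.matrix(x), ...)
}

set.seed(42)   # arbitrary seed for the illustration
df <- data.frame(A = 1:10, B = rnorm(10),
                 C = factor(rep(c("A", "B"), each = 5)))

m <- mean_numeric_df(df[, 1:2])
all.equal(m, mean(as.matrix(df[, 1:2])))
# Agrees with the mean of column means because the columns
# have equal length and there are no missing values
all.equal(m, mean(colMeans(df[, 1:2])))

# The factor column triggers the error, as max(df) would
msg <- tryCatch(mean_numeric_df(df), error = function(e) conditionMessage(e))
```
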

    > I just can't see the sense in having `mean` work the way
    > it does now?

I agree. It would be better to give an error.
E.g.,  mean.default could start with  

       stop("there is no mean() method for ", class(x)[1], " objects")
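
As a rough sketch of what such a check might look like in practice: the
wrapper strict_mean() below is a hypothetical name used for illustration
only, and the type test is an assumption about which inputs should still
be accepted. It is not the actual source of mean.default.

```r
# Hypothetical wrapper, NOT base R: signals an error instead of
# returning NA with a warning for unsupported objects
strict_mean <- function(x, ...) {
    if (!is.numeric(x) && !is.complex(x) && !is.logical(x))
        stop("there is no mean() method for ", class(x)[1], " objects")
    mean(x, ...)
}

df <- data.frame(A = 1:10, B = (1:10)^2)

# A numeric vector is handled as usual
v <- strict_mean(df$A)

# A data frame is a list, so it is rejected with an error
msg <- tryCatch(strict_mean(df), error = function(e) conditionMessage(e))
msg
```
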

    > Thanks,
    > Gavin

    > -- 

    > Gavin Simpson, PhD

    > 	[[alternative HTML version deleted]]
 ( hmmm... and that on R-devel ... )
