[Rd] sum() (and similar methods) should work for zero row data.frames

Gabriel Becker
Sun Oct 18 21:49:00 CEST 2020


Peter et al,

I had the same thought, in particular for any() and all(), which, inasmuch
as they should work on data.frames in the first place (which, to be
perfectly honest, I do find quite debatable myself), should certainly work
on "logical" data.frames if they are going to work on "numeric" ones.
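
To make that concrete, here is a minimal sketch of the extended check Peter
suggests below. It is not the actual base R source and not a patch: the
helper name summary_df_sketch and its error message are made up for
illustration, and it only assumes that the method converts its argument
with as.matrix() and then tests the type of the result, as described in
Martin's report.

```r
# Hypothetical helper, not part of base R. It mimics the type check that
# Summary.data.frame is described as doing, with "&& !is.logical(x)" added.
summary_df_sketch <- function(f, df, na.rm = FALSE) {
    x <- as.matrix(df)
    # Accept numeric, complex and (new) logical matrices; reject the rest.
    if (!is.numeric(x) && !is.complex(x) && !is.logical(x))
        stop("only defined on a data frame with all numeric or logical variables")
    f(x, na.rm = na.rm)
}

# Zero-row data.frame: no error regardless of the type as.matrix() picks.
summary_df_sketch(sum, data.frame(x = numeric(0)))

# Logical data.frames now work with any() and all().
summary_df_sketch(any, data.frame(x = TRUE))
summary_df_sketch(all, data.frame(x = c(TRUE, FALSE)))
```

A character column would still be rejected, since as.matrix() then yields
a character matrix.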

I can volunteer to prepare a patch if Martin (the reporter) does not want to
take a crack at it, and if it is not already being done within R-core.

Best,
~G

On Sun, Oct 18, 2020 at 12:19 AM peter dalgaard <pdalgd using gmail.com> wrote:

> Hmm, yes, this is probably wrong. E.g., we are likely to get
> inconsistencies out of boundary cases like this
>
> > a <- na.omit(airquality)
> > sum(a)
> [1] 37495.3
> > sum(a[FALSE,])
> Error in FUN(X[[i]], ...) :
>   only defined on a data frame with all numeric variables
>
> Or, closer to an actual use case:
>
> > sum(subset(a, Ozone>100))
> [1] 3330.5
> > sum(subset(a, Ozone>200))
> Error in FUN(X[[i]], ...) :
>   only defined on a data frame with all numeric variables
>
>
> However, given that numeric summaries generally treat logicals as 0/1,
> wouldn't it be easiest just to extend the check inside Summary.data.frame
> with "&& !is.logical(x)"?
>
> > sum(as.matrix(a[FALSE,]))
> [1] 0
>
> -pd
>
> > On 17 Oct 2020, at 21:18 , Martin <rdev using mb706.com> wrote:
> >
> > The "Summary" group generics always throw errors for a data.frame with
> > zero rows, for example:
> >> sum(data.frame(x = numeric(0)))
> > #> Error in FUN(X[[i]], ...) :
> > #>   only defined on a data frame with all numeric variables
> > Same behaviour for min, max, any, all, ... . I believe this is
> > inconsistent with what these methods do for other empty objects (vectors,
> > matrices), where the return value is chosen to ensure transitivity:
> > sum(numeric(0)) == 0.
> >
> > The reason for this is that the return type of as.matrix() for empty (no
> > rows or no columns) data.frame objects is always a matrix of type
> > "logical". The Summary method for data.frame, in turn, throws an error when
> > the data.frame, converted to a matrix, is not of numeric type.
> >
> > I suggest two ways that make sum, min, max, ... more consistent. IMHO it
> > would be fitting to implement both of these fixes, because they also make
> > other things more consistent.
> >
> > 1. Make the return type of as.matrix() for zero-row data.frames
> > consistent with the type that would have been returned, had the data.frame
> > had more than zero rows. "as.matrix(data.frame(x = numeric(0)))" should
> > then be numeric, if there is an empty "character" column the return matrix
> > should be a character etc. This would make subsetting by row and conversion
> > to matrix commute (except for row names sometimes):
> >> all.equal(as.matrix(df[rows, , drop = FALSE]), as.matrix(df)[rows, , drop = FALSE])
> > Furthermore, this change would make as.matrix.data.frame obey the
> > documentation, which indicates that the coercion hierarchy is used for the
> > return type.
> >
> > 2. Make the Summary.data.frame method accept data.frames that produce
> > non-numeric matrices. Next to the main focus of this message, I believe it
> > would e.g. be fitting to have any() and all() work on logical data.frame
> > objects. The current behaviour is such that
> >> any(data.frame(x = 1))
> > #> [1] TRUE
> > #> Warning message:
> > #> In any(1, na.rm = FALSE) : coercing argument of type 'double' to logical
> > and
> >> any(data.frame(x = TRUE))
> > #> Error in FUN(X[[i]], ...) :
> > #>   only defined on a data frame with all numeric variables
> > So a numeric data.frame warns about implicit coercion, while a logical
> > data.frame (which would not need coercion) does not work at all.
> >
> > (I feel more strongly about fixing 1. than 2., because I don't know the
> > discussion that led to the behaviour described in 2.)
> >
> > Best,
> > Martin
> >
> > ______________________________________________
> > R-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Office: A 4.23
> Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com
>
