[R] data.frame and formula classes of aggregate

Mon Nov 29 19:01:07 CET 2010

On 2010-11-29 06:35, David Freedman wrote:
>
> Hi - I apologize for the 2nd post, but I think my question from a few weeks
> ago may have been overlooked on a Friday afternoon.
>
> I might be missing something very obvious, but is it widely known that the
> aggregate function handles missing values differently depending on whether
> a data frame or a formula is the first argument?  For example,
>
> (d <- data.frame(sex = rep(0:1, each = 3),
>                  wt = c(100, 110, 120, 200, 210, NA),
>                  ht = c(10, 20, NA, 30, 40, 50)))
> # data.frame method
> x1 <- aggregate(d, by = list(d$sex), FUN = mean)
> names(x1)[3:4] <- c('mean.dfcl.wt', 'mean.dfcl.ht')
> # formula method
> x2 <- aggregate(cbind(wt, ht) ~ sex, FUN = mean, data = d)
> names(x2)[2:3] <- c('mean.formcl.wt', 'mean.formcl.ht')
> # group means from both methods, side by side
> cbind(x1, x2)[, c(2, 3, 6, 4, 7)]
>
> The output from the data.frame method is NA for a variable whenever its
> group contains a missing value for that variable.  But the formula method
> seems to delete the entire row (missing and non-missing values) if it
> contains any NAs.  Wouldn't one expect that the 2 forms (data frame vs
> formula) of aggregate would give the same result?
>

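Wasn't there some discussion of this not long ago? Maybe I'm getting
senile. Anyway, as David W. points out, the defaults differ:
aggregate.formula has na.action = na.omit, which drops every row
containing an NA before aggregating, while the data.frame method
passes the NAs on to FUN. Here's how you can get the same result
from both methods: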

1. use na.action = na.pass in aggregate.formula;
    this will duplicate your x1 result.

2. use d <- d[complete.cases(d), ] in your x1 calculation;
    this will duplicate your x2 result.
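
A minimal sketch of both fixes, using the data frame d from the
original post. The object names f.pass, d.cc and df.cc are made up
here for illustration:

```r
# Example data from the original post
d <- data.frame(sex = rep(0:1, each = 3),
                wt  = c(100, 110, 120, 200, 210, NA),
                ht  = c(10, 20, NA, 30, 40, 50))

# 1. Formula method with na.action = na.pass: rows containing NAs are
#    kept, so mean() returns NA for any group that has an NA in that
#    variable (same values as the data.frame method).
f.pass <- aggregate(cbind(wt, ht) ~ sex, data = d, FUN = mean,
                    na.action = na.pass)

# 2. data.frame method on complete cases only: rows with any NA are
#    dropped first (same values as the default formula method).
#    The grouping variable must come from the filtered data frame too.
d.cc  <- d[complete.cases(d), ]
df.cc <- aggregate(d.cc, by = list(d.cc$sex), FUN = mean)

print(f.pass)
print(df.cc)
```

With this data, f.pass should show NA for ht in group 0 and for wt in
group 1, while df.cc is computed from the four complete rows only.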

Peter Ehlers

> thanks very much
> david freedman, atlanta