[R] How to do aggregate operations with non-scalar functions

Itay Furman itayf at u.washington.edu
Thu Apr 7 07:18:34 CEST 2005


On Tue, 5 Apr 2005, Gabor Grothendieck wrote:

> On Apr 5, 2005 6:59 PM, Itay Furman <itayf at u.washington.edu> wrote:
>>
>> Hi,
>>
>> I have a data set, the structure of which is something like this:
>>
>>> a <- rep(c("a", "b"), c(6,6))
>>> x <- rep(c("x", "y", "z"), c(4,4,4))
>>> df <- data.frame(a=a, x=x, r=rnorm(12))
>>
>> The true data set has >1 million rows. The factors "a" and "x"
>> have about 70 levels each; combined together they subset 'df'
>> into ~900 data frames.
>> For each such subset I'd like to compute various statistics
>> including quantiles, but I can't find an efficient way of

[snip]

>> I would like to end up with a data frame like this:
>>
>>   a x         0%        25%
>> 1 a x -0.7727268  0.1693188
>> 2 a y -0.3410671  0.1566322
>> 3 b y -0.2914710 -0.2677410
>> 4 b z -0.8502875 -0.6505710

[snip]

> One can use
>
> 	do.call("rbind", by(df, list(a = a, x = x), f))
>
> where f is the appropriate function.
>
> In this case f can be described in terms of df.quantile which
> is like quantile except it returns a one row data frame:
>
> 	df.quantile <- function(x,p)
> 		as.data.frame(t(data.matrix(quantile(x, p))))
>
> 	f <- function(df, p = c(0.25, 0.5))
> 		cbind(df[1,1:2], df.quantile(df[,"r"], p))
>

Thanks!  Just what I wanted.

A minor point is that for some reason the row numbers in the 
final data frame are not sequential (see below -- this is not a 
consequence of my changes).
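
A likely explanation, offered as a sketch rather than a confirmed diagnosis: each piece returned by f starts from x[1,1:2], which keeps the original row name of the first row in that subset, and rbind() carries those names through (1, 5, 7, 9 in the output below). Assuming that is the cause, resetting the row names afterwards restores a sequential index. The seed and the simplified f here are illustrative only.

```r
set.seed(1)  # illustrative seed, not from the original post
a <- rep(c("a", "b"), c(6, 6))
x <- rep(c("x", "y", "z"), c(4, 4, 4))
df <- data.frame(a = a, x = x, r = rnorm(12))

# Simplified stand-in for f: the first row's factor levels plus a subset size.
f <- function(d) cbind(d[1, 1:2], n = nrow(d))

res <- do.call("rbind", by(df, list(a = df$a, x = df$x), f))
rownames(res)          # original row names of each subset's first row

rownames(res) <- NULL  # reset to a sequential index
res
```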

Actually, seeing your code I became greedy and decided to
extract more summary statistics in one go, like this:

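# One-row data frame of summary statistics for a numeric vector x.
df.summary <- function(x, qtils = (0:4)/4)
	cbind(data.frame(mean = mean(x), var = var(x), length = length(x)),
	      as.data.frame(t(data.matrix(quantile(x, qtils)))))

# Prefix the statistics with the subset's factor levels (columns 1:2).
f <- function(x, qtils = (0:4)/4)
	cbind(x[1, 1:2], df.summary(x[, "r"], qtils))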

> do.call("rbind", by(df, list(a = a, x = x), f))
   a x       mean         var length         0%        25%        50%
1 a x  0.2901207 0.522191469      4 -0.7727268  0.1693188  0.5523356
5 a y  0.6543314 1.981636402      2 -0.3410671  0.1566322  0.6543314
7 b y -0.2440109 0.004504928      2 -0.2914710 -0.2677410 -0.2440109
9 b z  0.4523763 1.841469995      4 -0.8502875 -0.6505710  0.4717093
          75%       100%
1  0.6731375  0.8285385
5  1.1520307  1.6497299
7 -0.2202808 -0.1965508
9  1.5746565  1.7163741


What remains a puzzle to me is why R has a native subsetting
function that returns a scalar per subset [aggregate()], another
one that returns a list [by()], but no function that returns a
vector per subset.  Is there less demand for such an operation
(like extracting several summary statistics in one go)?  Is
it less general?  Or technically more difficult to achieve?
I'm just curious.
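
For comparison, one base-R way to get a vector per subset is to split() the values by both factors and sapply() a vector-valued function over the pieces. This is only a sketch: the helper name stats and the seed are made up for illustration, and the result is a numeric matrix with one row per non-empty subset rather than a data frame with the factor columns.

```r
set.seed(1)  # illustrative seed, not from the original post
a <- rep(c("a", "b"), c(6, 6))
x <- rep(c("x", "y", "z"), c(4, 4, 4))
df <- data.frame(a = a, x = x, r = rnorm(12))

# Hypothetical helper: a named numeric vector of statistics for one subset.
stats <- function(v)
	c(mean = mean(v), var = var(v), n = length(v), quantile(v, (0:4)/4))

# drop = TRUE discards the empty a/x combinations.
groups <- split(df$r, list(df$a, df$x), drop = TRUE)

# sapply() returns one column per subset; transpose to one row per subset.
res <- t(sapply(groups, stats))
res
```

The row names of res encode the factor combination (for example "a.x"), so the levels would have to be split out again if separate columns are needed.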

 	Itay

----------------------------------------------------------------
itayf at u.washington.edu  /  +1 (206) 543 9040  /  U of Washington



