[Rd] by() processing on a dataframe

Peter Dalgaard p.dalgaard at biostat.ku.dk
Fri Sep 30 19:41:52 CEST 2005


Duncan Murdoch <murdoch at stats.uwo.ca> writes:

> I want to calculate a statistic on a number of subgroups of a dataframe, 
> then put the results into a dataframe.  (What SAS PROC MEANS does, I 
> think, though it's been years since I used it.)
> 
> This is possible using by(), but it seems cumbersome and fragile.  Is 
> there a more straightforward way than this?
> 
> Here's a simple example showing my current strategy:
> 
>  > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4, 
> c(2,2,2,2)), value = rnorm(8))
>  > dataset
>    gp1 gp2      value
> 1   1   1  0.9493232
> 2   1   1 -0.0474712
> 3   1   2 -0.6808021
> 4   1   2  1.9894999
> 5   2   3  2.0154786
> 6   2   3  0.4333056
> 7   2   4 -0.4746228
> 8   2   4  0.6017522
>  >
>  > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
> + gp2 = subset$gp2[1], statistic = mean(subset$value))
>  >
>  > bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
>  >
>  > result <- do.call('rbind', bylist)
>  > result
>     gp1 gp2  statistic
> 1    1   1 0.45092598
> 11   1   2 0.65434890
> 12   2   3 1.22439210
> 13   2   4 0.06356469
> 
> tapply() is inappropriate because I don't have all possible combinations 
> of gp1 and gp2 values, only some of them:
> 
>  > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
>           1         2        3          4
> 1 0.450926 0.6543489       NA         NA
> 2       NA        NA 1.224392 0.06356469
> 
> 
> 
> In the real case, I only have a very sparse subset of all the 
> combinations, and tapply() and by() both die for lack of memory.
> 
> Any suggestions on how to do what I want, without using SAS?

Have you tried aggregate()?

Alternatively, you migth split on interaction(...., drop=TRUE)

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907



More information about the R-devel mailing list