[R] Aggregate with numerous factors

Peter Dalgaard P.Dalgaard at biostat.ku.dk
Mon Dec 18 11:55:10 CET 2006


Joachim Claudet wrote:
> Dear list members,
>
> I am facing some problems using the aggregate() function.
> I want to calculate the sum and the mean of one variable over the
> combinations of 12 factors, using aggregate() to avoid loops, but it
> doesn't work (or the job takes far too long: it exceeds 2 hours). It
> works with fewer factors, so I constructed a single factor from the
> level combinations of 7 of the factors (I need the other ones to stay on
> their own). That left me with 6 factors, but it still doesn't work.
> Could someone tell me how to fix the problem, or suggest another
> function I could use?
> Thank you very much,
> Joachim Claudet.
>
>   
aggregate() is (currently) a wrapper for tapply(), so it generates a table
which is indexed by the cartesian product of all the factors. If many cells
are empty, you can reduce the work by calculating the interaction factor up
front and removing levels that are not present in the data. This is pretty
much the idea you already had, unless you forgot the bit about removing
unused levels. You could potentially extend the idea to all 12 factors, and
then extract the ones you want "on their own" from the result.
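
To make the interaction-factor idea concrete, here is a minimal sketch on
made-up data. The data frame d and its columns f1, f2, f3 and y are
invented for illustration; with 12 real factors the same calls should
apply, but this has only been thought through on the toy case.

```r
# Toy data (hypothetical): 3 * 2 * 2 = 12 possible cells, only 4 observed.
d <- data.frame(
  f1 = factor(c("a", "a", "b", "b", "c", "c")),
  f2 = factor(c("x", "x", "y", "y", "x", "x")),
  f3 = factor(c("u", "u", "u", "v", "v", "v")),
  y  = c(1, 2, 3, 4, 5, 6)
)

# One combined factor; drop = TRUE discards combinations that never occur,
# so tapply() only has to fill the observed cells.
cell <- interaction(d$f1, d$f2, d$f3, drop = TRUE)

sums  <- tapply(d$y, cell, sum)
means <- tapply(d$y, cell, mean)

# Recover the individual factors: take the first data row of each cell,
# in the same order as levels(cell), which is the order tapply() uses.
first <- match(levels(cell), cell)
res <- data.frame(d[first, c("f1", "f2", "f3")],
                  sum  = as.vector(sums),
                  mean = as.vector(means),
                  row.names = NULL)
res
```

The result has one row per observed combination, with the original
factors back "on their own" as separate columns.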

Alternatively, rewrite aggregate() and send us a patch ;-)

It is not necessarily all that hard. Here's a rough idea

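IX <- as.data.frame(by)
OO <- do.call(order,IX)         # row order that sorts the by-variables
IX <- IX[OO,,drop=FALSE]        # sorted keys: equal rows are now adjacent
Y <- x[OO,]                     # the single-column x, in the same order
new <- !duplicated(IX)          # TRUE at the first row of each group
g <- cumsum(new)                # running group number
FF <- IX[new,,drop=FALSE]       # one row of keys per group
cbind(FF, sapply(split(Y,g),FUN))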

(completely untested, of course, and if it works, it works only for a
single-column x; otherwise, you need a loop over the columns somehow.)
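
For what it is worth, the loop over the columns might look something like
the sketch below. The name aggregate_sorted is hypothetical (it is not part
of R), the toy data are invented, and the sketch assumes that FUN returns a
single number for each group.

```r
# Hypothetical helper: sort by the grouping variables, number the groups,
# then apply FUN to each column of x within each group.
aggregate_sorted <- function(x, by, FUN, ...) {
  x  <- as.data.frame(x)
  IX <- as.data.frame(by)
  OO <- do.call(order, unname(as.list(IX)))  # order on all by-variables
  IX <- IX[OO, , drop = FALSE]
  x  <- x[OO, , drop = FALSE]
  first <- !duplicated(IX)                   # first row of each group
  g <- cumsum(first)                         # group number for every row
  out <- IX[first, , drop = FALSE]           # one row of keys per group
  for (nm in names(x)) {
    # Assumption: FUN yields exactly one numeric value per group.
    out[[nm]] <- unname(vapply(split(x[[nm]], g), FUN, numeric(1), ...))
  }
  rownames(out) <- NULL
  out
}

# Toy data (hypothetical), with only 4 of 12 possible combinations present.
d <- data.frame(
  f1 = factor(c("a", "a", "b", "b", "c", "c")),
  f2 = factor(c("x", "x", "y", "y", "x", "x")),
  f3 = factor(c("u", "u", "u", "v", "v", "v")),
  y  = c(1, 2, 3, 4, 5, 6)
)

res_sum  <- aggregate_sorted(d["y"], d[c("f1", "f2", "f3")], sum)
res_mean <- aggregate_sorted(d["y"], d[c("f1", "f2", "f3")], mean)
```

Only the combinations that actually occur get a row, so the work grows with
the number of observations rather than with the product of the numbers of
levels.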

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907


