[R] collapsing a data frame

Ben Bolker bolker at ufl.edu
Fri Oct 12 19:14:31 CEST 2007


   Trying to find a quick/slick/easily interpretable way to
collapse a data set.

  Suppose I have a data set that looks like this:

h <- structure(list(INDEX = structure(1:6, .Label = c("1", "2", "3", 
"4", "5", "6"), class = "factor"), TICKS = c(0, 0, 0, 0, 0, 3
), BROOD = structure(c(1L, 1L, 2L, 3L, 3L, 3L), .Label = c("501", 
"502", "503"), class = "factor"), HEIGHT = c(465, 465, 472, 475, 
475, 475), YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("95", 
"96", "97"), class = "factor"), LOCATION = structure(c(1L, 1L, 
2L, 3L, 3L, 3L), .Label = c("32", "36", "37"), class = "factor")),
.Names = c("INDEX", "TICKS", "BROOD", "HEIGHT", "YEAR", "LOCATION"),
row.names = c(NA, 6L), class = "data.frame")

i.e., 
> h
  INDEX TICKS BROOD HEIGHT YEAR LOCATION
1     1     0   501    465   95       32
2     2     0   501    465   95       32
3     3     0   502    472   95       36
4     4     0   503    475   95       37
5     5     0   503    475   95       37
6     6     3   503    475   95       37

I want a data set that looks like this:
  BROOD TICKS.mean HEIGHT YEAR LOCATION
    501          0    465   95       32
    502          0    472   95       36
    503          1    475   95       37

(for example).  I.e., I want to collapse it to one row per brood,
taking the mean of TICKS and reducing each of the other variables
to a single value per brood.  (It would be nice to allow multiple
summary statistics, e.g. TICKS.mean and TICKS.sd.)
In some ways, this is the opposite of a database join/merge
operation -- I want to collapse the data frame back down.
If I had the "unmerged" (i.e., the brood table) handy I could
use it.
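
  Absent that table, one could rebuild it from h itself.  A minimal
sketch, assuming HEIGHT, YEAR and LOCATION really are constant within
each brood (brood_tab, ticks and collapsed are names made up for this
illustration; h is rebuilt compactly so the snippet stands alone):

```r
# Same data as h above, rebuilt with data.frame() instead of structure()
h <- data.frame(
  INDEX    = factor(1:6),
  TICKS    = c(0, 0, 0, 0, 0, 3),
  BROOD    = factor(c(501, 501, 502, 503, 503, 503)),
  HEIGHT   = c(465, 465, 472, 475, 475, 475),
  YEAR     = factor(rep(95, 6), levels = c(95, 96, 97)),
  LOCATION = factor(c(32, 32, 36, 37, 37, 37))
)

# Assumption: these columns never vary within a brood, so unique()
# recovers the "unmerged" brood table (one row per brood)
brood_tab <- unique(h[c("BROOD", "HEIGHT", "YEAR", "LOCATION")])

# Mean TICKS per brood
ticks <- aggregate(TICKS ~ BROOD, data = h, FUN = mean)
names(ticks)[2] <- "TICKS.mean"

# Join the per-brood means back onto the brood table
collapsed <- merge(ticks, brood_tab, by = "BROOD")
collapsed
```

If one of those columns did vary within a brood, unique() would return
more than one row for that brood, so checking nrow(brood_tab) against
nlevels(h$BROOD) is a cheap sanity test.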

  I know I can construct this table a bit at a time,
 using tapply() or by()  or aggregate() to get the means.
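
  For concreteness, the piecewise route might look like the sketch
below (tick_mean, tick_sd, agg and agg2 are illustrative names; h is
rebuilt compactly so the snippet stands alone).  The last step shows
one way to get several summary statistics from a single aggregate()
call:

```r
# Same data as h above, rebuilt with data.frame() instead of structure()
h <- data.frame(
  INDEX    = factor(1:6),
  TICKS    = c(0, 0, 0, 0, 0, 3),
  BROOD    = factor(c(501, 501, 502, 503, 503, 503)),
  HEIGHT   = c(465, 465, 472, 475, 475, 475),
  YEAR     = factor(rep(95, 6), levels = c(95, 96, 97)),
  LOCATION = factor(c(32, 32, 36, 37, 37, 37))
)

# tapply() returns one value per brood, named by brood level
tick_mean <- tapply(h$TICKS, h$BROOD, mean)
tick_sd   <- tapply(h$TICKS, h$BROOD, sd)   # NA for a brood with one row

# aggregate() returns a data frame instead
agg <- aggregate(h["TICKS"], by = list(BROOD = h$BROOD), FUN = mean)

# A function returning a named vector gives several statistics at once;
# aggregate() stores them as a matrix column, which do.call(data.frame, ...)
# flattens into separate columns TICKS.mean and TICKS.sd
agg2 <- aggregate(TICKS ~ BROOD, data = h,
                  FUN = function(x) c(mean = mean(x), sd = sd(x)))
agg2 <- do.call(data.frame, agg2)
agg2
```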

  Here's a solution that takes the first element of each factor 
and the mean of each numeric variable.  I can imagine there
are more general/flexible solutions.  (One might want to 
specify more than one summary function, or specify that
factors that vary within group should be dropped.)

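vtype = sapply(h,class)  ## variable types [numeric or factor]
vtypes = unique(vtype)   ## possible types, in order of first appearance
v2 = lapply(vtypes,function(z) which(vtype==z))  ## column indices of each type
cfuns = list(factor=function(z)z[1],numeric=mean)## functions to apply
## index cfuns by type name so that each function is paired with the
## right columns whatever the column order
m = mapply(function(w,f) { aggregate(h[w],list(h$BROOD),f) },
  v2,cfuns[vtypes],SIMPLIFY=FALSE)
data.frame(m[[1]],m[[2]][-1])  ## drop the duplicated grouping column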
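
  A rough sketch of what the more flexible version might look like.
collapse_by() and its arguments are hypothetical, written only for this
example and not taken from any package.  It assumes every column is
either numeric or factor-like, applies each supplied summary function
to every numeric column, and drops non-numeric columns that vary
within a group (h is rebuilt compactly so the snippet stands alone):

```r
# Same data as h above, rebuilt with data.frame() instead of structure()
h <- data.frame(
  INDEX    = factor(1:6),
  TICKS    = c(0, 0, 0, 0, 0, 3),
  BROOD    = factor(c(501, 501, 502, 503, 503, 503)),
  HEIGHT   = c(465, 465, 472, 475, 475, 475),
  YEAR     = factor(rep(95, 6), levels = c(95, 96, 97)),
  LOCATION = factor(c(32, 32, 36, 37, 37, 37))
)

# Hypothetical helper: collapse d to one row per level of d[[group]]
collapse_by <- function(d, group, num_funs = list(mean = mean)) {
  g   <- factor(d[[group]])   # re-factoring drops unused levels
  out <- data.frame(levels(g), stringsAsFactors = FALSE)
  names(out) <- group
  for (v in setdiff(names(d), group)) {
    x <- d[[v]]
    if (is.numeric(x)) {
      # one output column per summary function, e.g. TICKS.mean, TICKS.sd
      for (f in names(num_funs)) {
        out[[paste(v, f, sep = ".")]] <- as.vector(tapply(x, g, num_funs[[f]]))
      }
    } else {
      # keep a non-numeric column only if it is constant within every group
      x <- as.character(x)
      n_distinct <- tapply(x, g, function(z) length(unique(z)))
      if (all(n_distinct == 1)) {
        out[[v]] <- as.vector(tapply(x, g, function(z) z[1]))
      }
    }
  }
  out
}

res <- collapse_by(h, "BROOD", num_funs = list(mean = mean, sd = sd))
res
```

Here INDEX is dropped because it varies within brood, and the kept
factor columns come back as character vectors.  HEIGHT gets the same
treatment as TICKS (HEIGHT.mean, HEIGHT.sd), which may or may not be
what one wants.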

  My question is whether this is re-inventing the wheel.  Is there
some function or package that performs this task?

  cheers
    Ben Bolker

