[R] Group by a data frame with multiple columns

Sun Aug 4 06:07:32 CEST 2013

Hi,

Maybe you should try data.table; see ?data.table().

Also, please use ?dput() to post your example data.

dat1<- read.table(text="
Area Sex Year y
Bob F 2011 1
Bob F 2011 2
Bob F 2012 3
Bob M 2012 3
Bob M 2012 2
Fred F 2011 1
Fred F 2011 1
Fred F 2012 2
Fred M 2012 3
Fred M 2012 1
",sep="",header=TRUE,stringsAsFactors=FALSE)
library(data.table)
# convert the data frame to a data.table before grouping
dt1<- data.table(dat1)
dt2<- dt1[,sum(y),by=list(Area,Sex,Year)]
dt2
#   Area Sex Year V1
#1:  Bob   F 2011  3
#2:  Bob   F 2012  3
#3:  Bob   M 2012  5
#4: Fred   F 2011  2
#5: Fred   F 2012  2
#6: Fred   M 2012  4

#Speed
set.seed(28)
dat2<- data.frame(Area=sample(LETTERS,1e7,replace=TRUE),
                  Sex=sample(c("F","M"),1e7,replace=TRUE),
                  Year=sample(2005:2012,1e7,replace=TRUE),
                  y=sample(1:10,1e7,replace=TRUE))
system.time(datTest<- aggregate(y~.,data=dat2,sum))
#   user  system elapsed 
# 18.056   1.336  19.424 
datTest2<- datTest[order(datTest$Area,datTest$Sex,datTest$Year),]
row.names(datTest2)<- 1:nrow(datTest2)
dtTest<- data.table(dat2)
system.time({
  setkey(dtTest,Area,Sex,Year)
  dtTest2<- dtTest[,sum(y),by=list(Area,Sex,Year)]
})
# user  system elapsed 
#  1.232   0.184   1.418 
 setnames(dtTest2,"V1","y")
identical(datTest2,as.data.frame(dtTest2))
#[1] TRUE
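
The setnames() step above can be skipped by naming the summed column inside the call. This is only a small sketch on the toy data from this thread; the object names dt1 and res are just for illustration.

```r
library(data.table)

# Same toy data as in the example above
dat1 <- read.table(text = "
Area Sex Year y
Bob F 2011 1
Bob F 2011 2
Bob F 2012 3
Bob M 2012 3
Bob M 2012 2
Fred F 2011 1
Fred F 2011 1
Fred F 2012 2
Fred M 2012 3
Fred M 2012 1
", header = TRUE, stringsAsFactors = FALSE)

dt1 <- as.data.table(dat1)

# Naming the result inside list() gives a column called "y"
# instead of the default "V1", so no setnames() is needed afterwards
res <- dt1[, list(y = sum(y)), by = list(Area, Sex, Year)]
res
```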

A.K.

----- Original Message -----
From: Michael Liaw <michael.liaw at hotmail.com>
To: r-help at r-project.org
Cc: 
Sent: Saturday, August 3, 2013 8:11 PM
Subject: [R] Group by a data frame with multiple columns

Hi

I'm trying to manipulate a data frame (that has about 10 million rows) by
"grouping" it by multiple columns. For example, say the data set looks
like:

Area  Sex  Year  y
Bob   F    2011  1
Bob   F    2011  2
Bob   F    2012  3
Bob   M    2012  3
Bob   M    2012  2
Fred  F    2011  1
Fred  F    2011  1
Fred  F    2012  2
Fred  M    2012  3
Fred  M    2012  1

And I want it to look like

Area  Sex  Year  Sum of y
Bob   F    2011  3
Bob   F    2012  3
Bob   M    2012  5
Fred  F    2011  2
Fred  F    2012  2
Fred  M    2012  4

I think I can use something like:

tmp <- aggregate(y ~ ., data = dat, FUN = sum)  # dat stands for the data frame above

But because of the size of the data, this puts a real strain on the computer
(even with 64-bit R on a machine with 16GB RAM, running Windows,
unfortunately :(). The reason I want the data set in this form is that I
then want to apply the population information to get the "rate" from the
"sum of y" column, and then fit a Poisson regression model.
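
A rough sketch of that last step, in case it helps: the aggregated counts are merged with a population table, a rate is computed, and a Poisson model is fitted with log(population) as an offset. The population figures and the names agg, pop and fit are made up purely for illustration, and the model formula is only an assumption about what might be wanted.

```r
# Aggregated counts, as in the desired output above
agg <- data.frame(
  Area = rep(c("Bob", "Fred"), each = 3),
  Sex  = rep(c("F", "F", "M"), times = 2),
  Year = rep(c(2011, 2012, 2012), times = 2),
  y    = c(3, 3, 5, 2, 2, 4)
)

# HYPOTHETICAL population table: these numbers are invented for the sketch
pop <- agg[, c("Area", "Sex", "Year")]
pop$pop <- c(100, 110, 120, 90, 95, 105)

# Attach the population to each group and compute the crude rate
agg2 <- merge(agg, pop, by = c("Area", "Sex", "Year"))
agg2$rate <- agg2$y / agg2$pop

# Poisson regression on the counts; the offset makes it a model for the rate
fit <- glm(y ~ Area + Sex + offset(log(pop)), family = poisson, data = agg2)
summary(fit)
```

Modelling the counts with an offset, rather than the rate itself, keeps the response as integer counts, which is what the Poisson family expects.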

I'm wondering (and would appreciate comments) whether there is a more
efficient way to do the process I described?

Cheers

Michael


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.