[R] Group by a data frame with multiple columns

arun smartpink111 at yahoo.com
Sun Aug 4 06:07:32 CEST 2013


Hi,

May be you should try ?data.table().  

Please use ?dput(). 

dat1<- read.table(text="
Area Sex Year y
Bob F 2011 1
Bob F 2011 2
Bob F 2012 3
Bob M 2012 3
Bob M 2012 2
Fred F 2011 1
Fred F 2011 1
Fred F 2012 2
Fred M 2012 3
Fred M 2012 1
",sep="",header=TRUE,stringsAsFactors=FALSE)
library(data.table)
 dt2<-dt1[,sum(y),by=list(Area,Sex,Year)]
 dt2
#   Area Sex Year V1
#1:  Bob   F 2011  3
#2:  Bob   F 2012  3
#3:  Bob   M 2012  5
#4: Fred   F 2011  2
#5: Fred   F 2012  2
#6: Fred   M 2012  4


#Speed
set.seed(28)
dat2<- data.frame(Area=sample(LETTERS,1e7,replace=TRUE),Sex=sample(c("F","M"),1e7,replace=TRUE),Year=sample(2005:2012,1e7,replace=TRUE),y=sample(1:10,1e7,replace=TRUE))
system.time(datTest<- aggregate(y~.,data=dat2,sum))
#   user  system elapsed 
# 18.056   1.336  19.424 
datTest2<- datTest[order(datTest$Area,datTest$Sex,datTest$Year),]
row.names(datTest2)<- 1:nrow(datTest2)
dtTest<- data.table(dat2)
 system.time({
 setkey(dtTest,Area,Sex,Year)
dtTest2<- dtTest[,sum(y),by=list(Area,Sex,Year)]})
# user  system elapsed 
#  1.232   0.184   1.418 
 setnames(dtTest2,"V1","y")
identical(datTest2,as.data.frame(dtTest2))
#[1] TRUE


A.K.




----- Original Message -----
From: Michael Liaw <michael.liaw at hotmail.com>
To: r-help at r-project.org
Cc: 
Sent: Saturday, August 3, 2013 8:11 PM
Subject: [R] Group by a data frame with multiple columns

Hi



I'm trying to manipulate a data frame (that has about 10 million rows) rows
by "grouping" it with multiple columns. For example, say the data set looks
like:


Area

Sex

Year

y


Bob

F

2011

1


Bob

F

2011

2


Bob

F

2012

3


Bob

M

2012

3


Bob

M

2012

2


Fred

F

2011

1


Fred

F

2011

1


Fred

F

2012

2


Fred

M

2012

3


Fred

M

2012

1



And I want it to look like


Area

Sex

Year

Sum of y


Bob

F

2011

3


Bob

F

2012

3


Bob

M

2012

5


Fred

F

2011

2


Fred

F

2012

2


Fred

M

2012

4



I think I can use something like:

tmp <- aggregate (y ~ ., sum)



But due to the size it's really taking a strain on the computer (even with
64-bit R on a, yes unfortunately Windows, machine with 16GB RAM :().  The
reason for me wanting the data set to get into this form is I want to then
apply the population information and get the "rate" on the "sum of y" column
then fit a Poisson regression model.



I'm wondering (and would appreciate comments) whether there is a more
efficient way to the process I described? 



Cheers

Michael












    [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




More information about the R-help mailing list