[R] Fwd: conditionally merging adjacent rows in a data frame

Marek Janad marek.janad at gmail.com
Thu Dec 10 00:41:22 CET 2009


I've also made some comparisons and, in terms of execution time,
sqldf wins. summaryBy is better than aggregate in some specific
situations I have met in practice. I present such a situation below. It
assumes that there are at least two grouping variables, each with a
large number of levels.

n<-100000;
grp1<-sample(1:750, n, replace=T)
grp2<-sample(1:750, n, replace=T)
d<-data.frame(x=rnorm(n), y=rnorm(n), grp1=grp1, grp2=grp2)

# sqldf
library(sqldf)
Rprof('prof');
sqldf("select grp1, grp2, avg(x), avg(y) from d group by grp1, grp2")
Rprof(NULL);
summaryRprof('prof')

#by
#do.call(rbind, by(d, list(d$grp1, d$grp2), function(x) transform(x, x
#= mean(x), y = mean(y))[1,,drop = FALSE ]))

#doBy
library(doBy)
Rprof('prof');
summaryBy(x+y~grp1+grp2, data=d, FUN=c(mean))
Rprof(NULL);
summaryRprof('prof')

#aggregate
Rprof('prof');
aggregate(d, list(d$grp1, d$grp2), function(x)mean(x))
Rprof(NULL);
summaryRprof('prof')
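
For a quick check without the profiler, system.time() reports the
timings directly. Below is a minimal sketch using only base R (aggregate
vs. tapply); the data sizes and the names d0, t.agg and t.tap are made up
for illustration, and sqldf and doBy are left out so that it runs without
extra packages. Absolute timings will of course differ between machines.

```r
set.seed(1)                      # reproducible toy data
n0 <- 20000                      # smaller than above; made up for illustration
d0 <- data.frame(x    = rnorm(n0),
                 grp1 = sample(1:50, n0, replace = TRUE),
                 grp2 = sample(1:50, n0, replace = TRUE))

# Group means of x via aggregate(), formula interface.
t.agg <- system.time(
  a <- aggregate(x ~ grp1 + grp2, data = d0, FUN = mean)
)

# The same via tapply(), which returns a grp1 x grp2 matrix.
t.tap <- system.time(
  m <- tapply(d0$x, list(d0$grp1, d0$grp2), mean)
)

# Elapsed seconds for each approach.
print(c(aggregate = t.agg[["elapsed"]], tapply = t.tap[["elapsed"]]))

# Look up the tapply result for every group that aggregate() returned;
# both approaches should agree on the group means.
m.vals <- m[cbind(as.character(a$grp1), as.character(a$grp2))]
stopifnot(isTRUE(all.equal(a$x, m.vals)))
```

The final stopifnot() is there so that the two timings are only compared
when both calls actually computed the same thing.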



---------- Forwarded message ----------
From: Nikhil Kaza <nikhil.list at gmail.com>
Date: 2009/12/9
Subject: Re: [R] conditionally merging adjacent rows in a data frame
To: Titus von der Malsburg <malsburg at gmail.com>
Cc: r-help at r-project.org



This is great!! sqldf is exactly the kind of thing I was looking for,
also for other things.

I suppose you can speed up both functions 1 and 5 using aggregate and
tapply only once, as was suggested earlier. But it comes at the
expense of readability.
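
A minimal sketch of what "only once" could look like for the aggregate
variant, assuming a data frame with columns roi, dur and x as in the
functions quoted below. The name aggregate.fixations.once and the toy
data are made up for illustration. The mean of x is recovered from a
grouped sum and a grouped count, so a single aggregate call is enough:

```r
# Hypothetical helper, not from the thread: merge runs of adjacent rows
# with the same roi using a single aggregate() call.
aggregate.fixations.once <- function(d) {
  first <- c(TRUE, diff(d$roi) != 0)   # TRUE where a new run of roi starts
  idx   <- cumsum(first)               # run id for every row
  d2    <- d[first, ]                  # keep the first row of each run

  # One grouped pass: sum of dur, sum of x, and number of rows per run.
  s <- aggregate(cbind(dur = d$dur, x = d$x, n = 1), list(idx = idx), sum)

  d2$dur <- s$dur
  d2$x   <- s$x / s$n                  # mean = sum / count
  d2
}

# Toy data, made up for illustration.
d.toy <- data.frame(roi = c(1, 1, 2, 2, 2, 1),
                    dur = c(10, 20, 5, 5, 5, 7),
                    x   = c(1, 3, 2, 4, 6, 9))
res <- aggregate.fixations.once(d.toy)
print(res)
```

Whether this is actually faster than two separate calls would have to be
measured; it also shows the readability cost mentioned above.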

Nikhil

On 9 Dec 2009, at 7:59AM, Titus von der Malsburg wrote:

> On Wed, Dec 9, 2009 at 12:11 AM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>>
>> Here are a couple of solutions.  The first uses by and the second sqldf:
>
> Brilliant!  Now I have a whole collection of solutions.  I did a simple
> performance comparison with a data frame that has 7929 lines.
>
> The results were as follows (loading the appropriate packages is not included in
> the measurements):
>
> times <- c(0.248, 0.551, 41.080, 0.16, 0.190)
> names(times) <- c("aggregate","summaryBy","by+transform","sqldf","tapply")
> barplot(times, log="y", ylab="log(s)")
>
> So sqldf clearly wins, followed by tapply and aggregate.  summaryBy is slower
> than necessary because it computes both mean /and/ sum for both x and dur.
> by+transform presumably suffers from the construction of many intermediate data
> frames.
>
> Are there any canonical places where R recipes are collected?  If so, I would
> write up a summary.
>
> These were the competitors:
>
> # Gary's and Nikhil's aggregate solution:
>
> aggregate.fixations1 <- function(d) {
>
>  idx  <- c(TRUE,diff(d$roi)!=0)
>  d2     <- d[idx,]
>
>  idx  <- cumsum(idx)
>  # [[2]] extracts the aggregated values as a vector; [2] would return a data frame
>  d2$dur <- aggregate(d$dur, list(idx), sum)[[2]]
>  d2$x   <- aggregate(d$x, list(idx), mean)[[2]]
>
>  d2
> }
>
> # Marek's summaryBy:
>
> library(doBy)
>
> aggregate.fixations2 <- function(d) {
>
>  idx  <- c(TRUE,diff(d$roi)!=0)
>  d2     <- d[idx,]
>
>  d$idx  <- cumsum(idx)
>  # store the results in dur and x rather than as a nested data frame in r
>  d2[c("dur", "x")] <- summaryBy(dur+x~idx, data=d, FUN=c(sum, mean))[c("dur.sum", "x.mean")]
>  d2
> }
>
> # Gabor's by+transform solution:
>
> aggregate.fixations3 <- function(d) {
>
>  idx  <- cumsum(c(TRUE,diff(d$roi)!=0))
>
>  d2 <- do.call(rbind, by(d, idx, function(x)
>                transform(x, dur = sum(dur), x = mean(x))[1,,drop = FALSE ]))
>
>  d2
> }
>
> # Gabor's sqldf solution:
>
> library(sqldf)
>
> aggregate.fixations4 <- function(d) {
>
>  idx  <- c(TRUE,diff(d$roi)!=0)
>  d2     <- d[idx,]
>
>  d$idx  <- cumsum(idx)
>  # store the results in dur and x rather than as a nested data frame in r
>  d2[c("dur", "x")] <- sqldf("select sum(dur) dur, avg(x) x from d group by idx")
>
>  d2
> }
>
> # Titus' solution using plain old tapply:
>
> aggregate.fixations5 <- function(d) {
>
>  idx  <- c(TRUE,diff(d$roi)!=0)
>  d2     <- d[idx,]
>
>  idx  <- cumsum(idx)
>  d2$dur <- tapply(d$dur, idx, sum)
>  d2$x <- tapply(d$x, idx, mean)
>
>  d2
> }
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.




-- 
Marek



