[R] Improve code efficiency with do.call, rbind and split construction

Fri Sep 2 19:51:11 CEST 2016

This is the sort of thing that dplyr or the data.table packages can
probably do elegantly and efficiently. So you might consider looking
at them. But as I use neither, let me suggest a base R solution. As
you supplied no data for a reproducible example, I'll make up my own
and hopefully I have understood you correctly. If not, maybe someone
else will get it straight. Anyway...

The "trick" is to use tapply() to select the necessary row indices of
your data frame and forget about all the do.call and rbind stuff. e.g.

> set.seed(1001)
> df <- data.frame(f = factor(sample(LETTERS[1:4], 100, replace = TRUE)),
+                  g = factor(sample(letters[1:6], 100, replace = TRUE)),
+                  y = runif(100))
>
> ix <- seq_len(nrow(df))
>
> ix <- with(df,tapply(ix,list(f,g),function(x)x[length(x)]))
> ix
   a  b   c  d  e  f
A 94 69 100 59 80 87
B 89 57  65 90 75 88
C 85 92  86 95 97 62
D 47 73  72 74 99 96

## ix can now be used as an index into df (an empty f/g combination
## would appear as NA in ix, and hence as an all-NA row):
> df[ix,]

This should help somewhat, but you still have to contend with the
tapply() loop at the interpreted level. I'll leave speed comparisons
to you.
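
For what it's worth, here is a minimal sketch of how the same idea might
look with the column names from the original question. The data below
are made up (a small stand-in for simout.s1 with assumed SID, DOSENO and
y columns), and the names rows, last and res are only for illustration.

```r
# Made-up stand-in for simout.s1: 5 subjects, 4 doses, unequal group sizes
set.seed(42)
simout.s1 <- data.frame(SID    = sample(1:5, 60, replace = TRUE),
                        DOSENO = sample(1:4, 60, replace = TRUE),
                        y      = runif(60))

rows <- seq_len(nrow(simout.s1))

# Last row index within each SID x DOSENO cell; NA where a cell is empty
last <- with(simout.s1,
             tapply(rows, list(SID, DOSENO), function(x) x[length(x)]))

# Drop empty cells and keep the rows in their original order
last <- sort(last[!is.na(last)])

res <- simout.s1[last, ]
```

Only integer indices are split here and the data frame is subset once at
the end, instead of building one small data frame per group and then
rbind-ing them all back together.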

Cheers,
Bert

## Note: if, in fact, your data frame is arranged in a regular way
with, e.g. your SID, DOSENO groups all of the same size and together,
then you can calculate the indices you want directly and skip the
tapply business. I'm assuming this is not the case... Again, no data...
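
To illustrate that note, here is a rough sketch under the assumption
that every SID/DOSENO group has the same number of rows and that each
group's rows are stored together. The toy data frame df2 and the group
size n_per_group are invented for the example.

```r
n_per_group <- 3   # assumed: every group has exactly this many rows

# Toy data: 4 SIDs x 2 DOSENOs, each group's rows stored contiguously
df2 <- data.frame(SID    = rep(1:4, each = 2 * n_per_group),
                  DOSENO = rep(rep(1:2, each = n_per_group), times = 4),
                  y      = seq_len(4 * 2 * n_per_group))

# The last row of each group then sits at n, 2n, 3n, ...
last <- seq(n_per_group, nrow(df2), by = n_per_group)

res2 <- df2[last, ]
```

This is fully vectorized, but it silently gives wrong rows if the
equal-size or contiguity assumption fails, so it is worth checking that
assumption on the real data first.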

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Fri, Sep 2, 2016 at 10:02 AM, Jun Shen <jun.shen.ut at gmail.com> wrote:
> Dear list,
>
> I have the following line of code to extract the last line of the split
> data and put them back together.
>
> do.call(rbind,lapply(split(simout.s1,simout.s1[c('SID','DOSENO')]),function(x)x[nrow(x),]))
>
> The problem is that when I have a huge dataset, it takes too long to run
> (actually it's > 3 hours and it's still running).
>
> The dataset is pretty big. I have 200,000 unique SID and 4 DOSENO, so
> 800,000 split datasets in total. Is there any way to speed it up? Thanks.
>
> Jun