[R] Improve code efficiency with do.call, rbind and split construction
jun.shen.ut at gmail.com
Fri Sep 2 20:37:26 CEST 2016
This is the best method I have seen this year! do.call and rbind have just gone
to the museum :)
It took ~30 seconds to get the results. You deserve a medal!!!!
On Fri, Sep 2, 2016 at 1:51 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> This is the sort of thing that the dplyr or data.table packages can
> probably do elegantly and efficiently. So you might consider looking
> at them. But as I use neither, let me suggest a base R solution. As
> you supplied no data for a reproducible example, I'll make up my own
> and hopefully I have understood you correctly. If not, maybe someone
> else will get it straight. Anyway...
> The "trick" is to use tapply() to select the necessary row indices of
> your data frame and forget about all the do.call and rbind stuff. e.g.
> > set.seed(1001)
> > df <- data.frame(f = factor(sample(LETTERS[1:4],100,rep=TRUE)),
> + g = factor(sample(letters[1:6],100,rep=TRUE)),
> + y = runif(100))
> > ix <- seq_len(nrow(df))
> > ix <- with(df,tapply(ix,list(f,g),function(x)x[length(x)]))
> > ix
>    a  b   c  d  e  f
> A 94 69 100 59 80 87
> B 89 57  65 90 75 88
> C 85 92  86 95 97 62
> D 47 73  72 74 99 96
> ## ix can now be used as an index into df as:
> This should help somewhat, but you still have to contend with the
> tapply() loop at the interpreted level. I'll leave speed comparisons
> to you.
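The indexing line that followed the `## ix can now be used as an index` comment did not survive in the archive. As a minimal sketch only (not necessarily what was originally posted), one way to use such an index matrix is to flatten it, drop any empty combinations, and subset the data frame. The names `rows` and `last_rows` are made up for illustration; `df`, `f`, `g` and `y` follow the example above.

```r
# Toy data with the same shape as the example above.
set.seed(1001)
df <- data.frame(f = factor(sample(LETTERS[1:4], 100, replace = TRUE)),
                 g = factor(sample(letters[1:6], 100, replace = TRUE)),
                 y = runif(100))

# Row index of the last row within each (f, g) combination.
ix <- with(df, tapply(seq_len(nrow(df)), list(f, g),
                      function(x) x[length(x)]))

# tapply() returns a matrix with NA for combinations that never occur;
# keep the non-missing indices and restore the original row order.
rows <- sort(as.vector(ix[!is.na(ix)]))

# One row per group: the last row of each (f, g) pair.
last_rows <- df[rows, ]
```

Sorting the indices is optional; it only keeps the result in the row order of the original data frame.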
> ## Note: if, in fact, your data frame is arranged in a regular way
> with, e.g. your SID, DOSENO groups all of the same size and together,
> then you can calculate the indices you want directly and skip the
> tapply business. I'm assuming this is not the case... Again, no data...
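A minimal sketch of the "regular layout" case mentioned in the note, under the loud assumption that every (SID, DOSENO) group has exactly the same number of rows and that each group's rows are stored contiguously. The data frame `toy`, the group size `n_per` and the column `y` are hypothetical; only the column names `SID` and `DOSENO` come from the thread.

```r
# Hypothetical regular layout: 3 SIDs x 2 DOSENOs, 5 contiguous rows per group.
n_per <- 5
toy <- data.frame(SID    = rep(1:3, each = 2 * n_per),
                  DOSENO = rep(rep(1:2, each = n_per), times = 3),
                  y      = seq_len(3 * 2 * n_per))

# With equal-sized contiguous groups, the last row of each group sits at
# every n_per-th position, so no grouping step is needed.
last_idx  <- seq(n_per, nrow(toy), by = n_per)
last_rows <- toy[last_idx, ]
```

If the groups differ in size or are interleaved, this shortcut gives wrong rows, so it is worth checking the layout first.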
> Bert Gunter
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> On Fri, Sep 2, 2016 at 10:02 AM, Jun Shen <jun.shen.ut at gmail.com> wrote:
> > Dear list,
> > I have the following line of code to extract the last line of the split
> > data and put them back together.
> > do.call(rbind,lapply(split(simout.s1,simout.s1[c('SID','
> > The problem is that with a huge dataset it takes too long to run
> > (actually it has been > 3 hours and it's still running).
> > The dataset is pretty big. I have 200,000 unique SID and 4 DOSENO, so
> > 800,000 split datasets in total. Is there any way to speed it up? Thanks.
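For illustration, a small sketch of the general split/lapply/rbind pattern described here, next to a vectorised base R alternative using `duplicated(..., fromLast = TRUE)`. This is not a reconstruction of the truncated line above; the data frame `toy`, the column `y`, and the names `slow` and `fast` are hypothetical, and only `SID` and `DOSENO` come from the question.

```r
# Hypothetical toy data: 3 SIDs, 2 DOSENOs, rows of each group interleaved.
toy <- data.frame(SID    = rep(1:3, each = 4),
                  DOSENO = rep(1:2, times = 6),
                  y      = 1:12)

# Split into one data frame per (SID, DOSENO), take the last row of each,
# then bind the pieces back together. Every piece is a separate data frame,
# which is what makes this slow when there are very many groups.
slow <- do.call(rbind, lapply(split(toy, toy[c("SID", "DOSENO")]),
                              function(d) d[nrow(d), ]))

# Vectorised alternative: keep each row that is the last occurrence of its
# (SID, DOSENO) key, with no splitting at all.
fast <- toy[!duplicated(toy[c("SID", "DOSENO")], fromLast = TRUE), ]
```

Both approaches select the same rows; they differ only in row order and row names, since `split()` orders the pieces by group while the `duplicated()` version keeps the original row order.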
> > Jun
> > [[alternative HTML version deleted]]
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/
> > and provide commented, minimal, self-contained, reproducible code.