[R] Avoiding for-loop for splitting vector into subvectors based on positions

Wed May 5 21:58:53 CEST 2010

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Joris Meys
> Sent: Tuesday, May 04, 2010 2:02 PM
> To: jim holtman
> Cc: R mailing list
> Subject: Re: [R] Avoiding for-loop for splitting vector into 
> subvectors based on positions
> 
> Thanks, works nicely. I have to do some clocking to see how much the
> improvement is, but I surely learnt again.
> 
> Attentive readers might have noticed my initial code contains an error.
> tmp <- x[pos2[i]:pos2[i+1]]
> should be:
> tmp <- x[pos2[i]:(pos2[i+1]-1)]
> of course...

I think you also wanted your for loop to run
along 1:length(pos) instead of 1:length(x).
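
A small sketch of why the loop bound matters, using the x and pos
from the original post together with the corrected end index. The
names out and bad are only for illustration.

```r
x <- 1:10
pos <- c(1, 4, 7)
pos2 <- c(pos, length(x) + 1)   # 1 4 7 11

# Loop over the groups, not over the elements of x.
out <- integer(length(pos))
for (i in seq_along(pos)) {
    tmp <- x[pos2[i]:(pos2[i + 1] - 1)]
    out[i] <- length(tmp)
}
out   # 3 3 4

# With 1:length(x), i would reach 4, where pos2[5] is NA,
# and the colon operator stops with an error.
bad <- try(pos2[4]:(pos2[5] - 1), silent = TRUE)
inherits(bad, "try-error")   # TRUE
```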

Your subject line asked how to avoid a for loop
but you seem to be interested in how to make
your function run quickly.  These are different
questions.

The following test functions seem to show that
your time (and probably memory) problems arise
from growing a dataset:
   out <- c()
   for(i in 1:length(pos)) {
        ...
        out<-c(out, length(tmp))
   }
instead of preallocating it and inserting into it:
   out <- numeric(length(pos)) # or integer or list or ... ?
   for(i in 1:length(pos)) {
        ...
        out[i] <- length(tmp)
   }
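
The two patterns above can be compared in isolation with a sketch
like the following. The function names grow and prealloc and the
size n are made up for illustration, and the timings will depend on
the machine and the R version, so none are quoted here.

```r
n <- 20000L

grow <- function(n) {
    out <- c()
    for (i in seq_len(n)) out <- c(out, i)   # builds a new, longer vector each pass
    out
}

prealloc <- function(n) {
    out <- integer(n)                        # allocate the full result once
    for (i in seq_len(n)) out[i] <- i        # fill in the slots
    out
}

tGrow <- system.time(g <- grow(n))
tPre  <- system.time(p <- prealloc(n))
identical(g, p)                  # both give the same vector
rbind(grow = tGrow, prealloc = tPre)
```

Increasing n should make the gap more visible, since each call to
c(out, i) copies everything accumulated so far.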

makeData <- function (nX, nPos) {
    # make data for timing tests
    pos <- sort(sample(nX, size=nPos, replace=FALSE))
    pos[1] <- 1L
    list(x = seq_len(nX), pos = pos)
}

f0 <- function (x, pos, FUN = length) {
    # OP's code, slightly modified
    pos2 <- c(pos, length(x) + 1)
    retval <- c()
    for (i in seq_len(length(pos))) {
        tmp <- x[pos2[i]:(pos2[i + 1] - 1)]
        retval <- c(retval, FUN(tmp))
    }
    retval
}

f1 <- function (x, pos, FUN = length) {
    # like f0 but we preallocate the result
    pos2 <- c(pos, length(x) + 1)
    retval <- numeric(length(pos))
    for (i in seq_len(length(pos))) {
        tmp <- x[pos2[i]:(pos2[i + 1] - 1)]
        retval[i] <- FUN(tmp)
    }
    retval
}

f2 <- function (x, pos, FUN = length) {
    # use tapply
    groupId <- rep(seq_along(pos), diff(c(pos, length(x) + 1)))
    tapply(x, groupId, FUN)
}

f3 <- function (x, pos, FUN = length) {
    # lapply(split(...))
    groupId <- rep(seq_along(pos), diff(c(pos, length(x) + 1)))
    unlist(lapply(split(x, groupId), FUN))
}

# make one million numbers in 400 thousand groups 
z <- makeData(nX=1e6, nPos=4e5)
t0 <- system.time( r0 <- f0(z$x, z$pos) )
t1 <- system.time( r1 <- f1(z$x, z$pos) )
t2 <- system.time( r2 <- f2(z$x, z$pos) )
t3 <- system.time( r3 <- f3(z$x, z$pos) )

> rbind(t0=t0, t1=t1, t2=t2, t3=t3)
   user.self sys.self elapsed user.child sys.child
t0    429.44     3.30  425.84         NA        NA
t1      3.20     0.00    3.16         NA        NA
t2      6.91     0.01    6.72         NA        NA
t3      2.68     0.02    2.72         NA        NA

The results from each, r0-r3, are almost the same.
f1 produced a "numeric" (double precision) result
instead of an integer one (length() returns an integer).
tapply() spends time seeing if FUN always returns
the same kind of result and simplifies the answer
if it does.  The others will run into problems
if FUN doesn't always return a single number.  Choose
a method based on how general the code needs to be
and how much error checking you require.
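
The type difference and the "single number" caveat can be seen on
the small example from earlier in the thread. The use of vapply()
here is not one of the timed methods above; it is shown only as one
possible way to get a type and length check, and the names pieces,
lens, flat and chk are made up for this sketch.

```r
x <- 1:10
pos <- c(1, 4, 7)
groupId <- rep(seq_along(pos), diff(c(pos, length(x) + 1)))
pieces <- split(x, groupId)

# length() returns an integer, but storing it in numeric() gives a double.
typeof(length(pieces[[1]]))                 # "integer"
res <- numeric(1)
res[1] <- length(pieces[[1]])
typeof(res)                                 # "double"

# vapply() checks that FUN returns exactly one integer per group.
lens <- vapply(pieces, length, FUN.VALUE = integer(1))
unname(lens)                                # 3 3 4

# A FUN that returns two values: unlist(lapply()) silently flattens them ...
flat <- unlist(lapply(pieces, range))
length(flat)                                # 6, not 3

# ... whereas vapply() with a length-1 template stops with an error.
chk <- try(vapply(pieces, range, FUN.VALUE = integer(1)), silent = TRUE)
inherits(chk, "try-error")                  # TRUE
```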

In any case, growing a vector that is destined to be
large can take a lot of time.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> 
> On Tue, May 4, 2010 at 5:50 PM, jim holtman <jholtman at gmail.com> wrote:
> 
> > Try this:
> >
> > > x <- 1:10
> > > pos <- c(1,4,7)
> > > pat <- rep(seq_along(pos), times=diff(c(pos, length(x) + 1)))
> > > split(x, pat)
> > $`1`
> > [1] 1 2 3
> > $`2`
> > [1] 4 5 6
> > $`3`
> > [1]  7  8  9 10
> >
> >
> >
> > On Tue, May 4, 2010 at 11:29 AM, Joris Meys <jorismeys at gmail.com> wrote:
> >
> >> Dear all,
> >>
> >> I'm trying to optimize code and want to avoid for-loops as much as
> >> possible.
> >> I'm applying a calculation on subvectors from a big one, and I get the
> >> subvectors by using a vector of starting positions:
> >>
> >> x <- 1:10
> >> pos <- c(1,4,7)
> >> n <- length(x)
> >>
> >> I try to do something like this :
> >> pos2 <- c(pos, n+1)
> >>
> >> out <- c()
> >> for(i in 1:n){
> >>     tmp <- x[pos2[i]:pos2[i+1]]
> >>     out <- c(out, length(tmp))
> >> }
> >>
> >> Never mind the length function, I apply a far more complicated one. It's
> >> about the use of the indices in the for-loop. I didn't see any way of
> >> doing that with an apply, unless there is a very convenient way of
> >> splitting my vector in a list of the subvectors or so.
> >>
> >> Anybody an idea?
> >> Cheers
> >> --
> >> Joris Meys
> >> Statistical Consultant
> >>
> >> Ghent University
> >> Faculty of Bioscience Engineering
> >> Department of Applied mathematics, biometrics and process control
> >>
> >> Coupure Links 653
> >> B-9000 Gent
> >>
> >> tel : +32 9 264 59 87
> >> Joris.Meys at Ugent.be
> >> -------------------------------
> >> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
> >>
> >>        [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >
> >
> > --
> > Jim Holtman
> > Cincinnati, OH
> > +1 513 646 9390
> >
> > What is the problem that you are trying to solve?
> >