[R] How to speed up or avoid the for-loops in this example?

Tim Churches tchur at optushome.com.au
Thu Feb 15 04:25:07 CET 2007


jim holtman wrote:
> On 2/14/07, Tim Churches <tchur at optushome.com.au> wrote:
>> Any advice, tips, clues or pointers to resources on how best to speed up
>> or, better still, avoid the loops in the following example code much
>> appreciated. My actual dataset has several tens of thousands of rows and
>> lots of columns, and these loops take a rather long time to run.
>> Everything else which I need to do is done using vectors and those parts
>> all run very quickly indeed. I spent quite a while doing searches on
>> r-help and re-reading the various manuals, but couldn't find any
>> existing relevant advice. I am sure the solution is obvious, but it
>> escapes me.
>>
>> Tim C
>>
>> # create an example data frame, multiple events per subject
>>
>> year <- c(1980,1982,1996,1985,1987,1990,1991,1992,1999,1972,1983)
>> event.of.interest <- c(F,T,T,F,F,F,T,F,T,T,F)
>> subject <- c(1,1,1,2,2,3,3,3,3,4,4)
>> df <- data.frame(cbind(subject,year,event.of.interest))
>>
>> # add a per-subject sequence number
>>
>> df$subject.seq <- 1
>> for (i in 2:nrow(df)) {
>> if (df$subject[i-1] == df$subject[i]) df$subject.seq[i] <-
>> df$subject.seq[i-1] + 1
>> }
>> df
> 
> # add an event sequence number which is zero until the first
>> # event of interest for that subject happens, and then increments
>> # thereafter
>>
>> df$event.seq <- 0
>> for (i in 1:nrow(df)) {
>> if (df$subject.seq[i] == 1 ) {
>>    current.event.seq <- 0
>> }
>> if (event.of.interest[i] == 1 | current.event.seq > 0)
>> current.event.seq <- current.event.seq + 1
>> df$event.seq[i] <- current.event.seq
>> }
>> df
> 
> 
> 
> try:
> 
>> df <- data.frame(cbind(subject,year,event.of.interest))
>> df <- do.call(rbind,by(df, df$subject, function(z){z$subject.seq <-
> seq(nrow(z)); z}))
>> df
>      subject year event.of.interest subject.seq
> 1.1        1 1980                 0           1
> 1.2        1 1982                 1           2
> 1.3        1 1996                 1           3
> 2.4        2 1985                 0           1
> 2.5        2 1987                 0           2
> 3.6        3 1990                 0           1
> 3.7        3 1991                 1           2
> 3.8        3 1992                 0           3
> 3.9        3 1999                 1           4
> 4.10       4 1972                 1           1
> 4.11       4 1983                 0           2
>> # determine first event
>> df <- do.call(rbind, by(df, df$subject, function(x){
> +     # determine first event
> +     .first <- cumsum(x$event.of.interest)
> +     # create sequence after first non-zero
> +     .first <- cumsum(.first > 0)
> +     x$event.seq <- .first
> +     x
> + }))
>> df
>        subject year event.of.interest subject.seq event.seq
> 1.1.1        1 1980                 0           1         0
> 1.1.2        1 1982                 1           2         1
> 1.1.3        1 1996                 1           3         2
> 2.2.4        2 1985                 0           1         0
> 2.2.5        2 1987                 0           2         0
> 3.3.6        3 1990                 0           1         0
> 3.3.7        3 1991                 1           2         1
> 3.3.8        3 1992                 0           3         2
> 3.3.9        3 1999                 1           4         3
> 4.4.10       4 1972                 1           1         1
> 4.4.11       4 1983                 0           2         2

Thanks Jim, that works a treat, over an order of magnitude faster than
the for-loops.

Anders Nielsen also provided this solution:

  df$subject.seq<-unlist(tapply(df$subject,
                                  df$subject,
                                  function(x)1:length(x)
                               )
                        )

Doing it that way is about 5 times faster than using rbind(). But Jim's
use of cumsum on the logical vector is very nifty.

I have now combined Jim's function with Anders' column-oriented approach
and the result is that my code now runs about two orders of magnitude
faster.

Many thanks,

Tim C



More information about the R-help mailing list