[R] Programming R to avoid loops

Jim Lemon drjimlemon at gmail.com
Sat Apr 18 09:24:11 CEST 2015


Hi Brant,
I'm a bit confused about which data frame is the one to match to, but
the following, while still including loops, should run much faster
than your version, as it only compares dates between rows that share
an id.

df1<-read.table(text="id date test1.result
  a 2009-08-28      1
  a 2009-09-16      1
  b 2008-08-06      0
  c 2012-02-02      1
  c 2010-08-03      1
  c 2012-08-02      0",header=TRUE)
df2<-read.table(text="id date test2.result
  a 2011-02-03      1
  b 2011-09-27      0
  b 2011-09-01      1
  c 2009-07-16      0
  c 2009-04-15      0
  c 2010-08-10      1",header=TRUE)

bi.match<-function(x1,x2,maxdaydiff=30) {
 # convert the character strings to dates (may not be necessary)
 x1$dates<-as.Date(x1$date,"%Y-%m-%d")
 x2$dates<-as.Date(x2$date,"%Y-%m-%d")
 # initialize the l and m variables
 x1$l<-x1$m<-0
 # get all the id codes
 ids<-unique(x2$id)
 # step through the id codes
 for(id1 in ids) {
  x1ind<-which(x1$id == id1)
  x2ind<-which(x2$id == id1)
  # seq_along() skips this loop when x1 has no rows for the current id
  for(id2 in seq_along(x1ind)) {
   # get the indices of the x2 dates that are within maxdaydiff days of this x1 date
   diffok<-which(abs(x1$dates[x1ind[id2]]-x2$dates[x2ind])<=maxdaydiff)
   # set the date diff match indicator: 1 if any x2 date is in range, else 0
   x1$l[x1ind[id2]]<-length(diffok) > 0
   # set the positive test indicator: 1 if any in-range test2 result is positive, else 0
   x1$m[x1ind[id2]]<-any(x2$test2.result[x2ind[diffok]] > 0)
  }
 }
 return(x1)
}

bi.match(df1,df2)
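
For comparison, the same idea (only compare dates between rows that
share an id) can be sketched without explicit loops using merge() and
tapply(). This is only a sketch under assumptions: the function name
bi.match.merge is made up here, the output columns test2count and
test2lag.result and the predays/lagdays arguments are borrowed from the
original findTestPairs, and "positive" is taken to mean that any test2
result in the window is greater than 0. It builds every within-id pair
in memory, so whether it is faster on large data would need to be
checked.

```r
# Example data repeated so that the sketch runs on its own.
df1 <- read.table(text="id date test1.result
  a 2009-08-28      1
  a 2009-09-16      1
  b 2008-08-06      0
  c 2012-02-02      1
  c 2010-08-03      1
  c 2012-08-02      0", header=TRUE, stringsAsFactors=FALSE)
df2 <- read.table(text="id date test2.result
  a 2011-02-03      1
  b 2011-09-27      0
  b 2011-09-01      1
  c 2009-07-16      0
  c 2009-04-15      0
  c 2010-08-10      1", header=TRUE, stringsAsFactors=FALSE)

# Hypothetical helper, not part of any package.
bi.match.merge <- function(x1, x2, predays=-30, lagdays=30) {
  # remember the original row of x1 so results can be written back
  x1$row <- seq_len(nrow(x1))
  x1$date <- as.Date(x1$date)
  x2$date <- as.Date(x2$date)
  # pair every x1 row with every x2 row that has the same id;
  # the two date columns become date.1 (x1) and date.2 (x2)
  pairs <- merge(x1[, c("row", "id", "date")], x2, by="id",
                 suffixes=c(".1", ".2"))
  # signed difference in days: test2 date minus test1 date
  lag <- as.numeric(pairs$date.2 - pairs$date.1)
  hits <- pairs[lag >= predays & lag <= lagdays, ]
  # defaults for x1 rows with no test2 in the window
  x1$test2count <- 0
  x1$test2lag.result <- NA_real_
  if (nrow(hits) > 0) {
    # number of in-window test2 results per x1 row
    cnt <- tapply(hits$test2.result, hits$row, length)
    # 1 if any in-window test2 result is positive, else 0
    pos <- tapply(hits$test2.result, hits$row,
                  function(r) as.numeric(any(r > 0)))
    idx <- as.integer(names(cnt))
    x1$test2count[idx] <- as.vector(cnt)
    x1$test2lag.result[idx] <- as.vector(pos)
  }
  x1$row <- NULL
  x1
}

res <- bi.match.merge(df1, df2)
print(res)
```

With the example data only the fifth row of df1 (id c, 2010-08-03) has
a test2 date within 30 days (2010-08-10), so that is the only row that
gets a non-zero count.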

Jim


On Sat, Apr 18, 2015 at 2:14 PM, Brant Inman <brant.inman at me.com> wrote:
> I have two large data frames with the following structure:
>
>> df1
>   id       date test1.result
> 1  a 2009-08-28      1
> 2  a 2009-09-16      1
> 3  b 2008-08-06      0
> 4  c 2012-02-02      1
> 5  c 2010-08-03      1
> 6  c 2012-08-02      0
>
>> df2
>   id       date test2.result
> 1  a 2011-02-03      1
> 2  b 2011-09-27      0
> 3  b 2011-09-01      1
> 4  c 2009-07-16      0
> 5  c 2009-04-15      0
> 6  c 2010-08-10      1
>
> I need to match items in df2 to those in df1 with specific matching criteria. I have written a looped matching algorithm that works, but it is very slow with my large datasets. I am requesting help on making a version of this code that is faster and "vectorized" so to speak.
>
> My algorithm is currently something like this code. It works but is damn slow.
>
> findTestPairs <- function(test1, id1, date1, test2, id2, date2, predays=-30,
>                           lagdays=30){
>   # Function to find, within subjects, two tests that occur within a timeframe
>   #
>   # test1 = the reference test result for which matching second tests are sought
>   # test2 = the second test result
>   # date1 = the date of test1
>   # date2 = the date of test2
>   # id1   = unique identifier for subject undergoing test 1
>   # id2   = unique identifier for subject undergoing test 2
>   # predays  = maximum number of days prior to test1 date that test2 date might occur
>   # lagdays  = maximum number of days after test1 date that test2 date might occur
>
>   result <- data.frame(matrix(ncol=5, nrow=length(test1)))
>   colnames(result) <- c('id','test1','date','test2count','test2lag.result')
>     result$id    <- id1
>     result$test1 <- test1
>     result$date  <- date1
>
>   for(i in 1:length(test1)){
>     l <- 0    # Counter of test2 results that matches test1 within lag interval
>     m <- NA   # Indicator of positive test2 within lag interval
>
>     for(j in 1:length(test2)){
>       if(id1[i] == id2[j]){               # STEP1: Match IDs
>         interval <- date2[j] - date1[i]
>         intmatch <- ifelse(interval >= predays && interval <= lagdays, 1, 0)
>
>         if(intmatch == 1){                # STEP2: Does test2 fall within lag interval?
>           l <- l+1                        # If test2 within lag interval, count it
>
>           if(test2[j] == 1) {             # STEP3: Is test 2 positive?
>             m <- 1                        # If test2 is positive, set indicator to 1
>           } else if(is.na(m)) {
>             m <- 0                        # Negative so far; do not overwrite an earlier positive
>           }
>         }
>       }
>     }
>     result$test2count[i] <- l
>     result$test2lag.result[i] <- m
>   }
>   return(result)
> }
>
> I would appreciate help on building a faster matching algorithm. I am pretty certain that R functions can be used to do this but I do not have a good grasp of how to make it work.
>
> Brant Inman


