[R] adding rows without loops

Thu May 23 17:58:32 CEST 2013

> This is the exact solution I came up with ...

"exact", really?

Is the time-consuming part the initial merge
   DFm <- merge(DF1, DF2, by=c("X.DATE", "X.TIME"), all=TRUE)

or the postprocessing to turn runs of NAs into the last non-NA
value in the column
  while(any(is.na(DFm))){
    if (any(is.na(DFm[1,]))) stop("Complete first row required!")
    ind <- which(is.na(DFm), arr.ind=TRUE)
    prind <- matrix(c(ind[,"row"]-1, ind[,"col"]), ncol=2)
    DFm[is.na(DFm)] <- DFm[prind]
 }

If it is the latter, you may get better results from applying zoo::na.locf()
to each non-key column of DFm.  E.g.,
   library(zoo)
   f2 <- function(DFm) {
      for(i in 3:length(DFm)) {
         DFm[[i]] <- na.locf(DFm[[i]])
      }
      DFm
   }
   f(DFm)
gives the same result as Blaser's algorithm
  f1 <- function (DFm)  {
     while (any(is.na(DFm))) {
         if (any(is.na(DFm[1, ]))) 
             stop("Complete first row required!")
         ind <- which(is.na(DFm), arr.ind = TRUE)
         prind <- matrix(c(ind[, "row"] - 1, ind[, "col"]), ncol = 2)
         DFm[is.na(DFm)] <- DFm[prind]
     }
     DFm
 }

If there are not a huge number of columns I would guess that f2() would be much
faster.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of Adeel - SafeGreenCapital
> Sent: Thursday, May 23, 2013 5:54 AM
> To: 'Blaser Nello'; r-help at r-project.org
> Subject: Re: [R] adding rows without loops
> 
> Thank you Blaser:
> 
> This is the exact solution I came up with but when comparing 8M rows even on
> an 8G machine, one runs out of memory.  To run this effectively, I have to
> break the DF into smaller DFs, loop through them and then do a massive
> rmerge at the end.  That's what takes 8+ hours to compute.
> 
> Even the bigmemory package is causing OOM issues.
> 
> -----Original Message-----
> From: Blaser Nello [mailto:nblaser at ispm.unibe.ch]
> Sent: Thursday, May 23, 2013 12:15 AM
> To: Adeel Amin; r-help at r-project.org
> Subject: RE: [R] adding rows without loops
> 
> Merge should do the trick. How to best use it will depend on what you
> want to do with the data after.
> The following is an example of what you could do. This will perform
> best, if the rows are missing at random and do not cluster.
> 
> DF1 <- data.frame(X.DATE=rep(01052007, 7), X.TIME=c(2:5,7:9)*100,
> VALUE=c(37, 42, 45, 45, 45, 42, 45), VALE2=c(29,24,28,27,35,32,32))
> DF2 <- data.frame(X.DATE=rep(01052007, 7), X.TIME=c(2:8)*100,
> VALUE=c(37, 42, 45, 45, 45, 42, 45), VALE2=c(29,24,28,27,35,32,32))
> 
> DFm <- merge(DF1, DF2, by=c("X.DATE", "X.TIME"), all=TRUE)
> 
> while(any(is.na(DFm))){
>   if (any(is.na(DFm[1,]))) stop("Complete first row required!")
>   ind <- which(is.na(DFm), arr.ind=TRUE)
>   prind <- matrix(c(ind[,"row"]-1, ind[,"col"]), ncol=2)
>   DFm[is.na(DFm)] <- DFm[prind]
> }
> DFm
> 
> Best,
> Nello
> 
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Adeel Amin
> Sent: Donnerstag, 23. Mai 2013 07:01
> To: r-help at r-project.org
> Subject: [R] adding rows without loops
> 
> I'm comparing a variety of datasets with over 4M rows.  I've solved this
> problem 5 different ways using a for/while loop but the processing time
> is murder (over 8 hours doing this row by row per data set).  As such
> I'm trying to find whether this solution is possible without a loop or
> one in which the processing time is much faster.
> 
> Each dataset is a time series as such:
> 
> DF1:
> 
>     X.DATE X.TIME VALUE VALUE2
> 1 01052007   0200    37     29
> 2 01052007   0300    42     24
> 3 01052007   0400    45     28
> 4 01052007   0500    45     27
> 5 01052007   0700    45     35
> 6 01052007   0800    42     32
> 7 01052007   0900    45     32
> ...
> ...
> ...
> n
> 
> DF2
> 
>     X.DATE X.TIME VALUE VALUE2
> 1 01052007   0200    37     29
> 2 01052007   0300    42     24
> 3 01052007   0400    45     28
> 4 01052007   0500    45     27
> 5 01052007   0600    45     35
> 6 01052007   0700    42     32
> 7 01052007   0800    45     32
> 
> ...
> ...
> n+4000
> 
> In other words there are 4000 more rows in DF2 then DF1 thus the
> datasets are of unequal length.
> 
> I'm trying to ensure that all dataframes have the same number of X.DATE
> and X.TIME entries.  Where they are missing, I'd like to insert a new
> row.
> 
> In the above example, when comparing DF2 to DF1, entry 01052007 0600
> entry is missing in DF1.  The solution would add a row to DF1 at the
> appropriate index.
> 
> so new dataframe would be
> 
> 
>     X.DATE X.TIME VALUE VALUE2
> 1 01052007   0200    37     29
> 2 01052007   0300    42     24
> 3 01052007   0400    45     28
> 4 01052007   0500    45     27
> 5 01052007   0600    45     27
> 6 01052007   0700    45     35
> 7 01052007   0800    42     32
> 8 01052007   0900    45     32
> 
> Value and Value2 would be the same as row 4.
> 
> Of course this is simple to accomplish using a row by row analysis but
> with of 4M rows the processing time destroying and rebinding the
> datasets is very time consuming and I believe highly un-R'ish.  What am
> I missing?
> 
> Thanks!
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.