[R] Efficient passing through big data.frame and modifying select

Tue Nov 25 22:02:23 CET 2008

> -----Original Message-----
> From: William Dunlap 
> Sent: Tuesday, November 25, 2008 9:16 AM
> To: 'johannes_graumann at web.de'
> Subject: Re: [R] Efficient passing through big data.frame and 
> modifying select fields
> 
> > Johannes Graumann johannes_graumann at web.de
> > Tue Nov 25 15:16:01 CET 2008
> > 
> > Hi all,
> > 
> > I have relatively big data frames (> 10000 rows by 80 columns)
> > that need to be exposed to "merge". Works marvelously well in 
> > general, but some fields of the data frames actually contain
> > multiple ";"-separated values encoded as a character string without
> > defined order, which makes the fields not match each other.
> > 
> > Example:
> > 
> > > frame1[1,1]
> > [1] "some;thing"
> > >frame2[2,1]
> > [2] "thing;some"
> > 
> > In order to enable merging/duplicate identification of columns
> > containing these strings, I wrote the following function, which
> > passes through the rows one by one, identifies ";"-containing cells,
> > splits and resorts them.
> >
> > ResortCombinedFields <- function(dframe){
> >  if(!is.data.frame(dframe)){
> >    stop("\"ResortCombinedFields\" input needs to be a data frame.")
> >  }
> >  for(row in seq(nrow(dframe))){
> >    for(mef in grep(";",dframe[row,])){
> 
> I needed to add drop=TRUE to the above dframe[row,] for this to work.
> 
> >      dframe[row,mef] <- 
> paste(sort(unlist(strsplit(dframe[row,mef],";"))),collapse=";")
> >    }
> >  }
> >  return(dframe)
> > }
> > 
> > works fine, but is horribly inefficient. How might this be 
> tackled more elegantly?
> > 
> > Thanks for any input, Joh
> 
> It is usually faster to loop over columns of an data frame and use row
> subscripting, if needed, on individual columns.  E.g., the following
> 2 are much quicker on a sample 1000 by 4 dataset I made with
> 
> dframe<-data.frame(lapply(c(One=1,Two=2,Three=3),
>    function(i)sapply(1:1000,
>       function(i)
>          
> paste(sample(LETTERS[1:5],size=sample(3,size=1),repl=FALSE), 
> collapse=";"))),
>    stringsAsFactors=FALSE)
> dframe$Four<-sample(LETTERS[1:5], size=nrow(dframe), 
> replace=TRUE) # no ;'s in column Four
> 
> The first function, f1, doesn't try to find which rows may 
> need adjusting
> and the second, f2, does.
> 
> f1 <- function(dframe){
>   if(!is.data.frame(dframe)){
>     stop("\"ResortCombinedFields\" input needs to be a data frame.")
>   }
>   for(icol in seq_len(ncol(dframe))){
>     dframe[,icol] <- unlist(lapply(strsplit(dframe[,icol], 
> ";"), function(parts) paste(sort(parts), collapse=";")))
>   }
>   return(dframe)
> }
> 
> f2 <-
> function(dframe){
>   if(!is.data.frame(dframe)){
>     stop("\"ResortCombinedFields\" input needs to be a data frame.")
>   }
>   for(icol in seq_len(ncol(dframe))){
>     col <- dframe[,icol]
>     irow <- grep(";", col)
>     if (length(irow)) {
>         col[irow] <- unlist(lapply(strsplit(col[irow], ";"), 
> function(parts) paste(sort(parts), collapse=";")))
>         dframe[,icol] <- col
>     }
>   }
>   return(dframe)
> }
> 
> Times were
>  
> > unix.time(z<-ResortCombinedFields(dframe))
>    user  system elapsed
>   2.526   0.022   2.559
> > unix.time(f1z<-f1(dframe))
>    user  system elapsed
>   0.509   0.000   0.508
> > unix.time(f2z<-f2(dframe))
>    user  system elapsed
>   0.259   0.004   0.264
> > identical(z, f1z)
> [1] TRUE
> > identical(z, f2z)
> [1] TRUE

In R 2.7.0 (April 2008) f1() and f2() both take time proportional
to nrow(dframe), while your original ResortCombinedFields() takes
time proportional to the square of nrow(dframe).  E.g., for 50,000
rows ResortCombinedFields takes 4252 seconds while f2 takes 14 seconds
It looks like 2.9 acts about the same.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com