[R] Efficient passing through big data.frame and modifying select

Johannes Graumann johannes_graumann at web.de
Wed Nov 26 11:36:58 CET 2008


Marvelous! Thanks guys for your hints and time! Very smooth now!

Joh

On Wednesday 26 November 2008 03:41:49 Henrik Bengtsson wrote:
> Alright, here are another $.02: using 'use.names=FALSE' in unlist() is
> much faster than the default 'use.names=TRUE'. /Henrik
>
> On Tue, Nov 25, 2008 at 6:40 PM, Henrik Bengtsson <hb at stat.berkeley.edu> 
wrote:
> > My $.02: Using argument 'fixed=TRUE' in strsplit() is much faster than
> > the default 'fixed=FALSE'. /Henrik
> >
> > On Tue, Nov 25, 2008 at 1:02 PM, William Dunlap <wdunlap at tibco.com> wrote:
> >>> -----Original Message-----
> >>> From: William Dunlap
> >>> Sent: Tuesday, November 25, 2008 9:16 AM
> >>> To: 'johannes_graumann at web.de'
> >>> Subject: Re: [R] Efficient passing through big data.frame and
> >>> modifying select fields
> >>>
> >>> > Johannes Graumann johannes_graumann at web.de
> >>> > Tue Nov 25 15:16:01 CET 2008
> >>> >
> >>> > Hi all,
> >>> >
> >>> > I have relatively big data frames (> 10000 rows by 80 columns)
> >>> > that need to be exposed to "merge". Works marvelously well in
> >>> > general, but some fields of the data frames actually contain
> >>> > multiple ";"-separated values encoded as a character string without
> >>> > defined order, which makes the fields not match each other.
> >>> >
> >>> > Example:
> >>> > > frame1[1,1]
> >>> >
> >>> > [1] "some;thing"
> >>> >
> >>> > >frame2[2,1]
> >>> >
> >>> > [2] "thing;some"
> >>> >
> >>> > In order to enable merging/duplicate identification of columns
> >>> > containing these strings, I wrote the following function, which
> >>> > passes through the rows one by one, identifies ";"-containing cells,
> >>> > splits and resorts them.
> >>> >
> >>> > ResortCombinedFields <- function(dframe){
> >>> >  if(!is.data.frame(dframe)){
> >>> >    stop("\"ResortCombinedFields\" input needs to be a data frame.")
> >>> >  }
> >>> >  for(row in seq(nrow(dframe))){
> >>> >    for(mef in grep(";",dframe[row,])){
> >>>
> >>> I needed to add drop=TRUE to the above dframe[row,] for this to work.
> >>>
> >>> >      dframe[row,mef] <-
> >>>
> >>> paste(sort(unlist(strsplit(dframe[row,mef],";"))),collapse=";")
> >>>
> >>> >    }
> >>> >  }
> >>> >  return(dframe)
> >>> > }
> >>> >
> >>> > works fine, but is horribly inefficient. How might this be
> >>>
> >>> tackled more elegantly?
> >>>
> >>> > Thanks for any input, Joh
> >>>
> >>> It is usually faster to loop over columns of an data frame and use row
> >>> subscripting, if needed, on individual columns.  E.g., the following
> >>> 2 are much quicker on a sample 1000 by 4 dataset I made with
> >>>
> >>> dframe<-data.frame(lapply(c(One=1,Two=2,Three=3),
> >>>    function(i)sapply(1:1000,
> >>>       function(i)
> >>>
> >>> paste(sample(LETTERS[1:5],size=sample(3,size=1),repl=FALSE),
> >>> collapse=";"))),
> >>>    stringsAsFactors=FALSE)
> >>> dframe$Four<-sample(LETTERS[1:5], size=nrow(dframe),
> >>> replace=TRUE) # no ;'s in column Four
> >>>
> >>> The first function, f1, doesn't try to find which rows may
> >>> need adjusting
> >>> and the second, f2, does.
> >>>
> >>> f1 <- function(dframe){
> >>>   if(!is.data.frame(dframe)){
> >>>     stop("\"ResortCombinedFields\" input needs to be a data frame.")
> >>>   }
> >>>   for(icol in seq_len(ncol(dframe))){
> >>>     dframe[,icol] <- unlist(lapply(strsplit(dframe[,icol],
> >>> ";"), function(parts) paste(sort(parts), collapse=";")))
> >>>   }
> >>>   return(dframe)
> >>> }
> >>>
> >>> f2 <-
> >>> function(dframe){
> >>>   if(!is.data.frame(dframe)){
> >>>     stop("\"ResortCombinedFields\" input needs to be a data frame.")
> >>>   }
> >>>   for(icol in seq_len(ncol(dframe))){
> >>>     col <- dframe[,icol]
> >>>     irow <- grep(";", col)
> >>>     if (length(irow)) {
> >>>         col[irow] <- unlist(lapply(strsplit(col[irow], ";"),
> >>> function(parts) paste(sort(parts), collapse=";")))
> >>>         dframe[,icol] <- col
> >>>     }
> >>>   }
> >>>   return(dframe)
> >>> }
> >>>
> >>> Times were
> >>>
> >>> > unix.time(z<-ResortCombinedFields(dframe))
> >>>
> >>>    user  system elapsed
> >>>   2.526   0.022   2.559
> >>>
> >>> > unix.time(f1z<-f1(dframe))
> >>>
> >>>    user  system elapsed
> >>>   0.509   0.000   0.508
> >>>
> >>> > unix.time(f2z<-f2(dframe))
> >>>
> >>>    user  system elapsed
> >>>   0.259   0.004   0.264
> >>>
> >>> > identical(z, f1z)
> >>>
> >>> [1] TRUE
> >>>
> >>> > identical(z, f2z)
> >>>
> >>> [1] TRUE
> >>
> >> In R 2.7.0 (April 2008) f1() and f2() both take time proportional
> >> to nrow(dframe), while your original ResortCombinedFields() takes
> >> time proportional to the square of nrow(dframe).  E.g., for 50,000
> >> rows ResortCombinedFields takes 4252 seconds while f2 takes 14 seconds
> >> It looks like 2.9 acts about the same.
> >>
> >> Bill Dunlap
> >> TIBCO Software Inc - Spotfire Division
> >> wdunlap tibco.com
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html and provide commented,
> >> minimal, self-contained, reproducible code.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 835 bytes
Desc: This is a digitally signed message part.
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20081126/f2c08061/attachment.bin>


More information about the R-help mailing list