[R] data frame vs. matrix

William Dunlap wdunlap at tibco.com
Mon Mar 17 00:36:59 CET 2014


Duncan's analysis suggests another way to do this:
extract the 'x' vector, operate on that vector in a loop,
then insert the result into the data.frame.  I added
a df="quicker" option to your df argument and made the test
dataset deterministic so we could verify that the algorithms
do the same thing:

dumkoll <- function(n = 1000, df = TRUE){
     dfr <- data.frame(x = log(seq_len(n)), y = sqrt(seq_len(n)))
     if (identical(df, "quicker")) {
         ## extract the column, loop over the plain vector, reinsert
         x <- dfr$x
         for (i in 2:length(x)) {
             x[i] <- x[i-1]
         }
         dfr$x <- x
     } else if (df) {
         ## original approach: assign into the data.frame on every iteration
         for (i in 2:NROW(dfr)){
             # if (!(i %% 100)) cat("i = ", i, "\n")
             dfr$x[i] <- dfr$x[i-1]
         }
     } else {
         ## matrix approach: convert, loop, copy the column back
         dm <- as.matrix(dfr)
         for (i in 2:NROW(dm)){
             # if (!(i %% 100)) cat("i = ", i, "\n")
             dm[i, 1] <- dm[i-1, 1]
         }
         dfr$x <- dm[, 1]
     }
     dfr
}

Timings for n = 10^4, 2*10^4, and 4*10^4 show that the time is roughly
quadratic in n for the df=TRUE case and close to linear in the other cases,
with the new method taking about 60% of the time of the matrix method:
   > n <- c("10k"=1e4, "20k"=2e4, "40k"=4e4)
   > sapply(n, function(n)system.time(dumkoll(n, df=FALSE))[1:3])
              10k  20k  40k
   user.self 0.11 0.22 0.43
   sys.self  0.02 0.00 0.00
   elapsed   0.12 0.22 0.44
   > sapply(n, function(n)system.time(dumkoll(n, df=TRUE))[1:3])
              10k   20k   40k
   user.self 3.59 14.74 78.37
   sys.self  0.00  0.11  0.16
   elapsed   3.59 14.91 78.81
   > sapply(n, function(n)system.time(dumkoll(n, df="quicker"))[1:3])
              10k  20k  40k
   user.self 0.06 0.12 0.26
   sys.self  0.00 0.00 0.00
   elapsed   0.07 0.13 0.27
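A quick way to check those scaling claims (using the elapsed times shown
above) is to fit a log-log slope, which should be close to 1 for linear
growth and close to 2 for quadratic growth:
   > loglogslope <- function(t) coef(lm(log(t) ~ log(n)))[["log(n)"]]
   > loglogslope(c(0.12, 0.22, 0.44))   # df=FALSE:     slope about 0.94
   > loglogslope(c(3.59, 14.91, 78.81)) # df=TRUE:      slope about 2.23
   > loglogslope(c(0.07, 0.13, 0.27))   # df="quicker": slope about 0.97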
I also timed the two faster cases for n=10^6 and the time still looks linear
in n, with the vector approach still taking about 60% of the time of the
matrix approach.
   > system.time(dumkoll(n=10^6, df=FALSE))
      user  system elapsed 
     11.65    0.12   11.82 
   > system.time(dumkoll(n=10^6, df="quicker"))
      user  system elapsed 
      6.79    0.08    6.91
All three methods give identical results:
   > identical(dumkoll(100,df=FALSE), dumkoll(100,df=TRUE))
   [1] TRUE
   > identical(dumkoll(100,df=FALSE), dumkoll(100,df="quicker"))
   [1] TRUE
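
Incidentally, you can see why the df=TRUE loop is so slow in how R expands
a complex assignment (see Duncan's step-by-step analysis below, and the
"Subset assignment" section of the R Language Definition).  Roughly, each
iteration of dfr$x[i] <- dfr$x[i-1] expands to
   `*tmp*` <- dfr
   dfr <- `$<-`(`*tmp*`, name = "x",
                value = `[<-`(`*tmp*`$x, i, value = dfr$x[i-1]))
   rm(`*tmp*`)
i.e. name lookups for x on both sides plus calls to `[<-` and `$<-` that
may each copy the column (or the whole data.frame).  The df="quicker"
version hoists the column out of the data.frame, so the loop body is a
simple vector assignment that R can update in place.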

If your data.frame has columns of various types, then as.matrix will
coerce them all to a common type (often character), so it may give
you the wrong result in addition to being unnecessarily slow.
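
For instance (a small illustration of the coercion, separate from the
timing code):
   > dfr2 <- data.frame(x = 1:2, y = c("a", "b"))
   > as.matrix(dfr2)               # both columns coerced to character
   > as.matrix(dfr2)[1, "x"] + 1   # error: non-numeric argument
Keeping the columns as separate vectors, as the df="quicker" branch does,
avoids that coercion entirely.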

Bill Dunlap
TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of Duncan Murdoch
> Sent: Sunday, March 16, 2014 3:56 PM
> To: Göran Broström; r-help at r-project.org
> Subject: Re: [R] data frame vs. matrix
> 
> On 14-03-16 2:57 PM, Göran Broström wrote:
> > I have always known that "matrices are faster than data frames", for
> > instance this function:
> >
> >
> > dumkoll <- function(n = 1000, df = TRUE){
> >       dfr <- data.frame(x = rnorm(n), y = rnorm(n))
> >       if (df){
> >           for (i in 2:NROW(dfr)){
> >               if (!(i %% 100)) cat("i = ", i, "\n")
> >               dfr$x[i] <- dfr$x[i-1]
> >           }
> >       }else{
> >           dm <- as.matrix(dfr)
> >           for (i in 2:NROW(dm)){
> >               if (!(i %% 100)) cat("i = ", i, "\n")
> >               dm[i, 1] <- dm[i-1, 1]
> >           }
> >           dfr$x <- dm[, 1]
> >       }
> > }
> >
> > --------------------
> >   > system.time(dumkoll())
> >
> >      user  system elapsed
> >     0.046   0.000   0.045
> >
> >   > system.time(dumkoll(df = FALSE))
> >
> >      user  system elapsed
> >     0.007   0.000   0.008
> > ----------------------
> >
> > OK, no big deal, but I stumbled over a data frame with one million
> > records. Then, with df = TRUE,
> > ----------------------------
> >        user    system   elapsed
> > 44677.141  1271.544 46016.754
> > ----------------------------
> > This is around 12 hours.
> >
> > With df = FALSE, it took only six seconds! About 7500 times faster.
> >
> > I was really surprised by the huge difference, and I wonder if this is
> > to be expected, or if it is some peculiarity with my installation: I'm
> > running Ubuntu 13.10 on a MacBook Pro with 8 GB memory, R-3.0.3.
> 
> I don't find it surprising.  The line
> 
> dfr$x[i] <- dfr$x[i-1]
> 
> will be executed about a million times.  It does the following:
> 
> 1.  Get a pointer to the x element of dfr.  This requires R to look
> through all the names of dfr to figure out which one is "x".
> 
> 2.  Extract the i-1 element from it.  Not particularly slow.
> 
> 3.  Get a pointer to the x element of dfr again.  (R doesn't cache these
> things.)
> 
> 4.  Set the i element of it to a new value.  This could require the
> entire column or even the entire dataframe to be copied, if R hasn't
> kept track of the fact that it is really being changed in place.  In a
> complex assignment like that, I wouldn't be surprised if that took
> place.  (In the matrix equivalent, it would be easier to recognize that
> it is safe to change the existing value.)
> 
> Luke Tierney is making some changes in R-devel that might help a lot in
> cases like this, but I expect the matrix code will always be faster.
> 
> Duncan Murdoch