[Rd] speeding up perception

Sun Jul 3 16:26:03 CEST 2011

Robert,

it's not the handling of row names per se that causes the slowdown. My point was that if all you need is a matrix-like structure with different column types, you may want to use lists instead, and for equal column types you're better off with a matrix.

But to address your point, one of the reasons for subassignments on data frames being slow is that they need extra copies of the data frame for method dispatch. Data frames are lists of column vectors, so the penalty is worse with increasing number of columns. Rows play no significant (additional) role, because those are simply operations on the column vectors (they need to be copied on modification in any case).
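
A minimal sketch of the "list of column vectors" point, using only base R. The small 4 x 3 size is an arbitrary choice for illustration and is not taken from the timings in this thread:

```r
# A data frame is a list with one element per column, plus a class
# attribute and row names.
m <- as.data.frame(matrix(0, ncol = 3, nrow = 4))

is.list(m)             # a data frame is a list
length(m) == ncol(m)   # one list element per column

# Dropping the class leaves a plain list of column vectors.
cols <- unclass(m)
class(cols)
length(cols[[1]])      # each element holds one column of nrow(m) values
```

Because the container is a list of columns, copying the container scales with the number of columns rather than with the number of rows.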

In practice it would not matter as much unless users do stupid things like the example loop. In that case the list holding the columns is copied twice for every single value of i, which is deadly. Obviously, the sensible thing to do, m[1:1000,1] <- 1, does not have that issue.
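
As a rough check that the single vectorised assignment is a drop-in replacement for the loop, the sketch below fills the first column both ways and compares the results. The 5-column size is an assumption made only to keep the loop quick; it is not one of the cases timed in this thread:

```r
m <- as.data.frame(matrix(0, ncol = 5, nrow = 1000))

# Element-by-element: one data frame subassignment per row.
looped <- m
for (i in 1:1000)
    looped[i, 1] <- 1

# Vectorised: a single data frame subassignment for the whole column.
vectorised <- m
vectorised[1:1000, 1] <- 1

# Both approaches leave the same values in the first column.
identical(looped[[1]], vectorised[[1]])
```

The vectorised form goes through the data frame method once instead of 1000 times, so whatever copying happens is paid only once.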

So, to illustrate part of the data.frame penalty, consider simply falling back to lists in the assignment:

> example2 <- function(m){
+    for(i in 1:1000)
+        m[[1]][i] <- 1
+ }
> m <- as.data.frame(matrix(0, ncol=1000, nrow=1000))
> system.time( example2(m) )
   user  system elapsed 
 44.359  13.608  58.011 

> ### using a list is very fast as illustrated before:
> m <- as.list(as.data.frame(matrix(0, ncol=1000, nrow=1000)))
> system.time( example2(m) )
   user  system elapsed 
   0.01    0.00    0.01 

> ### now try to fall back to a list for each iteration (part of what the data frames have to do):
> example3 <- function(m){
+    for(i in 1:1000) {
+        oc <- class(m)
+        class(m) <- NULL
+        m[[1]][i] <- 1
+        class(m) <- oc
+    }
+ }
> system.time( example3(m) )
   user  system elapsed 
 19.080   2.251  21.335 

So just the simple fact that you unclass and re-class the object gives you half of the penalty that data.frames incur even if you're dealing with a list. Add the additional logic that data frames have to go through and you have the full picture.

So, as I was saying earlier, if you want to loop subassignments over many elements: don't do that in the first place, but if you do, use lists or matrices, NOT data frames.
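
If the loop really cannot be avoided, one possible pattern is to drop to a plain list once, loop over that, and rebuild the data frame once at the end. This is only a sketch: `fill_first_column` is a hypothetical helper name, and rebuilding with `as.data.frame()` assumes default row names (custom row names would have to be restored separately):

```r
# Hypothetical helper: loop over a plain list instead of a data frame.
fill_first_column <- function(df) {
    cols <- as.list(df)              # plain list of column vectors
    for (i in seq_len(nrow(df)))
        cols[[1]][i] <- 1            # ordinary list/vector subassignment
    as.data.frame(cols)              # convert back once, after the loop
}

m <- as.data.frame(matrix(0, ncol = 5, nrow = 1000))
out <- fill_first_column(m)
all(out[[1]] == 1)
```

The conversion to and from a list happens once per call rather than once per element, which is the difference between example2 on a list and example3 above.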

Cheers,
Simon

On Jul 3, 2011, at 8:13 AM, Robert Stojnic wrote:

> 
> Hi Simon,
> 
> On 03/07/11 05:30, Simon Urbanek wrote:
>> This is just a quick, incomplete response, but the main misconception is really the use of data.frames. If you don't use the elaborate mechanics of data frames that involve the management of row names, then they are definitely the wrong tool to use, because most of the overhead is exactly to manage the row names and you pay a substantial penalty for that. Just drop that one feature and you get timings similar to a matrix:
> 
> I tried to find some documentation on why there needs to be extra row names handling when one is just assigning values into the column of a data frame, but couldn't find any. For a while I stared at the code of `[<-.data.frame` but couldn't figure it out myself. Can you please summarise what exactly is going on when one does m[1, 1] <- 1 where m is a data frame?
> 
> I found that the performance differs significantly with the number of columns. For instance
> 
> # reassign first column to 1
> example <- function(m){
>    for(i in 1:1000)
>        m[i,1] <- 1
> }
> 
> m <- as.data.frame(matrix(0, ncol=2, nrow=1000))
> system.time( example(m) )
> 
>   user  system elapsed
>  0.164   0.000   0.163
> 
> m <- as.data.frame(matrix(0, ncol=1000, nrow=1000))
> system.time( example(m) )
> 
>   user  system elapsed
> 34.634   0.004  34.765
> 
> When m is a matrix, both run well under 0.1s.
> 
> Increasing the number of rows (but not the number of iterations) leads to some increase in time, but not as drastic as when increasing the number of columns. Using m[[y]][x] in this case doesn't help either.
> 
> example2 <- function(m){
>    for(i in 1:1000)
>        m[[1]][i] <- 1
> }
> 
> m <- as.data.frame(matrix(0, ncol=1000, nrow=1000))
> system.time( example2(m) )
> 
>   user  system elapsed
> 36.007   0.148  36.233
> 
> 
> r.
> 