[R] data frame vs. matrix

Göran Broström goran.brostrom at umu.se
Mon Mar 17 11:16:05 CET 2014



On 2014-03-16 23:56, Duncan Murdoch wrote:
> On 14-03-16 2:57 PM, Göran Broström wrote:
>> I have always known that "matrices are faster than data frames", for
>> instance this function:
>>
>>
>> dumkoll <- function(n = 1000, df = TRUE){
>>        dfr <- data.frame(x = rnorm(n), y = rnorm(n))
>>        if (df){
>>            for (i in 2:NROW(dfr)){
>>                if (!(i %% 100)) cat("i = ", i, "\n")
>>                dfr$x[i] <- dfr$x[i-1]
>>            }
>>        }else{
>>            dm <- as.matrix(dfr)
>>            for (i in 2:NROW(dm)){
>>                if (!(i %% 100)) cat("i = ", i, "\n")
>>                dm[i, 1] <- dm[i-1, 1]
>>            }
>>            dfr$x <- dm[, 1]
>>        }
>> }
>>
>> --------------------
>>    > system.time(dumkoll())
>>
>>       user  system elapsed
>>      0.046   0.000   0.045
>>
>>    > system.time(dumkoll(df = FALSE))
>>
>>       user  system elapsed
>>      0.007   0.000   0.008
>> ----------------------
>>
>> OK, no big deal, but I stumbled over a data frame with one million
>> records. Then, with df = TRUE,
>> ----------------------------
>>         user    system   elapsed
>> 44677.141  1271.544 46016.754
>> ----------------------------
>> This is around 12 hours.
>>
>> With df = FALSE, it took only six seconds! About 7500 times faster.
>>
>> I was really surprised by the huge difference, and I wonder if this is
>> to be expected, or if it is some peculiarity of my installation: I'm
>> running Ubuntu 13.10 on a MacBook Pro with 8 GB of memory, R-3.0.3.
>
> I don't find it surprising.  The line
>
> dfr$x[i] <- dfr$x[i-1]
>
> will be executed about a million times.  It does the following:

Thanks for the explanation; I had the idea that dfr[i, 1] <- might be
faster than dfr$x[i] <- , but it is in fact significantly slower.
A helpful experience.
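
For anyone who wants to check this on their own machine, here is a minimal sketch that times the three assignment forms on a small data frame. The helper name compare_assign and the default n = 2000 are made up for illustration and are not from the thread; timings will vary with R version and hardware, so no numbers are quoted.

```r
# Hypothetical helper (not from the thread): times the same carry-forward
# loop written with dfr$x[i] <-, dfr[i, 1] <-, and dm[i, 1] <- (matrix).
compare_assign <- function(n = 2000) {
    dfr <- data.frame(x = rnorm(n), y = rnorm(n))
    dm <- as.matrix(dfr)   # numeric matrix with the same contents
    d1 <- dfr              # separate copies so each loop starts from the same data
    d2 <- dfr

    t_dollar <- system.time(
        for (i in 2:n) d1$x[i] <- d1$x[i - 1]
    )["elapsed"]

    t_bracket <- system.time(
        for (i in 2:n) d2[i, 1] <- d2[i - 1, 1]
    )["elapsed"]

    t_matrix <- system.time(
        for (i in 2:n) dm[i, 1] <- dm[i - 1, 1]
    )["elapsed"]

    # Elapsed seconds for each variant
    c(dollar  = unname(t_dollar),
      bracket = unname(t_bracket),
      matrix  = unname(t_matrix))
}

print(compare_assign())
```

Keeping n small keeps the run short; the gap between the variants should widen as n grows.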

Göran
>
> 1.  Get a pointer to the x element of dfr.  This requires R to look
> through all the names of dfr to figure out which one is "x".
>
> 2.  Extract the i-1 element from it.  Not particularly slow.
>
> 3.  Get a pointer to the x element of dfr again.  (R doesn't cache these
> things.)
>
> 4.  Set the i element of it to a new value.  This could require the
> entire column or even the entire dataframe to be copied, if R hasn't
> kept track of the fact that it is really being changed in place.  In a
> complex assignment like that, I wouldn't be surprised if that took
> place.  (In the matrix equivalent, it would be easier to recognize that
> it is safe to change the existing value.)
>
> Luke Tierney is making some changes in R-devel that might help a lot in
> cases like this, but I expect the matrix code will always be faster.
>
> Duncan Murdoch
>
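
The four steps above suggest one way to keep the data frame and still avoid the repeated lookups and possible copies: pull the column out once, update the plain vector in the loop, and assign it back once. This is only a sketch under the assumption that the loop touches a single column, as in dumkoll; the name dumkoll_vec is hypothetical and not part of the original exchange.

```r
# Hypothetical variant of dumkoll: same carry-forward update, but the loop
# works on a plain numeric vector instead of on the data frame.
dumkoll_vec <- function(n = 1000) {
    dfr <- data.frame(x = rnorm(n), y = rnorm(n))

    x <- dfr$x                      # one name lookup, one extraction
    for (i in seq_len(n - 1) + 1) { # i = 2, ..., n; empty when n == 1
        x[i] <- x[i - 1]            # ordinary vector assignment
    }
    dfr$x <- x                      # one assignment back into the data frame

    dfr
}

res <- dumkoll_vec(10)
```

After the call, every element of res$x equals the first one, which is what the original loop does to dfr$x as well.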



