[R] Significant performance difference between split of a data.frame and split of vectors

Fri Dec 18 16:15:33 CET 2009

Thanks for suggesting data.table. It does have advantages in this example 
but it has to be used in a particular way.

What does Peng actually want to achieve?  I'll guess (but it's only a guess) 
that he doesn't actually need to hold the entire table in memory in a 
split-up format before doing something with each subset.  He only needs one 
subset at a time.  I'd guess most users are in the same position most of 
the time.

data.table has a mechanism to do this, in a compact syntax that is 
quicker to program. When you use the 'by' argument, it evaluates the j 
expression on the first subset defined by 'by' and then no longer needs 
that subset. It then moves on to the next subset. So it operates in a 
smaller memory footprint than i) creating an entire copy of the data in a 
split-up format, and then ii) lapply'ing through that new list doing 
something on the subsets.
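
As a rough sketch of the two routes just described (this is not code from the thread; the column name `total` and the small sizes are made up for illustration, and it assumes a current data.table is installed), the following computes a per-group sum both ways and checks that the results agree:

```r
library(data.table)

set.seed(1)
m <- 1000                                   # rows (kept small for illustration)
k <- 10                                     # number of groups
v <- rnorm(m)
f <- sample(1:k, size = m, replace = TRUE)

# Route 1: data.table 'by'. j is evaluated once per group; no split-up copy
# of the whole table is built first.
dt <- data.table(V1 = v, f = f)
by_totals <- dt[, list(total = sum(V1)), by = "f"]
by_totals <- by_totals[order(f)]            # 'by' returns groups in order of appearance

# Route 2: split the data.frame into a list of subsets, then loop over it.
df <- data.frame(V1 = v, f = f)
parts <- split(df, df$f)                    # full split-up copy, one data.frame per group
split_totals <- sapply(parts, function(d) sum(d$V1))

# Both routes give the same per-group sums.
stopifnot(isTRUE(all.equal(by_totals$total, unname(split_totals))))
```

Route 2 needs the extra sapply() step before anything useful comes out, which is the additional program code referred to above.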

By guessing at something realistic we might want to actually do on each 
subset,  here is the example again :

> m = 30000
> n = 6
> k = 3000
> x = replicate(n,rnorm(m))
> f = sample(1:k,size=m,replace=TRUE)
> dt = data.table(x,f)
> dt[,sum(V1),by="f"]    # this is the proper mechanism to split a
>                        # data.table and operate on each subset. column names
>                        # can be used as variables directly.
     f                  V1
     1    1.82720825112107
     2    4.22721189592209
     3  -0.409096014477913
...[ snip ] ...
> system.time(dt[,sum(V1),by="f"])    # same again but timing it this time
    user  system elapsed
   0.13    0.00    0.12
>
>  system.time(split(as.data.frame(x),f))
   user  system elapsed
   1.55    0.00    1.55
>

So just splitting a data.frame is about 10 times slower (1.55s vs 0.12s 
elapsed here) than splitting the data.table (in the proper way), and even 
then the data.frame split still needs more program code to loop through the 
split-up copy afterwards and do something useful with it.

It's not just speed: data.table may work where data.frame fails. An 
example could be constructed where x takes 55% of available RAM. I'd expect 
splitting a data.frame to fail with an out-of-memory error (as it 
needs another 55% for the full copy).  A data.table 'by', in contrast, should 
work fine since it only requires the memory for the largest subset at any 
one time (other than in the esoteric case of there being only one subset).
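
To get a feel for the size of that full copy, here is a small base-R sketch using the same m, n and k as the example above. It is only indicative: object.size() gives an estimate, and the exact figures will vary with R version and platform.

```r
set.seed(0)
m <- 30000; n <- 6; k <- 3000
x <- replicate(n, rnorm(m))
f <- sample(1:k, size = m, replace = TRUE)

df    <- as.data.frame(x)
parts <- split(df, f)                  # list of up to k small data.frames

size_df    <- as.numeric(object.size(df))
size_parts <- as.numeric(object.size(parts))
size_max   <- max(sapply(parts, function(d) as.numeric(object.size(d))))

# The split-up list holds all of the data again, plus per-subset overhead,
# whereas any single subset is tiny by comparison.
cat("original data.frame:", size_df, "bytes\n")
cat("split-up copy:      ", size_parts, "bytes\n")
cat("largest subset:     ", size_max, "bytes\n")
```

The largest-subset figure is only a rough proxy for what a 'by' needs to hold at any one time, but it shows the gap between working on one subset and materialising them all.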

Having said the above I still consider 'by' in data.table to be slow,  but 
just relative to how fast it could be. See feature request #195 in the 
package's R-forge project. There are some other more complicated 
circumstances when the 'by' argument can result in slow performance 
(possibly relative to data.frame or matrix methods) but this example doesn't 
seem to be in that category,  at least with what we've seen so far.  The 
problem in this thread so far looks to be due to split() itself.

Please note there is no split.data.table method, and the default method does 
not appear to be efficient. methods(split) shows there is no split method for 
data.table, whereas data.frame has its own special split method. The following 
method has been added as a feature request in the data.table project :
    split.data.table = function(...){
        stop("Use 'by' argument instead of split() on data.table. ",
             "See ?'[.data.table'")
    }
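
To illustrate how such a method would intercept the call, here is a sketch using a made-up class `mytable` (hypothetical, chosen so that it doesn't clash with anything data.table itself defines). It relies only on base R's S3 dispatch:

```r
# A made-up S3 class standing in for data.table (hypothetical, illustration only).
obj <- structure(list(a = 1:4), class = "mytable")

# An S3 method for split() that refuses to run and points to the alternative.
split.mytable <- function(x, ...) {
  stop("Use 'by' argument instead of split() on this class.")
}

# split() dispatches on the class of its first argument, so the stub is called
# and the default method is never reached.
msg <- tryCatch(split(obj, c(1, 1, 2, 2)),
                error = function(e) conditionMessage(e))
print(msg)

# data.frame, by contrast, has its own dedicated split method.
has_df_method <- is.function(getS3method("split", "data.frame"))
print(has_df_method)
```

The point of the stub is simply that the user gets an immediate, informative error instead of silently falling through to a slow default.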

Hope this helps,
Matthew

"David Winsemius" <dwinsemius at comcast.net> wrote in message 
news:CE3E3AFF-F2B5-4C80-9D00-00954130AA6B at comcast.net...
>
> On Dec 9, 2009, at 2:59 PM, Peng Yu wrote:
>
>> On Tue, Dec 8, 2009 at 10:37 PM, David Winsemius <dwinsemius at comcast.net
>> > wrote:
>>>
>>> On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:
>>>
>>>> I have the following code, which tests the split on a data.frame and
>>>> the split on each column (as vector) separately. The runtimes differ
>>>> by a factor of 10. When m and k increase, the difference becomes even
>>>> bigger.
>>>>
>>>> I'm wondering why the performance on data.frame is so bad. Is it a  bug
>>>> in R? Can it be improved?
>>>
>>> You might want to look at the data.table package. The author claims
>>> significant speed improvements over data.frames
>>
>> 'data.table' doesn't seem to help. You can try the other set of m,n,k.
>> In both cases, using as.data.frame is faster than using as.data.table.
>>
>> Please let me know if I understand what you meant.
>
> I was only suggesting that you look at it because it appeared in other 
> situations to have efficiency advantages. As it turned out, that structure 
> offered no advantage when I tested it.
>
> --
> David.
>
>
>>
>>> m=10
>>> n=6
>>> k=3
>>>
>>> #m=300000
>>> #n=6
>>> #k=30000
>>>
>>> set.seed(0)
>>> x=replicate(n,rnorm(m))
>>> f=sample(1:k, size=m, replace=T)
>>>
>>> library(data.table)
>> Loading required package: ref
>> dim(refdata) and dimnames(refdata) no longer allow parameter ref=TRUE,
>> use dim(derefdata(refdata)), dimnames(derefdata(refdata)) instead
>>> system.time(split(as.data.frame(x),f))
>>   user  system elapsed
>>  0.000   0.000   0.003
>>> system.time(split(as.data.table(x),f))
>>   user  system elapsed
>>  0.010   0.000   0.011
>>
>>>>> system.time(split(as.data.frame(x),f))
>>>>
>>>>  user  system elapsed
>>>>  1.700   0.010   1.786
>>>>>
>>>>> system.time(lapply(
>>>>
>>>> +         1:dim(x)[[2]]
>>>> +         , function(i) {
>>>> +           split(x[,i],f)
>>>> +         }
>>>> +         )
>>>> +     )
>>>>  user  system elapsed
>>>>  0.170   0.000   0.167
>>>>
>>>> ###########
>>>> m=30000
>>>> n=6
>>>> k=3000
>>>>
>>>> set.seed(0)
>>>> x=replicate(n,rnorm(m))
>>>> f=sample(1:k, size=m, replace=T)
>>>>
>>>> system.time(split(as.data.frame(x),f))
>>>>
>>>> system.time(lapply(
>>>>       1:dim(x)[[2]]
>>>>       , function(i) {
>>>>         split(x[,i],f)
>>>>       }
>>>>       )
>>>>   )
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>> David Winsemius, MD
>>> Heritage Laboratories
>>> West Hartford, CT
>>>
>>>
>>