[R] Significant performance difference between split of a data.frame and split of vectors

David Winsemius dwinsemius at comcast.net
Wed Dec 9 21:42:10 CET 2009


On Dec 9, 2009, at 2:59 PM, Peng Yu wrote:

> On Tue, Dec 8, 2009 at 10:37 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>
>> On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:
>>
>>> I have the following code, which tests split on a data.frame and
>>> split on each column (as a vector) separately. The runtimes differ
>>> by a factor of 10. When m and k increase, the difference becomes
>>> even bigger.
>>>
>>> I'm wondering why the performance on data.frame is so bad. Is it a
>>> bug in R? Can it be improved?
>>
>> You might want to look at the data.table package. The author claims
>> significant speed improvements over data.frames.
>
> 'data.table' doesn't seem to help. You can try the other set of m,n,k.
> In both cases, using as.data.frame is faster than using as.data.table.
>
> Please let me know if I understand what you meant.

I was only suggesting that you look at it because it appeared in other
situations to have efficiency advantages. As it turned out, that
structure offered no advantage when I tested it.
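
For anyone who wants to rerun the comparison discussed in this thread, here is a minimal sketch. It reuses the names m, n, k, x and f from the quoted code; the sizes are assumptions chosen only so that it finishes quickly, and the names by_df, by_col, t_df and t_vec are made up for illustration. As far as I can tell, in current R versions split() on a data.frame splits the row indices and then subsets the data.frame once per group, so the cost grows with the number of groups k. Timings will differ by machine and R version, so none are shown.

```r
# Sketch only: small sizes (assumed) so it runs in well under a second.
set.seed(0)
m <- 3000; n <- 6; k <- 300
x <- replicate(n, rnorm(m))                  # m x n numeric matrix
f <- sample(1:k, size = m, replace = TRUE)   # grouping vector

# Approach 1: split the data.frame (one data.frame subset per group)
t_df <- system.time(by_df <- split(as.data.frame(x), f))

# Approach 2: split each column separately as a plain vector
t_vec <- system.time(
  by_col <- lapply(seq_len(ncol(x)), function(i) split(x[, i], f))
)

# Both results hold the same values, just nested differently:
# by_df[[group]][[column]] versus by_col[[column]][[group]]
g <- names(by_df)[1]
stopifnot(isTRUE(all.equal(by_df[[g]][[1]], by_col[[1]][[g]])))

# Show user/system/elapsed times side by side
print(rbind(data.frame = t_df, vectors = t_vec)[, 1:3])
```

The per-column result is indexed column first, then group, so code that expects one data.frame per group would need to reassemble the pieces.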

--
David.


>
>> m=10
>> n=6
>> k=3
>>
>> #m=300000
>> #n=6
>> #k=30000
>>
>> set.seed(0)
>> x=replicate(n,rnorm(m))
>> f=sample(1:k, size=m, replace=T)
>>
>> library(data.table)
> Loading required package: ref
> dim(refdata) and dimnames(refdata) no longer allow parameter ref=TRUE,
> use dim(derefdata(refdata)), dimnames(derefdata(refdata)) instead
>> system.time(split(as.data.frame(x),f))
>   user  system elapsed
>  0.000   0.000   0.003
>> system.time(split(as.data.table(x),f))
>   user  system elapsed
>  0.010   0.000   0.011
>
>>>> system.time(split(as.data.frame(x),f))
>>>
>>>  user  system elapsed
>>>  1.700   0.010   1.786
>>>>
>>>> system.time(lapply(
>>>
>>> +         1:dim(x)[[2]]
>>> +         , function(i) {
>>> +           split(x[,i],f)
>>> +         }
>>> +         )
>>> +     )
>>>  user  system elapsed
>>>  0.170   0.000   0.167
>>>
>>> ###########
>>> m=30000
>>> n=6
>>> k=3000
>>>
>>> set.seed(0)
>>> x=replicate(n,rnorm(m))
>>> f=sample(1:k, size=m, replace=T)
>>>
>>> system.time(split(as.data.frame(x),f))
>>>
>>> system.time(lapply(
>>>       1:dim(x)[[2]]
>>>       , function(i) {
>>>         split(x[,i],f)
>>>       }
>>>       )
>>>   )
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> David Winsemius, MD
>> Heritage Laboratories
>> West Hartford, CT
>>
>>
>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



