[R] Significant performance difference between split of a data.frame and split of vectors

David Winsemius dwinsemius at comcast.net
Wed Dec 9 06:06:02 CET 2009


On Dec 9, 2009, at 12:00 AM, Peng Yu wrote:

> On Tue, Dec 8, 2009 at 10:37 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>
>> On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:
>>
>>> I have the following code, which times split() on a data.frame against
>>> split() on each column (as a vector) separately. The runtimes differ by
>>> a factor of 10. When m and k increase, the difference becomes even
>>> bigger.
>>>
>>> I'm wondering why the performance on a data.frame is so bad. Is it a bug
>>> in R? Can it be improved?
>>
>> You might want to look at the data.table package. The author claims
>> significant speed improvements over data.frames.
>
> This bug was found a long time ago, and a package has been
> developed for it. Should the fix be integrated into data.frame rather
> than implemented in an additional package?

What bug?
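
The gap looks like expected overhead rather than a bug. As far as I can
tell from the base R source, split() on a data.frame splits the row
indices by f and then subsets the data.frame once per group with
`[.data.frame`, which costs far more per call than subsetting an atomic
vector. With k = 3000 groups that means 3000 data.frame subsets. A rough
sketch of the idea follows; the names idx, by_df and by_mat are mine, and
the sizes are reduced from the original post so that it runs quickly:

```r
# Same setup as in the original post, with smaller sizes (an assumption
# made here only to keep the example fast).
m <- 3000   # number of rows
n <- 6      # number of columns
k <- 300    # number of groups

set.seed(0)
x  <- replicate(n, rnorm(m))                  # m x n numeric matrix
f  <- sample(1:k, size = m, replace = TRUE)   # grouping vector
df <- as.data.frame(x)

# Split the row indices once; this step is cheap.
idx <- split(seq_len(nrow(df)), f)

# Roughly what split() does for a data.frame: one `[.data.frame` call
# per group, which is where most of the time goes.
print(system.time(by_df <- lapply(idx, function(i) df[i, , drop = FALSE])))

# Reusing the same indices on the matrix avoids the data.frame overhead.
print(system.time(by_mat <- lapply(idx, function(i) x[i, , drop = FALSE])))
```

If the per-group pieces do not need to be data.frames, keeping the data
as a matrix (or splitting the columns as vectors, as in your second
timing) sidesteps most of the cost.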

>
>> David.
>>>
>>> > system.time(split(as.data.frame(x),f))
>>>  user  system elapsed
>>>  1.700   0.010   1.786
>>> > system.time(lapply(
>>> +         1:dim(x)[[2]]
>>> +         , function(i) {
>>> +           split(x[,i],f)
>>> +         }
>>> +         )
>>> +     )
>>>  user  system elapsed
>>>  0.170   0.000   0.167
>>>
>>> ###########
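>>> m <- 30000   # number of rows
>>> n <- 6       # number of columns
>>> k <- 3000    # number of groups
>>>
>>> set.seed(0)
>>> x <- replicate(n, rnorm(m))                  # m x n numeric matrix
>>> f <- sample(1:k, size = m, replace = TRUE)   # grouping vector
>>>
>>> # split the whole data.frame at once
>>> system.time(split(as.data.frame(x), f))
>>>
>>> # split each column (a plain vector) separately
>>> system.time(lapply(
>>>   seq_len(ncol(x)),
>>>   function(i) split(x[, i], f)
>>> ))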
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> David Winsemius, MD
>> Heritage Laboratories
>> West Hartford, CT
>>
>>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



