[Rd] split() is slow on data.frame (PR#14123)

Charles C. Berry cberry at tajo.ucsd.edu
Thu Dec 10 00:44:29 CET 2009


On Wed, 9 Dec 2009, William Dunlap wrote:

> Here are some differences between the current and proposed
> split.data.frame.

Adding 'drop=FALSE' fixes this case. See the inline correction below.

Chuck

>
>> d<-data.frame(Matrix=I(matrix(1:10, ncol=2)),
> Named=c(one=1,two=2,three=3,four=4,five=5),
> row.names=as.character(1001:1005))
>> group<-c("A","B","A","A","B")
>> split.data.frame(d,group)
> $A
>     Matrix.1 Matrix.2 Named
> 1001        1        6     1
> 1003        3        8     3
> 1004        4        9     4
>
> $B
>     Matrix.1 Matrix.2 Named
> 1002        2        7     2
> 1005        5       10     5
>
>> mysplit.data.frame(d,group) # lost row.names and 2nd column of Matrix
> [1] "processing data.frame"
> $A
>     Matrix Named
> [1,]      1     1
> [2,]      3     3
> [3,]      4     4
>
> $B
>     Matrix Named
> [1,]      2     2
> [2,]      5     5
>
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>> -----Original Message-----
>> From: r-devel-bounces at r-project.org
>> [mailto:r-devel-bounces at r-project.org] On Behalf Of
>> pengyu.ut at gmail.com
>> Sent: Wednesday, December 09, 2009 2:10 PM
>> To: r-devel at stat.math.ethz.ch
>> Cc: R-bugs at r-project.org
>> Subject: [Rd] split() is slow on data.frame (PR#14123)
>>
>> Please see the following code for a runtime comparison between
>> split() and mysplit.data.frame() (they do the same thing
>> semantically). mysplit.data.frame() is a fix for split() in terms of
>> performance. Could somebody include this fix (with possible checks
>> for corner cases) in a future version of R and let me know when it
>> is included?
>>
>> m=300000
>> n=6
>> k=30000
>>
>> set.seed(0)
>> x=replicate(n,rnorm(m))
>> f=sample(1:k, size=m, replace=TRUE)
>>
>> mysplit.data.frame<-function(x,f) {
>>   print('processing data.frame')
>>   v=lapply(
>>       1:dim(x)[[2]]
>>       , function(i) {
>>         split(x[,i],f)

Change to:

          split(x[,i,drop=FALSE],f)
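
To illustrate why this helps, here is a minimal sketch using the small data frame 'd' and the 'group' vector from Bill's example above. It only checks the dispatch and row-name behaviour; it makes no claim about timing, and the names 'pieces' and 'm_pieces' are made up for the illustration.

```r
# Toy data frame from the quoted example above.
d <- data.frame(Matrix = I(matrix(1:10, ncol = 2)),
                Named = c(one = 1, two = 2, three = 3, four = 4, five = 5),
                row.names = as.character(1001:1005))
group <- c("A", "B", "A", "A", "B")

# Without drop = FALSE, a single column is no longer a data.frame,
# so split() falls through to the default (vector) method.
is.data.frame(d[, 2])                 # FALSE

# With drop = FALSE, the column stays a one-column data.frame,
# so split() dispatches to split.data.frame() and row names survive.
is.data.frame(d[, 2, drop = FALSE])   # TRUE

pieces <- split(d[, 2, drop = FALSE], group)
rownames(pieces$A)                    # "1001" "1003" "1004"
rownames(pieces$B)                    # "1002" "1005"

# The same call on the matrix column; inspect the shape of one piece.
m_pieces <- split(d[, 1, drop = FALSE], group)
dim(m_pieces$A$Matrix)
```

Note that with this change every column goes through split.data.frame(), so whether the speed advantage of mysplit.data.frame() survives would need to be re-timed.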


>>       }
>>       )
>>
>>   w=lapply(
>>       seq(along=v[[1]])
>>       , function(i) {
>>         result=do.call(
>>             cbind
>>             , lapply(v,
>>                 function(vj) {
>>                   vj[[i]]
>>                 }
>>                 )
>>             )
>>         colnames(result)=colnames(x)
>>         return(result)
>>       }
>>       )
>>   names(w)=names(v[[1]])
>>   return(w)
>> }
>>
>> system.time(split(as.data.frame(x),f))
>> system.time(mysplit.data.frame(as.data.frame(x),f))
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901


