[R] [Rd] split() is slow on data.frame (PR#14123)

Thu Dec 10 04:38:33 CET 2009

Sorry. I sent this to r-help by mistake. Could somebody help delete it
from the archive?

On Wed, Dec 9, 2009 at 9:29 PM, Peng Yu <pengyu.ut at gmail.com> wrote:
> I made a version for matrices, because it is more efficient to split
> each column of a matrix directly than to convert the matrix to a
> data.frame and then call split() on the data.frame. Note that the
> matrix and data.frame versions differ slightly. Would somebody add
> this to R as well?
>
> split.matrix<-function(x,f) {
>  #print('processing matrix')
>  v=lapply(
>      1:dim(x)[[2]]
>      , function(i) {
>        base:::split.default(x[,i],f)#the difference is here
>      }
>      )
>
>  w=lapply(
>      seq(along=v[[1]])
>      , function(i) {
>        result=do.call(
>            cbind
>            , lapply(v,
>                function(vj) {
>                  vj[[i]]
>                }
>                )
>            )
>        colnames(result)=colnames(x)
>        return(result)
>      }
>      )
>  names(w)=names(v[[1]])
>  return(w)
> }
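
For comparison, here is a minimal sketch of a different approach for
matrices: split the row indices once, then subset the matrix with
drop=FALSE so that row names and column names are carried along. This is
not the column-wise method quoted above, split_matrix_rows is a
hypothetical name rather than an existing R function, and I have not
timed it against split.matrix.

```r
# Hypothetical helper (not part of base R): split the rows of a matrix
# into a list of sub-matrices, one per level of f.
split_matrix_rows <- function(x, f) {
  stopifnot(is.matrix(x), length(f) == nrow(x))
  # Split the row indices once instead of splitting every column.
  idx <- split(seq_len(nrow(x)), f)
  # drop = FALSE keeps each piece a matrix, even for a single-row group.
  lapply(idx, function(i) x[i, , drop = FALSE])
}

m <- matrix(1:10, ncol = 2,
            dimnames = list(as.character(1001:1005), c("a", "b")))
group <- c("A", "B", "A", "A", "B")
res <- split_matrix_rows(m, group)
```

Each element of res is a matrix holding the rows of one group, with the
original row and column names intact.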
>
>
> On Wed, Dec 9, 2009 at 5:44 PM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:
>> On Wed, 9 Dec 2009, William Dunlap wrote:
>>
>>> Here are some differences between the current and proposed
>>> split.data.frame.
>>
>> Adding 'drop=FALSE' fixes this case. See the inline correction below.
>>
>> Chuck
>>
>>>
>>>> d<-data.frame(Matrix=I(matrix(1:10, ncol=2)),
>>>   Named=c(one=1,two=2,three=3,four=4,five=5),
>>>   row.names=as.character(1001:1005))
>>>> group<-c("A","B","A","A","B")
>>>> split.data.frame(d,group)
>>>
>>> $A
>>>    Matrix.1 Matrix.2 Named
>>> 1001        1        6     1
>>> 1003        3        8     3
>>> 1004        4        9     4
>>>
>>> $B
>>>    Matrix.1 Matrix.2 Named
>>> 1002        2        7     2
>>> 1005        5       10     5
>>>
>>>> mysplit.data.frame(d,group) # lost row.names and 2nd column of Matrix
>>>
>>> [1] "processing data.frame"
>>> $A
>>>    Matrix Named
>>> [1,]      1     1
>>> [2,]      3     3
>>> [3,]      4     4
>>>
>>> $B
>>>    Matrix Named
>>> [1,]      2     2
>>> [2,]      5     5
>>>
>>>
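
A check along these lines could catch the lost row names shown above.
This is only a minimal sketch, assuming the result of base split() is
the reference. same_groups_and_rows is a hypothetical helper name, and
it compares only group names and row names, not the cell values.

```r
# Data taken from the example above.
d <- data.frame(Matrix = I(matrix(1:10, ncol = 2)),
                Named = c(one = 1, two = 2, three = 3, four = 4, five = 5),
                row.names = as.character(1001:1005))
group <- c("A", "B", "A", "A", "B")

# Reference result from base R.
ref <- split(d, group)

# Hypothetical helper: TRUE if both results have the same group names
# and every group carries the same row names.
same_groups_and_rows <- function(candidate, reference) {
  identical(names(candidate), names(reference)) &&
    all(mapply(function(a, b) identical(rownames(a), rownames(b)),
               candidate, reference))
}

# A stand-in for a result that dropped its row names.
stripped <- lapply(ref, function(p) matrix(unname(p$Named), ncol = 1))

ok_ref <- same_groups_and_rows(ref, ref)
ok_stripped <- same_groups_and_rows(stripped, ref)
```

A fuller check would also compare the columns themselves, including the
two-column Matrix component.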
>>> Bill Dunlap
>>> Spotfire, TIBCO Software
>>> wdunlap tibco.com
>>>
>>>> -----Original Message-----
>>>> From: r-devel-bounces at r-project.org
>>>> [mailto:r-devel-bounces at r-project.org] On Behalf Of
>>>> pengyu.ut at gmail.com
>>>> Sent: Wednesday, December 09, 2009 2:10 PM
>>>> To: r-devel at stat.math.ethz.ch
>>>> Cc: R-bugs at r-project.org
>>>> Subject: [Rd] split() is slow on data.frame (PR#14123)
>>>>
>>>> Please see the following code for a runtime comparison between
>>>> split() and mysplit.data.frame() (they do the same thing
>>>> semantically). mysplit.data.frame() is a fix for split() in terms of
>>>> performance. Could somebody include this fix (possibly with checks
>>>> for corner cases) in a future version of R and let me know when it
>>>> has been included?
>>>>
>>>> m=300000
>>>> n=6
>>>> k=30000
>>>>
>>>> set.seed(0)
>>>> x=replicate(n,rnorm(m))
>>>> f=sample(1:k, size=m, replace=TRUE)
>>>>
>>>> mysplit.data.frame<-function(x,f) {
>>>>  print('processing data.frame')
>>>>  v=lapply(
>>>>      1:dim(x)[[2]]
>>>>      , function(i) {
>>>>        split(x[,i],f)
>>
>> Change to:
>>
>>         split(x[,i,drop=FALSE],f)
>>
>>
>>>>      }
>>>>      )
>>>>
>>>>  w=lapply(
>>>>      seq(along=v[[1]])
>>>>      , function(i) {
>>>>        result=do.call(
>>>>            cbind
>>>>            , lapply(v,
>>>>                function(vj) {
>>>>                  vj[[i]]
>>>>                }
>>>>                )
>>>>            )
>>>>        colnames(result)=colnames(x)
>>>>        return(result)
>>>>      }
>>>>      )
>>>>  names(w)=names(v[[1]])
>>>>  return(w)
>>>> }
>>>>
>>>> system.time(split(as.data.frame(x),f))
>>>> system.time(mysplit.data.frame(as.data.frame(x),f))
>>>>
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>>
>>
>> Charles C. Berry                            (858) 534-2098
>>                                             Dept of Family/Preventive Medicine
>> E mailto:cberry at tajo.ucsd.edu               UC San Diego
>> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>>
>>
>>
>