[Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

Fri May 4 04:20:50 CEST 2012

A bit late and possibly tangential. 

The mmap package has something called struct(), which is really a row-wise array of heterogeneous columns.

As Simon and others have pointed out, R has no way to handle this natively, but mmap does provide a measurable performance gain by storing each row's fields together in memory (mapped memory, to be specific).  Since it all lives "outside of R", so to speak, mmap even supports many non-native types, from bit vectors to 64-bit ints, with the applicable conversion caveats.
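
To make the row-oriented layout concrete, here is a minimal sketch in base R only. It is not how mmap is implemented and does not use mmap's API; the names `rec_size`, `buf` and `read_row` are made up for illustration. It packs each row's fields next to each other in a raw buffer and reads one row back by byte offset:

```r
# Illustration only: base R, not the mmap package and not its internals.
df <- data.frame(x = 1:3, y = c(0.5, 1.5, 2.5))

rec_size <- 4L + 8L  # one 4-byte integer followed by one 8-byte double per row

# Pack rows one after another: x[1], y[1], x[2], y[2], ...
con <- rawConnection(raw(0), "wb")
for (i in seq_len(nrow(df))) {
  writeBin(df$x[i], con, size = 4L)
  writeBin(df$y[i], con, size = 8L)
}
buf <- rawConnectionValue(con)
close(con)

# Read a single row back by byte offset, touching only that row's bytes.
read_row <- function(buf, i) {
  start <- (i - 1L) * rec_size
  bytes <- buf[(start + 1L):(start + rec_size)]
  list(x = readBin(bytes[1:4], "integer", size = 4L),
       y = readBin(bytes[5:12], "double", size = 8L))
}

row2 <- read_row(buf, 2L)
```

The point of the layout is that fetching row i is one contiguous read of `rec_size` bytes, instead of one lookup per column.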

example(struct) shows some performance gains with this approach. 

There are even some crude methods to convert existing data.frames directly to an mmap struct object (hint: as.mmap).

Again, likely not enough to shoehorn into your effort, but worth a look to see if it might be useful, and/or see the C design underlying it. 
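
On the original question, a list of rows can also be built in base R by walking the columns in parallel, which avoids creating one data frame per row. This is a minimal sketch, assuming plain lists (rather than one-row data frames) are acceptable as rows; `fd` mirrors the data frame from the original post and `rows` is just an illustrative name:

```r
# Same shape as the data frame in the original post; stringsAsFactors = FALSE
# keeps id as character rather than factor.
fd <- data.frame(x = 1:2000, y = rnorm(2000),
                 id = paste0("x", 1:2000), stringsAsFactors = FALSE)

# Map() walks all columns in parallel and calls list() once per row,
# so each element of `rows` is a plain list with fields x, y and id.
rows <- do.call(Map, c(list(f = list), as.list(fd)))
```

Each element is still a generic vector holding length-1 vectors, so Simon's point about row-wise inefficiency applies; the sketch only sidesteps the per-row data.frame construction.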

Best,
Jeff

Jeffrey Ryan    |    Founder    |    jeffrey.ryan at lemnica.com

www.lemnica.com

On May 1, 2012, at 1:44 PM, Antonio Piccolboni <antonio at piccolboni.info> wrote:

> On Tue, May 1, 2012 at 11:29 AM, Simon Urbanek
> <simon.urbanek at r-project.org> wrote:
> 
>> 
>> On May 1, 2012, at 1:26 PM, Antonio Piccolboni <antonio at piccolboni.info>
>> wrote:
>> 
>>> It seems like people need to hear more context, happy to provide it. I am
>>> implementing a serialization format (typedbytes, HADOOP-1722 if people want
>>> the gory details) to make R and Hadoop interoperate better (RHadoop
>>> project, package rmr). It is a row first format and it's already
>>> implemented as a C extension for R for lists and atomic vectors, where each
>>> element of a vector is a row. I need to extend it to accept data frames
>>> and I was wondering if I can use the existing C code by converting a data
>>> frame to a list of its rows. It sounds like the answer is that it is not a
>>> good idea,
>> 
>> Just think about it -- data frames are lists of *columns* because the type
>> of each column is fixed. Treating them row-wise is extremely inefficient,
>> because you can't use any vector type to represent such thing (other than a
>> generic vector containing vectors of length 1).
>> 
> 
> Thanks, let's say this together with the experiments and other converging
> opinions lays the question to rest.
> 
> 
>>> that's helpful too in a way because it restricts the options. I
>>> thought I may be missing a simple primitive, like a t() for data frames
>>> (that doesn't coerce to matrix).
>> 
>> See above - I think you are misunderstanding data frames - t() makes no
>> sense for data frames.
>> 
> 
> I think you are misunderstanding my use of t(). Thanks
> 
> 
> Antonio
> 
> 
>> 
>> Cheers,
>> Simon
>> 
>> 
>> 
>>> On Tue, May 1, 2012 at 5:46 AM, Prof Brian Ripley <ripley at stats.ox.ac.uk>
>>> wrote:
>>> 
>>>> On 01/05/2012 00:28, Antonio Piccolboni wrote:
>>>> 
>>>>> Hi,
>>>>> I was wondering if there is anything more efficient than split to do the
>>>>> kind of conversion in the subject. If I create a data frame as in
>>>>> 
>>>>> system.time({fd = data.frame(x=1:2000, y = rnorm(2000), id = paste("x", 1:2000, sep =""))})
>>>>> user  system elapsed
>>>>> 0.004   0.000   0.004
>>>>> 
>>>>> and then I try to split it
>>>>> 
>>>>> system.time(split(fd, 1:nrow(fd)))
>>>>> user  system elapsed
>>>>> 0.333   0.031   0.415
>>>>> 
>>>>> 
>>>>> You will be quick to notice the roughly two orders of magnitude difference
>>>>> in time between creation and conversion. Granted, it's not written anywhere
>>>>> 
>>>> 
>>>> Unsurprising when you create three orders of magnitude more data frames,
>>>> is it?  That's a list of 2000 data frames.  Try
>>>> 
>>>> system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id =
>>>> paste0("x", i)))
>>>> 
>>>> 
>>>> 
>>>>> that they should be similar but the latter seems interpreter-slow to me
>>>>> (split is implemented with a lapply in the data frame case) There is also a
>>>>> memory issue when I hit about 20000 elements (allocating 3GB when
>>>>> interrupted). So before I resort to Rcpp, despite the electrifying feeling
>>>>> of approaching the bare metal and for the sake of getting things done, I
>>>>> thought I would ask the experts. Thanks
>>>>> 
>>>> 
>>>> You need to re-think your data structures: 1-row data frames are not
>>>> sensible.
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> Antonio
>>>>> 
>>>>>     [[alternative HTML version deleted]]
>>>>> 
>>>>> 
>>>>> ______________________________________________
>>>>> R-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Brian D. Ripley,                  ripley at stats.ox.ac.uk
>>>> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>>>> University of Oxford,             Tel:  +44 1865 272861 (self)
>>>> 1 South Parks Road,                     +44 1865 272866 (PA)
>>>> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>>>> 