[R] [FORGED] Splitting data.frame into a list of small data.frames given indices

Wed Jun 29 16:22:38 CEST 2016

Your answer did not make it to the list...

On 29/06/2016 at 16:06, Witold E Wolski wrote:
> If you do not understand, then why do you reply?
>
>
> On 29 June 2016 at 15:54, Ivan Calandra <ivan.calandra at univ-reims.fr> wrote:
>> Hi,
>>
>> I don't really understand why you split every row... This makes it very
>> slow. Try with a more realistic example (with a factor to split).
>>
>> Ivan
>>
>> --
>> Ivan Calandra, PhD
>> Scientific Mediator
>> University of Reims Champagne-Ardenne
>> GEGENAA - EA 3795
>> CREA - 2 esplanade Roland Garros
>> 51100 Reims, France
>> +33(0)3 26 77 36 89
>> ivan.calandra at univ-reims.fr
>> --
>> https://www.researchgate.net/profile/Ivan_Calandra
>> https://publons.com/author/705639/
>>
>>
>> On 29/06/2016 at 15:21, Witold E Wolski wrote:
>>> Hi,
>>>
>>> Here is a complete example which shows that the complexity of split or
>>> by is O(n^2)
>>>
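>>> nrows <- c(1e3, 5e3, 1e4, 5e4, 1e5, 2e5)
>>> res <- list()
>>>
>>> for (i in nrows) {
>>>     dum <- data.frame(x = runif(i, 1, 1000), y = runif(i, 1, 1000))
>>>     # one group per row, i.e. split() produces i one-row data frames
>>>     res[[length(res) + 1]] <- system.time(x <- split(dum, 1:nrow(dum)))
>>> }
>>> res <- do.call("rbind", res)
>>> # elapsed time against n^2; a straight line suggests quadratic growth
>>> plot(nrows^2, res[, "elapsed"])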
>>>
>>> And I can't see a reason why this has to be so slow.
>>>
>>>
>>> cheers
>>>
>>> On 29 June 2016 at 12:00, Rolf Turner <r.turner at auckland.ac.nz> wrote:
>>>> On 29/06/16 21:16, Witold E Wolski wrote:
>>>>> It's the inverse problem to merging a list of data.frames into a large
>>>>> data.frame just discussed in the "performance of do.call("rbind")"
>>>>> thread
>>>>>
>>>>> I would like to split a data.frame into a list of data.frames
>>>>> according to the first column.
>>>>> This SEEMS to be easily possible with the function base::by. However,
>>>>> as soon as the data.frame has a few million rows this function CANNOT
>>>>> BE USED (unless you have PLENTY OF TIME).
>>>>>
>>>>> For 'by', runtime ~ nrow^2, or formally O(n^2) (see benchmark below).
>>>>>
>>>>> So basically I am looking for a similar function with better complexity.
>>>>>
>>>>>
>>>>> > nrows <- c(1e5,1e6,2e6,3e6,5e6)
>>>>> > timing <- list()
>>>>> > for(i in nrows){
>>>>> + dum <- peaks[1:i,]
>>>>> + timing[[length(timing)+1]] <- system.time(x<- by(dum[,2:3],
>>>>> + INDICES=list(dum[,1]), FUN=function(x){x}, simplify = FALSE))
>>>>> + }
>>>>> > names(timing)<- nrows
>>>>> > timing
>>>>> $`1e+05`
>>>>>      user  system elapsed
>>>>>      0.05    0.00    0.05
>>>>>
>>>>> $`1e+06`
>>>>>      user  system elapsed
>>>>>      1.48    2.98    4.46
>>>>>
>>>>> $`2e+06`
>>>>>      user  system elapsed
>>>>>      7.25   11.39   18.65
>>>>>
>>>>> $`3e+06`
>>>>>      user  system elapsed
>>>>>     16.15   25.81   41.99
>>>>>
>>>>> $`5e+06`
>>>>>      user  system elapsed
>>>>>     43.22   74.72  118.09
>>>>
>>>> I'm not sure that I follow what you're doing, and your example is not
>>>> reproducible, since we have no idea what "peaks" is, but on a toy example
>>>> with 5e6 rows in the data frame I got a timing result of
>>>>
>>>>      user  system elapsed
>>>>     0.379   0.025   0.406
>>>>
>>>> when I applied split().  Is this adequately fast? Seems to me that if you
>>>> want to split something, split() would be a good place to start.
>>>>
>>>> cheers,
>>>>
>>>> Rolf Turner
>>>>
>>>> --
>>>> Technical Editor ANZJS
>>>> Department of Statistics
>>>> University of Auckland
>>>> Phone: +64-9-373-7599 ext. 88276
>>>
>>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
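
A minimal sketch of the split() approach suggested in the thread, run on a toy data frame because "peaks" was never posted. The column name id, the 100 groups and the 1e5 rows are assumptions chosen only for illustration. Timings depend on the machine and on the number of groups, so none is quoted here.

```r
# Toy stand-in for 'peaks': 'id' is a hypothetical grouping column that
# plays the role of the first column in the original data frame.
set.seed(1)
n <- 1e5
dum <- data.frame(id = sample(1:100, n, replace = TRUE),
                  x  = runif(n, 1, 1000),
                  y  = runif(n, 1, 1000))

# Split columns 2:3 by the values of column 1.
# The result is a named list with one data frame per distinct id.
timing <- system.time(by_id <- split(dum[, 2:3], dum[, 1]))

length(by_id)             # number of distinct ids
sum(sapply(by_id, nrow))  # equals nrow(dum): every row lands in one group
timing                    # elapsed time for this machine
```

Unlike split(dum, 1:nrow(dum)), which creates one data frame per row, this groups the rows by a column with few distinct values, which is the more realistic case asked for in Ivan's reply.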