[R] Splitting data.frame into a list of small data.frames given indices

Witold E Wolski wewolski at gmail.com
Wed Jun 29 11:16:56 CEST 2016


This is the inverse of the problem of merging a list of data.frames
into a large data.frame, just discussed in the
"performance of do.call("rbind")" thread.

I would like to split a data.frame into a list of data.frames
according to the first column.
This SEEMS to be easily possible with the function base::by. However,
as soon as the data.frame has a few million rows, this function CANNOT
BE USED (unless you have PLENTY OF TIME).

For 'by', the runtime grows roughly as nrow^2, i.e. O(n^2) (see benchmark below).

So basically I am looking for a similar function with better complexity.
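
For reference, one candidate worth timing is base::split, which has a
data.frame method and returns a list of data.frames, one per level of the
grouping vector. The sketch below is only an illustration on synthetic
data: 'peaks_demo' and its column names (id, mz, intensity) are
hypothetical stand-ins, since the real 'peaks' data is not shown, and
whether split scales better than 'by' on the real data would have to be
measured.

```r
# Hypothetical stand-in for the 'peaks' data.frame (the real data is not shown).
set.seed(1)
n <- 1e5
peaks_demo <- data.frame(
  id        = sample.int(1000, n, replace = TRUE),  # grouping column
  mz        = runif(n),                             # assumed payload column
  intensity = runif(n)                              # assumed payload column
)

# Split columns 2:3 into a list of data.frames, one per distinct value
# of the first column. List names are the group values as strings.
parts <- split(peaks_demo[, 2:3], peaks_demo[, 1])

# Sanity checks: one list element per group, and no rows lost.
length(parts) == length(unique(peaks_demo$id))
sum(vapply(parts, nrow, integer(1))) == n
```

Wrapping the split call in system.time for the same row counts as in the
benchmark would give a direct comparison with 'by'.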


> nrows <- c(1e5, 1e6, 2e6, 3e6, 5e6)
> timing <- list()
> for(i in nrows){
+ dum <- peaks[1:i,]
+ timing[[length(timing)+1]] <- system.time(x <- by(dum[,2:3],
+ INDICES=list(dum[,1]), FUN=function(x){x}, simplify = FALSE))
+ }
> names(timing) <- nrows
> timing
$`1e+05`
   user  system elapsed
   0.05    0.00    0.05

$`1e+06`
   user  system elapsed
   1.48    2.98    4.46

$`2e+06`
   user  system elapsed
   7.25   11.39   18.65

$`3e+06`
   user  system elapsed
  16.15   25.81   41.99

$`5e+06`
   user  system elapsed
  43.22   74.72  118.09
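
The growth rate can be checked against the elapsed times above with a
linear fit on the log-log scale. This is a rough sketch: it uses only the
five elapsed values from the output, and the 0.05 s measurement is too
short to be very precise. The fitted exponent comes out close to 2.

```r
# Elapsed times (seconds) copied from the benchmark output above.
nrows   <- c(1e5, 1e6, 2e6, 3e6, 5e6)
elapsed <- c(0.05, 4.46, 18.65, 41.99, 118.09)

# If elapsed ~ c * nrows^k, then log(elapsed) = log(c) + k * log(nrows),
# so the slope of a linear fit on the log-log scale estimates k.
fit <- lm(log(elapsed) ~ log(nrows))
k <- unname(coef(fit)[2])
k
```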





-- 
Witold Eryk Wolski


