[R] alternative to rbind within a loop

Greg Snow Greg.Snow at imail.org
Thu Jul 23 23:27:47 CEST 2009


Try something like (untested):

> mylist <- lapply(all.files, read.csv)
> mydf <- do.call('rbind', mylist)

If all the csv files are conformable, so that rbind works on them (if your loop method works, that should be the case), then this will read in each file, store the resulting data frames in a list, and rbind them all together in a single call.
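
If you also need the date-time conversion and the column dropping you describe, those can go inside the function applied to each file, so all the per-file work happens before the single rbind. A sketch along those lines (again untested; the column names "datetime" and "unused" are placeholders, substitute your own):

read.one <- function(f) {
    x <- read.csv(f)
    # placeholder: convert a character column to date-time; adjust the
    # column name and format string to match your files
    x$datetime <- as.POSIXct(x$datetime, format = "%Y-%m-%d %H:%M:%S")
    # placeholder: drop a variable you don't need
    x$unused <- NULL
    x
}

mydf <- do.call('rbind', lapply(all.files, read.one))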

This should be faster than the loop, because it avoids repeatedly copying the ever-growing master data frame, but timing it on your own files is the only way to be sure.
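
For a quick check against the numbers from your test script, you can time the one-shot version the same way (a sketch, using the all.files vector from your script):

> system.time(new.data2 <- do.call('rbind', lapply(all.files, read.csv)))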

Hope this helps,

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111


> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Denis Chabot
> Sent: Thursday, July 23, 2009 1:54 PM
> To: list R
> Subject: [R] alternative to rbind within a loop
> 
> Hi,
> 
> I often have to do this:
> 
> select a folder (directory) containing a few hundred data files in csv
> format (up to 1000 files, in fact)
> 
> open each file, transform some character variables to date-time format
> 
> make it into a dataframe (this involves getting rid of a few variables
> I don't need)
> 
> concatenate to the master dataframe that will eventually contain the
> data from all the files in the folder.
> 
> I use a loop going from 1 to the number of files. I have added a
> command to print an incrementing number to the R console each time the
> loop completes one iteration, to judge the speed of the process.
> 
> At the beginning, 3-4 files are processed each second. After a few
> hundred iterations it slows down to about 1 file per second. Before I
> reach the last file (898 in the case at hand), it has become much
> slower, about 1 file every 2-3 seconds.
> 
> This progressive slowing down suggests the problem is linked to the
> size of the growing "master" dataframe that rbind combines with each
> new file.
> 
> In fact, the small script below confirms this, as nothing at all
> happens within the loop but rbind. You can cut the size of this
> example down so as not to waste too much of your time:
> 
> 
> # create a dummy data.frame and copy it in a large number of csv files
> 
> test <- file.path("test")
> dir.create(test, showWarnings=FALSE)  # make sure the folder exists
> 
> a <- 1:350
> b <- rnorm(350,100,10)
> c <- runif(350, 0, 100)
> d <- month.name[sample(1:12, 350, replace=TRUE)]
> 
> the.data <- data.frame(a,b,c,d)
> 
> for(i in 1:850){
> 	write.csv(the.data, file=paste(test, "/file_", i, ".csv", sep=""))
> }
> 
> # now lets make a single dataframe from all these csv files
> 
> all.files <- list.files(path=test, full.names=TRUE, pattern="\\.csv$")
> 
> new.data <- NULL
> 
> system.time({
> 	for(i in all.files){
> 		in.data <- read.csv(i)
> 		if (is.null(new.data)) {
> 			new.data <- in.data
> 		} else {
> 			new.data <- rbind(new.data, in.data)
> 		}
> 		cat(paste(i, ", ", sep=""))
> 	} # end for
> }) # end system.time
> 
>        user      system     elapsed
>     156.206      44.859     202.150
> 
> This is with:
> 
> sessionInfo()
> R version 2.9.1 Patched (2009-07-16 r48939)
> x86_64-apple-darwin9.7.0
> 
> locale:
> fr_CA.UTF-8/fr_CA.UTF-8/C/C/fr_CA.UTF-8/fr_CA.UTF-8
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] doBy_3.7        chron_2.3-30    timeDate_290.84
> 
> loaded via a namespace (and not attached):
> [1] cluster_1.12.0  grid_2.9.1      Hmisc_3.5-2     lattice_0.17-25
> tools_2.9.1
> 
> 
> Would it be better to somehow save all 850 files in one dataframe
> each, and then rbind them all in a single operation?
> 
> Can I combine all my files without using a loop? I've never quite
> mastered the "apply" family of functions, and I have not seen
> examples that use them to read files.
> 
> Thanks in advance,
> 
> Denis Chabot
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



