[R] Creating a custom connection to read from multiple files

Liaw, Andy andy_liaw at merck.com
Thu Jan 20 13:51:29 CET 2005


> From: Prof Brian Ripley
> 
> On Thu, 20 Jan 2005, Tomas Kalibera wrote:
> 
> > Dear Prof Ripley,
> >
> > thanks for your suggestions; it's very nice that one can create custom 
> > connections directly in R, and I think it is what I need just now.
> >
> >> However, what is wrong with reading a file at a time and combining 
> >> the results in R using rbind?
> >> 
> > Well, the problem is performance. If I concatenate all those files, 
> > they total around 8 MB, and that can grow to tens of MB in the near 
> > future.
> >
> > Both concatenating and reading from a single file by scan takes 
> > 5 seconds (which is almost OK).
> >
> > However, reading the individual files with read.table and rbinding 
> > them one by one (samples <- rbind(samples, newSamples)) takes 
> > minutes.  The same happens when I concatenate lists manually; scan 
> > does not help significantly.  I guess there is some overhead in 
> > detecting dimensions of objects in rbind (?), or re-allocation, or 
> > copying of data?
> 
> rbind is vectorized so you are using it (way) suboptimally.

Here's an example:

>  ## Create a 500 x 100 data matrix.
>  x <- matrix(rnorm(5e4), 500, 100)
>  ## Generate 50 filenames.
>  fname <- paste("f", formatC(1:50, width=2, flag="0"), ".txt", sep="")
>  ## Write the data to files 50 times.
>  for (f in fname) write(t(x), file=f, ncol=ncol(x))
>  
>  ## Read the files into a list of data frames.
>  system.time(datList <- lapply(fname, read.table, header=FALSE),
+              gcFirst=TRUE)
[1] 11.91  0.05 12.33    NA    NA
>  ## Specify colClasses to speed up.
>  system.time(datList <- lapply(fname, read.table,
+              colClasses=rep("numeric", 100)),
+              gcFirst=TRUE)
[1] 10.69  0.07 10.79    NA    NA
>  ## Stack them together.
>  system.time(dat <- do.call("rbind", datList), gcFirst=TRUE)
[1] 5.34 0.09 5.45   NA   NA
>  
>  ## Use matrices instead of data frames.
>  system.time(datList <- lapply(fname, 
+      function(f) matrix(scan(f), ncol=100, byrow=TRUE)), gcFirst=TRUE)
Read 50000 items
...
Read 50000 items
[1]  9.49  0.08 15.06    NA    NA
>  system.time(dat <- do.call("rbind", datList), gcFirst=TRUE)
[1] 0.09 0.03 0.12   NA   NA
>  ## Clean up the files.
>  unlink(fname)
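
The one-at-a-time rbind() described above and the single do.call() used
in the transcript differ like this; a minimal sketch (the function names
are mine, for illustration only):

```r
## Growing a result with rbind() inside a loop re-copies everything
## accumulated so far on every iteration -- quadratic total work.
grow_loop <- function(pieces) {
    out <- NULL
    for (p in pieces) out <- rbind(out, p)   # repeated copying
    out
}

## Vectorized use: hand rbind() all the pieces at once via do.call(),
## so the result is allocated essentially once.
grow_once <- function(pieces) do.call("rbind", pieces)

pieces <- lapply(1:50, function(i) matrix(rnorm(1000), ncol = 10))
identical(grow_loop(pieces), grow_once(pieces))  # same result,
                                                 # very different cost
```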

A couple of points:

- Usually specifying colClasses will make read.table() quite a bit 
  faster, even though it's only marginally faster here.  Look back
  in the list archive to see examples.

- If your data files are entirely numeric (as in this example), 
  storing them in matrices will be much more efficient.  Note
  the difference in rbind()ing the 50 data frames and the 50 
  matrices (5.34 seconds vs. 0.09!).  rbind.data.frame()
  needs to ensure that the resulting data frame has unique
  row names (a requirement for a legit data frame), and
  that's probably taking a big chunk of the time.
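
To see that row-name bookkeeping concretely, here is a small sketch
(the variable names are mine):

```r
## rbind.data.frame() must keep the result's row names unique
## (a data frame requirement), so it de-duplicates clashing names.
a <- data.frame(x = 1:2, row.names = c("r1", "r2"))
b <- data.frame(x = 3:4, row.names = c("r1", "r2"))
stacked <- rbind(a, b)
anyDuplicated(rownames(stacked))  # 0: the clashes were renamed

## A matrix has no such constraint (dimnames are optional and may
## repeat), so rbind() on matrices just copies the blocks.
m <- rbind(matrix(1:4, nrow = 2), matrix(5:8, nrow = 2))
dim(m)  # 4 2
```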

Andy

 
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> 
