[R] Creating a custom connection to read from multiple files

Fri Jan 21 14:30:21 CET 2005

Hello Andy,

thanks for your examples. I rewrote everything to use matrices and
lapply/sapply and rbind calls instead of for loops and appends, and it
really helped. Reading files one by one and concatenating them is now
even faster than concatenating on disk; that 8MB table is read in 3.5 seconds.
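
A minimal sketch of that kind of rewrite, assuming purely numeric chunks. The helper names grow_by_append and stack_at_once and the toy data are hypothetical, not the code from this thread:

```r
## Hypothetical toy data: 20 chunks, each a 50 x 10 numeric matrix.
set.seed(1)
chunks <- lapply(1:20, function(i) matrix(rnorm(500), 50, 10))

## Slow pattern: grow the result inside a for loop. Every rbind() call
## copies everything accumulated so far.
grow_by_append <- function(chunks) {
  out <- NULL
  for (ch in chunks) out <- rbind(out, ch)
  out
}

## Faster pattern: hand all chunks to a single rbind() call.
stack_at_once <- function(chunks) do.call("rbind", chunks)

a <- grow_by_append(chunks)
b <- stack_at_once(chunks)
```

Both give the same 1000 x 10 matrix; the difference only shows up in run time as the number of chunks grows.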

Tomas

>>rbind is vectorized so you are using it (way) suboptimally.
>
>Here's an example:
>
>> ## Create a 500 x 100 data matrix.
>> x <- matrix(rnorm(5e4), 500, 100)
>> ## Generate 50 filenames.
>> fname <- paste("f", formatC(1:50, width=2, flag="0"), ".txt", sep="")
>> ## Write the data to files 50 times.
>> for (f in fname) write(t(x), file=f, ncol=ncol(x))
>> 
>> ## Read the files into a list of data frames.
>> system.time(datList <- lapply(fname, read.table, header=FALSE),
>+              gcFirst=TRUE)
>[1] 11.91  0.05 12.33    NA    NA
>
>> ## Specify colClasses to speed up.
>> system.time(datList <- lapply(fname, read.table,
>+              colClasses=rep("numeric", 100)),
>+              gcFirst=TRUE)
>[1] 10.69  0.07 10.79    NA    NA
>
>> ## Stack them together.
>> system.time(dat <- do.call("rbind", datList), gcFirst=TRUE)
>[1] 5.34 0.09 5.45   NA   NA
>
>> ## Use matrices instead of data frames.
>> system.time(datList <- lapply(fname,
>+      function(f) matrix(scan(f), ncol=100, byrow=TRUE)), gcFirst=TRUE)
>Read 50000 items
>...
>Read 50000 items
>[1]  9.49  0.08 15.06    NA    NA
>
>> system.time(dat <- do.call("rbind", datList), gcFirst=TRUE)
>[1] 0.09 0.03 0.12   NA   NA
>
>> ## Clean up the files.
>> unlink(fname)
>
>A couple of points:
>
>- Usually specifying colClasses will make read.table() quite a bit 
>  faster, even though it's only marginally faster here.  Look back
>  in the list archive to see examples.
>
>- If your data files are all numerics (as in this example), 
>  storing them in matrices will be much more efficient.  Note
>  the difference in rbind()ing the 50 data frames and 50 
>  matrices (5.34 seconds vs. 0.09!).  rbind.data.frame()
>  needs to ensure that the resulting data frame has unique
>  rownames (a requirement for a legit data frame), and
>  that's probably taking a big chunk of the time.
>
>Andy
>
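
The two points above can be tried on a small scale with the sketch below. It assumes whitespace-separated, all-numeric files with a known number of columns; the helper read_numeric_matrix, its n_col argument, and the temporary test files are made up for illustration. No timings are shown because they depend on the machine and the R version.

```r
## Hypothetical helper: read all-numeric files with scan() and stack
## them into one matrix with a single rbind() call.
read_numeric_matrix <- function(files, n_col) {
  mats <- lapply(files, function(f)
    matrix(scan(f, quiet = TRUE), ncol = n_col, byrow = TRUE))
  do.call("rbind", mats)
}

## Small made-up data set written to three temporary files.
set.seed(1)
x <- matrix(round(rnorm(200), 3), 20, 10)
files <- replicate(3, tempfile(fileext = ".txt"))
for (f in files) write(t(x), file = f, ncolumns = ncol(x))

## Matrix route.
m <- read_numeric_matrix(files, n_col = 10)

## Data frame route with colClasses, for comparison.
dfs <- lapply(files, read.table, header = FALSE,
              colClasses = rep("numeric", 10))
d <- do.call("rbind", dfs)

## Clean up the files.
unlink(files)
```

Wrapping either route in system.time() on real data shows where the time goes on a given setup.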