[R] Incremental ReadLines

Mon Nov 2 20:36:45 CET 2009

On 11/2/2009 2:03 PM, Gene Leynes wrote:
> I've been trying to figure out how to read in a large file for a few days
> now, and after extensive research I'm still not sure what to do.
> 
> I have a large comma delimited text file that contains 59 fields in each
> record.
> There is also a header every 121 records

You can open the connection before reading, then read in blocks of lines 
and process those.  You don't need to reopen it every time.  For example,

ff <- file(fname, open="rt")  # rt is read text
for (block in 1:nblocks) {
   x <- readLines(ff, n=121)
   # process this block
}
close(ff)

Duncan Murdoch

> 
> This function works well for smallish records
> getcsv=function(fname){
>     ff=file(description = fname)
>     x <- readLines(ff)
>     closeAllConnections()
>     x <- x[x != ""]          # REMOVE BLANKS
>     x=x[grep("^[-0-9]", x)]  # REMOVE ALL TEXT
> 
>     spl=strsplit(x,',')      # THIS PART IS SLOW, BUT MANAGABLE
> 
> xx=t(sapply(1:length(spl),function(temp)as.vector(na.omit(as.numeric(spl[[temp]])))))
>     return(xx)
> }
> It's not elegant, but it works.
> For 121,000 records it completes in 2.3 seconds
> For 121,000*5 records it completes in 63 seconds
> For 121,000*10 records it doesn't complete
> 
> When I try other methods to read the file in chunks (using scan), the
> process breaks down because I have to start at the beginning of the file on
> every iteration.
> For example:
> fnn=function(n,col){
>     a=122*(n-1)+2
>     xx=scan(fname,skip=a-1,nlines=121,sep=',',quiet=TRUE,what=character(0))
>     xx=xx[xx!='']
>     xx=matrix(xx,ncol=49,byrow=TRUE)
>     xx[,col]
> }
> system.time(sapply(1:10,fnn,c=26))     # 0.31 Seconds
> system.time(sapply(91:90,fnn,c=26))    # 1.09 Seconds
> system.time(sapply(901:910,fnn,c=26))  # 5.78 Seconds
> 
> Even though I'm only getting the 26th column for 10 sets of records, it
> takes a lot longer the further into the file I go.
> 
> How can I tell scan to pick up where it left off, without it starting at the
> beginning??  There must be a good example somewhere.
> 
> I have done a lot of research (in fact, thank you to Michael J. Crawley and
> others for your help thus far)
> 
> Thanks,
> 
> Gene
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.