[R] read in large data file (tsv) with inline filter?

Thomas Lumley tlumley at u.washington.edu
Tue Mar 24 09:05:35 CET 2009


On Mon, 23 Mar 2009, David Reiss wrote:

> I have a very large tab-delimited file, too big to store in memory via
> readLines() or read.delim(). Turns out I only need a few hundred of those
> lines to be read in. If it were not so large, I could read the entire file
> in and grep() the lines I need. For such a large file, making many calls to
> read.delim() with incrementing "skip" and "nrows" parameters, followed by
> grep() calls, is very slow.

You certainly don't want to use repeated reads from the start of the file with skip=, but if you set up a file connection
    fileconnection <- file("my.tsv", open="r")
you can read from it incrementally with readLines() or read.delim() without going back to the start each time.
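For concreteness, here is a minimal sketch of that approach (the pattern "interesting" and the 10,000-line chunk size are placeholders, and I'm assuming the file has a header row you want to keep):

    con <- file("my.tsv", open = "r")
    header <- readLines(con, n = 1)      # keep the header for parsing later
    matches <- character(0)
    repeat {
        chunk <- readLines(con, n = 10000)   # read the next 10,000 lines
        if (length(chunk) == 0) break        # end of file
        matches <- c(matches, grep("interesting", chunk, value = TRUE))
    }
    close(con)
    ## parse just the matching lines as tab-delimited data
    result <- read.delim(textConnection(c(header, matches)))

Each readLines() call picks up where the previous one left off, so the file is traversed exactly once.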

The speed of this approach should be within a reasonable constant factor of anything else, since reading the file once is unavoidable and should be the bottleneck.

       -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
