[R] Reading large, non-tabular files

Gabor Grothendieck ggrothendieck at gmail.com
Wed Sep 14 15:57:07 CEST 2011


2011/9/14 Stefan McKinnon Høj-Edwards <Stefan.Hoj-Edwards at agrsci.dk>:
> Dear R-help,
>
> I have a very large ASCII data file, of which I only want to read in selected lines (e.g. one fourth of the lines); which lines to keep depends on each line's content. So far, I have found two approaches for doing this in R: 1) read the file line by line using a repeat-loop and save the result in a temporary file or a variable, and 2) read the entire file and filter/reshape it using *apply methods.
> To my understanding, repeat{}-loops are quite slow in R, and reading an entire file only to discard three quarters of the data is overkill, not to mention loading a 650 MB text file into memory.
>
> What I am looking for is a function that works like the first approach but avoids do- or repeat-loops, so I imagine it would be implemented in a lower-level language to be more efficient. Naturally, when calling the function, one would provide a function that determines if/how the line should be appended to a variable.
> Alternatively, an object working as a generator (in Python terms) could be used with the normal *apply functions. I imagine this working differently from e.g. sapply(readLines("myfile.txt"), FUN=selector), in that "readLines" would be executed first, loading the entire file into memory and supplying it to sapply, whereas the generator object only reads a line when sapply requests the next element.
>
> Are there options for this kind of operation?
>


read.csv.sql in the sqldf package can read a file and deliver just a
subset to R. The desired portion is specified in SQL, and the
entire operation can be done in a single line of code. It can handle
files too large to read into R, since only the desired portion is ever
read into R itself. See Example 13 on the sqldf home page:

http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql

and also read ?read.csv.sql .
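
As a rough illustration of the call described above, here is a minimal
sketch. The temporary file, its columns (id, grp) and the WHERE clause
are made up for the example; the real file and selection rule would take
their place. It assumes the sqldf package is installed and that the data
can be parsed as comma-separated fields without quoting.

```r
library(sqldf)

# Hypothetical stand-in for the large file: a small CSV written to a
# temporary location (column names id and grp are invented here).
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:8, grp = rep(c("a", "b", "c", "d"), 2)),
          path, row.names = FALSE, quote = FALSE)

# Within the SQL statement the input is referred to as the table "file".
# Only the rows matching the WHERE clause are returned to R.
subset_df <- read.csv.sql(path, sql = "select * from file where grp = 'a'")

# Remove the temporary file again.
unlink(path)
```

Here subset_df should hold only the rows whose grp column equals "a",
roughly one fourth of the input in this made-up case.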


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


