[R] Reading large, non-tabular files

David Winsemius dwinsemius at comcast.net
Wed Sep 14 16:00:00 CEST 2011


On Sep 14, 2011, at 7:08 AM, Stefan McKinnon Høj-Edwards wrote:

> Dear R-help,
>
> I have a very large ascii data file, of which I only want to read in  
> selected lines (e.g. one fourth of the lines); determining which
> lines depends on each line's content. So far, I have found two
> approaches for doing this in R; 1) Read the file line by line using  
> a repeat-loop and save the result in a temporary file or a variable,  
> and 2) Read the entire file and filter/reshape it using *apply  
> methods.

Better to use vectorized methods. The *apply functions are really no
faster than loops.
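
A minimal sketch of the difference, assuming a made-up rule ("keep lines
that start with DATA") and made-up sample lines, since the real file
format was not described:

```r
# Hypothetical sample lines standing in for the contents of the file.
lines <- c("DATA 1 2 3", "# comment", "DATA 4 5 6", "HEADER x", "DATA 7 8 9")

# Element-by-element: the selector is called once per line,
# which is effectively a loop written with sapply().
selector <- function(x) substr(x, 1, 4) == "DATA"
keep_loop <- sapply(lines, selector, USE.NAMES = FALSE)

# Vectorized: a single grepl() call tests the whole character vector.
keep_vec <- grepl("^DATA", lines)

# Both give the same logical index; only the vectorized form avoids
# one R-level function call per line.
selected <- lines[keep_vec]
```

If the selection rule can be written as a regular expression or a
comparison on substrings, the vectorized form is usually the one to try
first.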

> To my understanding, the use of repeat{}-loops is quite slow in R,
> and reading an entire file to discard three quarters of the data is a
> bit of an overkill. Not to mention loading a 650MB text file into
> memory.
>

People's perception of "large" may vary, and to me that is a
medium-sized file. It seems quite likely to fit in memory on most modern
computers, at least for the purpose of eliminating the undesired rows
and then writing the reduced dataset to a working file.
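
As a rough sketch of that read, filter, and write-back workflow (the
file names, the tiny stand-in file, and the "^DATA" rule are all
assumptions for illustration, not details from the original question):

```r
# Hypothetical input and output files; a small stand-in replaces the 650MB file.
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(c("DATA 1", "skip me", "DATA 2", "other"), infile)

# Read the whole file into memory as a character vector, one element per line.
all_lines <- readLines(infile)

# Keep only the wanted lines with one vectorized test (assumed rule).
wanted <- all_lines[grepl("^DATA", all_lines)]

# Write the reduced dataset to a working file for later use.
writeLines(wanted, outfile)

# Drop the full vector so its memory can be reclaimed.
rm(all_lines)
```

Later sessions can then work from the smaller file and never touch the
original again.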

> What I am looking for is a function, that works like the first  
> approach, but avoiding do- or repeat-loops, so I imagine it is  
> implemented in a lower-level language, to be more efficient.  
> Naturally, when calling the function, one would provide a function  
> that determines if/how the line should be appended to a variable.
> Alternatively, an object working as a generator (in Python terms),
> could be used with the normal *apply functions. I imagine this  
> working differently from e.g. sapply(readLines("myfile.txt"),  
> FUN=selector), in that "readLines" would be executed first, loading  
> the entire file into memory and supplying it to sapply, whereas the  
> generator-object only reads a line when sapply requests the next  
> element.
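
One way to approximate the generator idea in base R is to open a
connection and call readLines() on it repeatedly with a fixed n, so that
only one chunk is in memory at a time. The sketch below is only that, a
sketch: read_selected() is a hypothetical helper, and the chunk size,
demo file, and "^DATA" rule are assumptions. It still uses repeat, but
the loop runs once per chunk rather than once per line.

```r
# Hypothetical helper: read `path` in chunks of `chunk_size` lines and keep
# only the lines for which `selector` (a vectorized predicate) returns TRUE.
read_selected <- function(path, selector, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))              # close the connection even on error
  kept <- list()
  repeat {
    # An open connection remembers its position, so each call
    # continues where the previous one stopped.
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break # end of file reached
    kept[[length(kept) + 1L]] <- chunk[selector(chunk)]
  }
  unlist(kept, use.names = FALSE)
}

# Demo on a small made-up file where every fourth line is wanted.
demo_file <- tempfile()
writeLines(sprintf("%s %d", rep(c("DATA", "skip", "skip", "skip"), 5), 1:20),
           demo_file)
result <- read_selected(demo_file, function(x) grepl("^DATA", x),
                        chunk_size = 3L)
```

A larger chunk_size means fewer iterations at the cost of more memory
per chunk; the selector stays vectorized within each chunk.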

There are database interfaces to R. You have told us nothing about  
your OS or hardware so it's a bit difficult to match recommendations  
to your specific situation.

>
> Are there options for this kind of operation?


Many ... once details are provided. This message arrived with useful
guidance:
----------
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
------------

David.

