[R] Reading large, non-tabular files

Rainer Schuermann rainer.schuermann at gmx.net
Wed Sep 14 16:06:48 CEST 2011


That looks like a perfect job for (g)awk, which ships with every Linux
distribution and is also available for Windows.
It can be called with something like

system( "awk -f script.awk inputfile.txt" )

and does its job silently and very fast. 650MB should not be an issue. I'm not 
proficient in awk but would offer my help anyway (off-list...).
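
A minimal sketch of that idea, assuming a Unix-like shell with awk on the PATH. The selection rule (keep lines starting with "DATA") and the small input file are made up for illustration. Reading through pipe() instead of system() means only the lines awk prints ever reach R:

```r
# Hypothetical stand-in for the large input file.
infile <- tempfile(fileext = ".txt")
writeLines(c("# header", "DATA 1 2 3", "junk", "DATA 4 5 6"), infile)

# Assumed selection rule: keep only lines that start with "DATA".
# shQuote() protects the file path from the shell.
cmd <- sprintf("awk '/^DATA/' %s", shQuote(infile))

# pipe() runs the command and exposes its output as a connection,
# so R reads the filtered lines and never sees the rest of the file.
con <- pipe(cmd, open = "r")
selected <- readLines(con)
close(con)

print(selected)
```

On Windows the single quotes around the awk program would probably need adjusting for the command interpreter in use.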

Rgds,
Rainer


On Wednesday 14 September 2011 13:08:14 Stefan McKinnon Høj-Edwards wrote:
> Dear R-help,
> 
> I have a very large ASCII data file, of which I only want to read in
> selected lines (e.g. one fourth of the lines); which lines to keep
> depends on each line's content. So far, I have found two approaches for doing
> this in R: 1) Read the file line by line using a repeat-loop and save the
> result in a temporary file or a variable, and 2) Read the entire file and
> filter/reshape it using *apply methods. To my understanding, the use of
> repeat{}-loops is quite slow in R, and reading an entire file to discard 3
> quarters of the data is a bit of an overkill. Not to mention loading a
> 650MB text file into memory.
> 
> What I am looking for is a function that works like the first approach but
> avoids do- or repeat-loops, so I imagine it is implemented in a
> lower-level language to be more efficient. Naturally, when calling the
> function, one would provide a function that determines if/how the line
> should be appended to a variable. Alternatively, an object working as a
> generator (in Python terms) could be used with the normal *apply
> functions. I imagine this working differently from e.g.
> sapply(readLines("myfile.txt"), FUN=selector), in that "readLines" would be
> executed first, loading the entire file into memory and supplying it to
> sapply, whereas the generator-object only reads a line when sapply requests
> the next element.
> 
> Are there options for this kind of operation?
> 
> Kind regards,
> 
> Stefan McKinnon Høj-Edwards     Dept. of Genetics and Biotechnology
> PhD student                     Faculty of Agricultural Sciences
> stefan.hoj-edwards at agrsci.dk    Aarhus University
> Tel.: +45 8999 1291             Blichers Allé 20, Postboks 50
> Web: www.iysik.com              DK-8830 Tjele
>                                 Tel.: +45 8999 1900
>                                 Web: www.agrsci.au.dk

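The generator-style reading asked about in the quoted message can be approximated in plain R with an open connection: each readLines(con, n = ...) call continues where the previous one stopped, so only one chunk is in memory at a time. The sketch below is an illustration only. The helper read_selected(), the selector rule and the chunk size are hypothetical names and values, not something from the thread. It still uses a repeat loop, but the loop runs once per chunk rather than once per line, and the selector is applied to a whole chunk at a time.

```r
# Hypothetical selector: returns TRUE for each line that should be kept.
selector <- function(lines) grepl("^DATA", lines)

# Read `path` chunk by chunk and keep only the lines the selector accepts.
read_selected <- function(path, selector, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))                  # close the file even on error
  kept <- list()
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break     # nothing left to read
    kept[[length(kept) + 1L]] <- chunk[selector(chunk)]
  }
  # Note: returns NULL if the file is empty.
  unlist(kept, use.names = FALSE)
}

# Small made-up file standing in for the 650MB input.
infile <- tempfile(fileext = ".txt")
writeLines(c("# header", "DATA 1 2 3", "junk", "DATA 4 5 6", "DATA 7"), infile)

result <- read_selected(infile, selector, chunk_size = 2L)
print(result)
```

A larger chunk_size means fewer loop iterations at the cost of more memory per chunk; the default of 10000 lines is an arbitrary starting point.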