[R] Reading large, non-tabular files

jim holtman jholtman at gmail.com
Wed Sep 14 15:39:06 CEST 2011


What is overkill about reading in a 650MB text file if you have the
memory?  You are going to have to process it one way or another.  I
would use 'readLines' to read it in, 'grepl' to determine which lines
I want to keep, drop the rest, and then write the new file out.  At
that point I can probably use 'read.table' to process the new file.
This works pretty fast if you can apply pattern matching to determine
which lines you want to keep.
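
A minimal sketch of that sequence, for illustration only.  The file
names, the sample data, the column names, and the pattern "^keep " are
all made up here; substitute whatever criterion identifies the lines
you want.

```r
# Hypothetical input: a tiny stand-in for the large file, in which the
# wanted lines start with "keep" (an assumption for this example).
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(c("# header junk",
             "keep 1 2.5",
             "skip 9 9.9",
             "keep 2 3.5",
             "trailer"), infile)

lines  <- readLines(infile)          # read the whole file into memory
wanted <- grepl("^keep ", lines)     # TRUE for each line to keep
writeLines(lines[wanted], outfile)   # write only the selected lines

# The filtered file is now regular enough for read.table.
dat <- read.table(outfile, header = FALSE,
                  col.names = c("tag", "id", "value"))
```

Since 'grepl' is vectorized over the whole character vector, no
explicit loop over lines is needed.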

If you don't have the memory to read in the whole file, then set up a
loop and read in whatever amount makes sense (e.g., 100MB at a time),
and do the processing above with the output file opened before the
loop starts so that you keep appending to it.
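
A rough sketch of such a loop, under the same made-up assumptions as
before (temporary files, a "^keep " pattern).  Note that 'readLines'
counts lines rather than bytes, so the chunk size here is a number of
lines; the value 30 only suits this toy input, and something far
larger would be chosen for a real file.

```r
# Hypothetical input: 100 lines alternating between "keep" and "skip".
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(sprintf("%s %d", rep(c("keep", "skip"), 50), 1:100), infile)

chunk_size <- 30                      # lines per read (toy value)

input  <- file(infile,  open = "r")   # open once so reads continue
output <- file(outfile, open = "w")   # open once so writes accumulate
repeat {
  lines <- readLines(input, n = chunk_size)
  if (length(lines) == 0) break       # nothing left to read
  writeLines(lines[grepl("^keep ", lines)], output)
}
close(input)
close(output)

kept <- readLines(outfile)            # the filtered result
```

The loop body runs once per chunk, not once per line, so the overhead
of 'repeat' is small compared with the vectorized work on each chunk.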

You probably need to state what type of criteria you would be applying
to the lines to determine if you want to keep them.

You can also use perl, sed, awk, etc. to do the processing.

2011/9/14 Stefan McKinnon Høj-Edwards <Stefan.Hoj-Edwards at agrsci.dk>:
> Dear R-help,
>
> I have a very large ascii data file, of which I only want to read in selected lines (e.g. one fourth of the lines); determining which lines depends on each line's content. So far, I have found two approaches for doing this in R: 1) Read the file line by line using a repeat-loop and save the result in a temporary file or a variable, and 2) Read the entire file and filter/reshape it using *apply methods.
> To my understanding, repeat{}-loops are quite slow in R, and reading an entire file to discard three quarters of the data is a bit of an overkill. Not to mention loading a 650MB text file into memory.
>
> What I am looking for is a function, that works like the first approach, but avoiding do- or repeat-loops, so I imagine it is implemented in a lower-level language, to be more efficient. Naturally, when calling the function, one would provide a function that determines if/how the line should be appended to a variable.
> Alternatively, an object working as a generator (in Python terms) could be used with the normal *apply functions. I imagine this working differently from e.g. sapply(readLines("myfile.txt"), FUN=selector), in that "readLines" would be executed first, loading the entire file into memory and supplying it to sapply, whereas the generator object only reads a line when sapply requests the next element.
>
> Are there options for this kind of operation?
>
> Kind regards,
>
> Stefan McKinnon Høj-Edwards     Dept. of Genetics and Biotechnology
> PhD student                     Faculty of Agricultural Sciences
> stefan.hoj-edwards at agrsci.dk    Aarhus University
> Tel.: +45 8999 1291             Blichers Allé 20, Postboks 50
> Web: www.iysik.com              DK-8830 Tjele
>                                Tel.: +45 8999 1900
>                                Web: www.agrsci.au.dk



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?


