[R] Reading large, non-tabular files

jim holtman jholtman at gmail.com
Wed Sep 14 17:35:16 CEST 2011


Here is how long it might take to do in R.  I created a 558MB file,
read it in, found the lines that contained '76095', and then wrote
those out:

> system.time(x <- readLines('tempyy'))  # read in the 558MB file
   user  system elapsed
  65.91    0.82   67.40
> object.size(x)
63348864 bytes
> str(x)  # 14M lines of data
 chr [1:14276304] "\"Locationname\",\"N Units\",\"Wskusku\"" ...
> system.time(indx <- grepl("76095", x))  # grep for the criteria
   user  system elapsed
  10.78    0.02   11.46
> system.time(writeLines(x[indx], 'tempzz'))  # write the 1152 matching lines
   user  system elapsed
   0.13    0.03    0.23
> sum(indx)
[1] 1152
>
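
For comparison, here is a minimal sketch of doing the same filter in
chunks through a file connection, so the whole file never has to sit
in memory at once (the file names and chunk size are placeholders):

con <- file("tempyy", open = "r")          # input file (placeholder name)
out <- file("tempzz_chunked", open = "w")  # output file (placeholder name)
repeat {
  chunk <- readLines(con, n = 100000)      # read 100,000 lines at a time
  if (length(chunk) == 0) break            # stop at end of file
  writeLines(chunk[grepl("76095", chunk)], out)  # keep only matching lines
}
close(con)
close(out)

This trades a little extra run time (several readLines calls instead of
one) for a much smaller memory footprint, which is closer to what the
original question asked for.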

On Wed, Sep 14, 2011 at 10:06 AM, Rainer Schuermann
<rainer.schuermann at gmx.net> wrote:
> That looks like a perfect job for (g)awk, which is included in every Linux
> distribution and is also available for Windows.
> It can be called with something like
>
> system( "awk -f script.awk inputfile.txt" )
>
> and does its job silently and very fast. 650MB should not be an issue. I'm not
> proficient in awk but would offer my help anyway (off-list...).
>
> Rgds,
> Rainer
>
>
> On Wednesday 14 September 2011 13:08:14 Stefan McKinnon Høj-Edwards wrote:
>> Dear R-help,
>>
>> I have a very large ASCII data file, of which I only want to read in
>> selected lines (e.g. one fourth of the lines); which lines to keep
>> depends on their content. So far, I have found two approaches for doing
>> this in R: 1) read the file line by line using a repeat-loop and save the
>> result in a temporary file or a variable, and 2) read the entire file and
>> filter/reshape it using *apply methods. To my understanding,
>> repeat{}-loops are quite slow in R, and reading an entire file only to
>> discard three quarters of the data is a bit of overkill, not to mention
>> loading a 650MB text file into memory.
>>
>> What I am looking for is a function that works like the first approach
>> but avoids do- or repeat-loops, so I imagine it being implemented in a
>> lower-level language to be more efficient. Naturally, when calling the
>> function, one would supply a function that determines if/how each line
>> should be appended to a variable. Alternatively, an object working as a
>> generator (in Python terms) could be used with the normal *apply
>> functions. I imagine this working differently from e.g.
>> sapply(readLines("myfile.txt"), FUN=selector), in that "readLines" would
>> be executed first, loading the entire file into memory before supplying
>> it to sapply, whereas the generator object would only read a line when
>> sapply requests the next element.
>>
>> Are there options for this kind of operation?
>>
>> Kind regards,
>>
>> Stefan McKinnon Høj-Edwards     Dept. of Genetics and Biotechnology
>> PhD student                     Faculty of Agricultural Sciences
>> stefan.hoj-edwards at agrsci.dk    Aarhus University
>> Tel.: +45 8999 1291             Blichers Allé 20, Postboks 50
>> Web: www.iysik.com              DK-8830 Tjele
>>                                 Tel.: +45 8999 1900
>>                                 Web: www.agrsci.au.dk
>>
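
As a rough illustration of Rainer's awk suggestion above, the filter
could also be pushed out to awk and called from R. The pattern and
file names below are placeholders, and the quoting shown is for a
Unix-style shell:

# print every line containing 76095 into a new file
system("awk '/76095/' inputfile.txt > matching_lines.txt")

# or capture the matching lines directly into an R character vector
matches <- system("awk '/76095/' inputfile.txt", intern = TRUE)

Because awk streams the file a line at a time, it never needs to hold
the whole 650MB in memory either.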



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?


