[R] reading very large files

Marc Schwartz marc_schwartz at comcast.net
Fri Feb 2 19:42:03 CET 2007


On Fri, 2007-02-02 at 12:32 -0600, Marc Schwartz wrote:
> On Fri, 2007-02-02 at 18:40 +0100, juli g. pausas wrote:
> > Hi all,
> > I have a large file (1.8 GB) with 900,000 lines that I would like to read.
> > Each line is a character string. Specifically, I would like to randomly
> > select 3000 lines. For smaller files, what I'm doing is:
> > 
> > trs <- scan("myfile", what = character(), sep = "\n")
> > trs <- trs[sample(length(trs), 3000)]
> > 
> > And this works OK; however, my computer does not seem able to handle the
> > 1.8 GB file.
> > I thought of an alternative way that does not require reading the whole file:
> > 
> > sel <- sample(1:900000, 3000)
> > for (i in 1:3000)  {
> > un <- scan("myfile", what = character(), sep = "\n", skip = sel[i] - 1, nlines = 1)
> > write(un, "myfile_short", append=TRUE)
> > }
> > 
> > This works on my computer; however, it is extremely slow, since it reads one
> > line at a time. It has been running for 25 hours and I think it has done less
> > than half of the file (yes, I probably do not have a very good computer and
> > I'm working under Windows ...).
> > So my question is: do you know any other faster way to do this?
> > Thanks in advance
> > 
> > Juli
> 
> 
> Juli,
> 
> I don't have a file to test this on, so caveat emptor.
> 
> The problem with the approach above is that you are re-reading the
> source file once per line, or 3000 times.  In addition, each read is
> likely going through half the file on average to locate the randomly
> selected line. Thus, the reality is that you are probably reading on the
> order of:
> 
> > 3000 * 450000
> [1] 1.35e+09
> 
> lines in the file, which of course is going to be quite slow.
> 
> In addition, you are also writing to the target file 3000 times.
> 
> The basic premise of the approach below is that you are, in effect,
> creating a sequential file cache in an R object: read large chunks of
> the source file into the cache, pick out the randomly selected rows
> that fall within the cache, and then write out just those rows.
> 
> Thus, if you can read 100,000 rows at once, you would have 9 reads of
> the source file, and 9 writes of the target file.
> 
> The key thing here is to ensure that the offsets within the cache and
> the corresponding random row values are properly set.
> 
> Here's the code:
> 
> # Generate the random values
> sel <- sample(1:900000, 3000)
> 
> # Set up a sequence for the cache chunks
> # Presume you can read 100,000 rows at once
> Cuts <- seq(0, 900000, 100000)
> 
> # Loop over the length of Cuts, less 1
> for (i in seq(along = Cuts[-1]))
> {
>   # Get a 100,000 row chunk, skipping rows
>   # as appropriate for each subsequent chunk
>   Chunk <- scan("myfile", what = character(), sep = "\n", 
>                  skip = Cuts[i], nlines = 100000)
> 
>   # set up a row sequence for the current 
>   # chunk
>   Rows <- (Cuts[i] + 1):(Cuts[i + 1])
> 
>   # Are any of the random values in the 
>   # current chunk?
>   Chunk.Sel <- sel[which(sel %in% Rows)]
> 
>   # If so, get them 
>   if (length(Chunk.Sel) > 0)
>   {
>     Write.Rows <- Chunk[sel - Cuts[i]]


Quick typo correction:

The last line above should index the chunk (a character vector) by the
chunk-relative positions of the selected rows, i.e.:

      Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]

For example, with Cuts[i] = 200000, a selected row number of 234567
corresponds to element 34567 of Chunk.


>     # Now write them out
>     write(Write.Rows, "myfile_short", append = TRUE)
>   }
> }
> 
> 
> As noted, I have not tested this, and there may yet be additional ways
> to save time with file seeks, etc.

If that's the only error in the code...   :-)
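
For what it's worth, here is one more untested sketch along the same
lines. It keeps a single file connection open, so each chunk is read
from wherever the previous read stopped, rather than re-scanning and
skipping lines from the top of the file on every pass. It uses
readLines() in place of scan(); the file names are the same
placeholders as above.

# Untested sketch: open the file once and keep the connection open,
# so each readLines() call continues from the previous position
sel <- sample(1:900000, 3000)
Cuts <- seq(0, 900000, 100000)

con <- file("myfile", open = "r")

for (i in seq(along = Cuts[-1]))
{
  # Read the next 100,000 lines from the current position in the file
  Chunk <- readLines(con, n = 100000)

  # Absolute row numbers covered by this chunk
  Rows <- (Cuts[i] + 1):(Cuts[i + 1])

  # Which of the randomly selected rows fall in this chunk?
  Chunk.Sel <- sel[which(sel %in% Rows)]

  # If any do, index the chunk by the chunk-relative positions
  # and append them to the target file
  if (length(Chunk.Sel) > 0)
  {
    Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]
    write(Write.Rows, "myfile_short", append = TRUE)
  }
}

close(con)

The offset arithmetic is unchanged; the only difference is that the
skip = Cuts[i] argument is no longer needed, since the connection
remembers where the last read ended.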

Marc


