[R] reading very large files

Marc Schwartz marc_schwartz at comcast.net
Fri Feb 2 20:04:04 CET 2007


On Fri, 2007-02-02 at 12:42 -0600, Marc Schwartz wrote:
> On Fri, 2007-02-02 at 12:32 -0600, Marc Schwartz wrote:

> > Juli,
> > 
> > I don't have a file to test this on, so caveat emptor.
> > 
> > The problem with the approach above is that you are re-reading the
> > source file once per line, or 3000 times.  In addition, each read
> > likely goes through half the file on average to locate the randomly
> > selected line. Thus, you are probably reading on the order of:
> > 
> > > 3000 * 450000
> > [1] 1.35e+09
> > 
> > lines from the file, which of course is going to be quite slow.
> > 
> > In addition, you are also writing to the target file 3000 times.
> > 
> > The basic premise of the approach below is that you are, in effect,
> > creating a sequential file cache in an R object: read a large chunk of
> > the source file into the cache, pick out the randomly selected rows
> > that fall within it, and write those rows out.
> > 
> > Thus, if you can read 100,000 rows at once, you would have 9 reads of
> > the source file and at most 9 writes of the target file.
> > 
> > The key thing here is to ensure that the offsets within the cache and
> > the corresponding random row values are properly set.
> > 
> > Here's the code:
> > 
> > # Generate the random values
> > sel <- sample(1:900000, 3000)
> > 
> > # Set up a sequence for the cache chunks
> > # Presume you can read 100,000 rows at once
> > Cuts <- seq(0, 900000, 100000)
> > 
> > # Loop over the length of Cuts, less 1
> > for (i in seq(along = Cuts[-1]))
> > {
> >   # Get a 100,000 row chunk, skipping rows
> >   # as appropriate for each subsequent chunk
> >   Chunk <- scan("myfile", what = character(), sep = "\n", 
> >                  skip = Cuts[i], nlines = 100000)
> > 
> >   # set up a row sequence for the current 
> >   # chunk
> >   Rows <- (Cuts[i] + 1):(Cuts[i + 1])
> > 
> >   # Are any of the random values in the 
> >   # current chunk?
> >   Chunk.Sel <- sel[which(sel %in% Rows)]
> > 
> >   # If so, get them 
> >   if (length(Chunk.Sel) > 0)
> >   {
> >     Write.Rows <- Chunk[sel - Cuts[i]]
> 
> 
> Quick typo correction:
> 
> The last line above should be:
> 
>       Write.Rows <- Chunk[sel - Cuts[i], ]
> 
> 
> >     # Now write them out
> >     write(Write.Rows, "myfile_short", append = TRUE)
> >   }
> > }
> > 

OK, I knew it was too good to be true...

One more correction on that same line: the subset should be based on
Chunk.Sel, and since Chunk is a character vector (not a data frame),
there should be no trailing comma in the index:

   Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]
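
To make the offset arithmetic concrete (a made-up example, not part of
the code): if file row 250123 were among the selected rows, it would fall
in the third chunk, which is read with skip = Cuts[3] = 200000, so its
position within Chunk would be

   250123 - Cuts[3]   # element 50123 of Chunk

which is exactly what Chunk.Sel - Cuts[i] computes for each selected row.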


For clarity, here is the full set of code:

# Generate the random values
sel <- sample(900000, 3000)

# Set up a sequence for the cache chunks
# Presume you can read 100,000 rows at once
Cuts <- seq(0, 900000, 100000)

# Loop over the length of Cuts, less 1
for (i in seq(along = Cuts[-1]))
{
  # Get a 100,000 row chunk, skipping rows
  # as appropriate for each subsequent chunk
  Chunk <- scan("myfile", what = character(), sep = "\n", 
                 skip = Cuts[i], nlines = 100000)

  # set up a row sequence for the current 
  # chunk
  Rows <- (Cuts[i] + 1):(Cuts[i + 1])

  # Are any of the random values in the 
  # current chunk?
  Chunk.Sel <- sel[which(sel %in% Rows)]

  # If so, get them 
  if (length(Chunk.Sel) > 0)
  {
    Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]

    # Now write them out
    write(Write.Rows, "myfile_short", append = TRUE)
  }
}
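
If the repeated skips turn out to be slow (each scan() call still has to
read through the Cuts[i] skipped lines before it reaches its chunk), a
variant using a file connection avoids that, since each readLines() call
picks up where the previous one stopped. This is just an untested sketch
along the same lines, reusing the same sel and Cuts as above:

con <- file("myfile", open = "r")
out <- file("myfile_short", open = "w")

for (i in seq(along = Cuts[-1]))
{
  # Read the next 100,000 rows from where the last read ended
  Chunk <- readLines(con, n = 100000)

  # File row numbers covered by this chunk
  Rows <- (Cuts[i] + 1):(Cuts[i] + length(Chunk))

  # Random rows falling within this chunk
  Chunk.Sel <- sel[sel %in% Rows]

  if (length(Chunk.Sel) > 0)
    writeLines(Chunk[Chunk.Sel - Cuts[i]], out)
}

close(con)
close(out)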


Regards,

Marc


