[R] reading very large files

Henrik Bengtsson hb at stat.berkeley.edu
Fri Feb 2 19:22:41 CET 2007


Forgot to say: in your script you're reading the rows unordered, meaning
you're jumping around in the file, and there is no way the hardware or
the file-caching system can optimize that.  I'm pretty sure you would
see a substantial speedup if you did:

sel <- sort(sel);

/H
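
A minimal sketch of that sorted, single-pass idea (not from the original
post; the file name and the sizes 900000/3000 are taken from the question
below, and the single open connection is an assumption of mine): with the
sample sorted, one connection can be kept open and each iteration only
skips forward by the gap to the next sampled line, so the file is
traversed once instead of being rescanned from the start for every line.

pathname <- "myfile"               # hypothetical file name
sel <- sort(sample(900000, 3000))  # sampled line numbers, increasing

con <- file(pathname, open = "r")
out <- character(length(sel))
prev <- 0
for (i in seq_along(sel)) {
  # skip the lines between the previous sampled line and this one,
  # then read exactly one line from the current position
  out[i] <- scan(con, what = character(), sep = "\n",
                 skip = sel[i] - prev - 1, nlines = 1, quiet = TRUE)
  prev <- sel[i]
}
close(con)
writeLines(out, "myfile_short")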

On 2/2/07, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
> Hi.
>
> General idea:
>
> 1. Open your file as a connection, i.e. con <- file(pathname, open="r")
>
> 2. Generate a "row to (file offset, row length)" map of your text file,
> i.e. numeric vectors 'fileOffsets' and 'rowLengths'.  Use readBin()
> for this.  You build this up as you go by reading the file in chunks,
> which means you can handle files of any size.  You can store this
> lookup map to a file for your future R sessions.
>
> 3. Sample a set of rows r = (r1, r2, ..., rR), i.e. rows =
> sample(length(fileOffsets), R).
>
> 4. Look up the file offsets and row lengths for these rows, i.e.
> offsets = fileOffsets[rows].  lengths = rowLengths[rows].
>
> 5. In case your subset of rows is not ordered, it is wise to order
> them first to speed things up.  If order is important, keep track of
> the ordering and re-order them at the end.
>
> 6. For each row r, use seek(con=con, where=offsets[r]) to jump to the
> start of the row.  Use readBin(..., n=lengths[r]) to read the data.
>
> 7. Repeat from (3).
>
> /Henrik
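
A minimal sketch of steps (1)-(7) above (a sketch under stated
assumptions, not Henrik's code: the file name, chunk size, and sample
size are placeholders, and it assumes plain LF line endings with a final
newline on the last row):

pathname <- "myfile"   # hypothetical file name
chunkSize <- 1e6       # bytes per readBin() chunk

## (1)-(2): read the file in chunks and record the byte position of every
## newline, from which each row's start offset and length follow.
con <- file(pathname, open = "rb")
offset <- 0
newlinePos <- numeric(0)
repeat {
  bytes <- readBin(con, what = "raw", n = chunkSize)
  if (length(bytes) == 0) break
  newlinePos <- c(newlinePos, offset + which(bytes == as.raw(10L)))
  offset <- offset + length(bytes)
}
close(con)
fileOffsets <- c(0, newlinePos[-length(newlinePos)])  # 0-based start of each row
rowLengths  <- newlinePos - fileOffsets               # row length incl. newline
## The map could be stored for later sessions, e.g.
## save(fileOffsets, rowLengths, file = "myfile.map.RData")

## (3)-(5): sample rows and sort them so the file is read in offset order.
rows <- sort(sample(length(fileOffsets), 3000))
offsets <- fileOffsets[rows]
lengths <- rowLengths[rows]

## (6): seek to each sampled row and read it (dropping the trailing newline).
con <- file(pathname, open = "rb")
sampled <- character(length(rows))
for (i in seq_along(rows)) {
  seek(con, where = offsets[i])
  sampled[i] <- rawToChar(readBin(con, what = "raw", n = lengths[i] - 1))
}
close(con)
writeLines(sampled, "myfile_short")

Note that seek() needs the connection opened in binary mode ("rb"), and
R's own documentation warns that file positioning with seek() can be
unreliable on Windows, so the sorted single-pass approach above may be
the safer route there.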
>
> On 2/2/07, juli g. pausas <pausas at gmail.com> wrote:
> > Hi all,
> > I have a large file (1.8 GB) with 900,000 lines that I would like to read.
> > Each line is a string of characters.  Specifically, I would like to randomly
> > select 3000 lines.  For smaller files, what I'm doing is:
> >
> > trs <- scan("myfile", what= character(), sep = "\n")
> > trs<- trs[sample(length(trs), 3000)]
> >
> > And this works OK; however, my computer seems not able to handle the 1.8 GB
> > file.
> > I thought of an alternative way that does not require reading the whole file:
> >
> > sel <- sample(1:900000, 3000)
> > for (i in 1:3000) {
> >   un <- scan("myfile", what = character(), sep = "\n", skip = sel[i], nlines = 1)
> >   write(un, "myfile_short", append = TRUE)
> > }
> >
> > This works on my computer; however, it is extremely slow: it reads one line
> > at a time, rescanning the file from the start on every iteration.  It has been
> > running for 25 hours and I think it has done less than half of the file (yes,
> > I probably do not have a very good computer, and I'm working under
> > Windows ...).
> > So my question is: do you know any other faster way to do this?
> > Thanks in advance
> >
> > Juli
> >
> > --
> > http://www.ceam.es/pausas
> >
>


