[R] R tools for large files

Wed Aug 27 03:03:39 CEST 2003

Duncan Murdoch <dmurdoch at pair.com> wrote:
	For example, if you want to read lines 1000 through 1100, you'd do it
	like this:

	 lines <- readLines("foo.txt", 1100)[1000:1100]

I created a dataset thus:
# file foo.awk:
BEGIN {
    s = "01"
    for (i = 2; i <= 41; i++) s = sprintf("%s %02d", s, i)
    n = (27 * 1024 * 1024) / (length(s) + 1)
    for (i = 1; i <= n; i++) print s
    exit 0
}
# shell command:
mawk -f foo.awk /dev/null >BIG

That is, each record contains 41 two-digit integers, and the number
of records was chosen so that the total size was approximately
27 MiB (27 * 1024 * 1024 bytes).  The number of records turns out to
be 230,175.
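
The record count can be checked from R with the same arithmetic that foo.awk uses. This is only a sketch that rebuilds one record and divides; the variable names are made up for the illustration.

```r
# Rebuild one record exactly as foo.awk does: "01 02 ... 41".
s <- paste(sprintf("%02d", 1:41), collapse = " ")

# Bytes per record, counting the trailing newline that print adds.
record_bytes <- nchar(s) + 1

# The awk loop `for (i = 1; i <= n; i++)` with a fractional n
# runs floor(n) times.
n_records <- floor(27 * 1024 * 1024 / record_bytes)

record_bytes   # 123
n_records      # 230175
```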

> system.time(v <- readLines("BIG"))
[1] 7.75 0.17 8.13 0.00 0.00
	# With BIG already in the file system cache...
> system.time(v <- readLines("BIG", 200000)[199001:200000])
[1] 11.73  0.16 12.27  0.00  0.00

What's the importance of this?
First, experiments I shall not weary you with showed that the
time to read N lines grows faster than N.
Second, if you want to select the _last_ thousand lines,
you have to read _all_ of them into memory.

For real efficiency here, what's wanted is a variant of readLines
where n is an index vector (a vector of non-negative integers,
a vector of non-positive integers, or a vector of logicals) saying
which lines should be kept.

The function that would need changing is do_readLines() in
src/main/connections.c; unfortunately, I don't understand R internals
well enough to do it myself (yet).
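
Short of changing do_readLines(), something similar can be approximated at the R level. The following is a minimal sketch, not an existing R function: readSelectedLines() and its arguments keep and chunk are hypothetical names, and it handles only positive line numbers (not negative or logical indices). It reads from an open connection a chunk at a time and keeps only the wanted lines, so memory use is bounded by the chunk size plus the result.

```r
# Hypothetical helper (not part of R): return the lines of `path` whose
# 1-based line numbers appear in `keep`, reading `chunk` lines at a time so
# that only one chunk is held in memory at once.
readSelectedLines <- function(path, keep, chunk = 10000L) {
  keep <- sort(unique(as.integer(keep)))
  stopifnot(length(keep) > 0, keep[1] >= 1)
  con <- file(path, open = "r")
  on.exit(close(con))
  out <- vector("list", 0)
  seen <- 0L                      # number of lines consumed so far
  last <- keep[length(keep)]      # no need to read past this line
  while (seen < last) {
    block <- readLines(con, n = min(chunk, last - seen))
    if (length(block) == 0) break # end of file reached early
    # Positions within this block of the lines we want to keep.
    idx <- keep[keep > seen & keep <= seen + length(block)] - seen
    out[[length(out) + 1L]] <- block[idx]
    seen <- seen + length(block)
  }
  as.character(unlist(out))
}

# Usage: lines 199001..200000 of BIG, assuming BIG exists.
# v <- readSelectedLines("BIG", 199001:200000)
```

This avoids keeping the unwanted lines, but every line up to the last one wanted is still read and turned into an R string before being discarded.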

As a matter of fact, that _still_ wouldn't yield real efficiency,
because every character would still have to be read by the modified
readLines(), and it reads characters using Rconn_fgetc(), which is
what gives readLines() its power and utility, but certainly doesn't
give it wings.  (One of the fundamental laws of efficient I/O library
design is to base it on block- or line-at-a-time transfers, not
character-at-a-time.)

The AWK program
    NR <= 199000 { next }
    {print}
    NR == 200000 { exit }
extracts lines 199001:200000 in just 0.76 seconds, about 15 times
faster than the R code.  A C program to the same effect, using fgets(),
took 0.39 seconds, or about 30 times faster than R.

There are two fairly clear sources of overhead in the R code:
(1) the overhead of reading characters one at a time through Rconn_fgetc()
    instead of a block or line at a time.  mawk doesn't use fgets() for
    reading, and _does_ have the overhead of repeatedly checking a
    regular expression to determine where the end of the line is,
    which it is sensible enough to fast-path.
(2) the overhead of allocating, filling in, and keeping, a whole lot of
    memory which is of no use whatever in computing the final result.
    mawk is actually fairly careful here, and only keeps one line at
    a time in the program shown above.  Let's change it:
	NR <= 199000 {next}
	{a[NR] = $0}
	NR == 200000 {exit}
	END {for (i in a) print a[i]}
    That takes the time from 0.76 seconds to 0.80 seconds.

The simplest thing that could possibly work would be to add a function
skipLines(con, n) which simply read and discarded n lines.
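
No such function exists in R, but its behaviour can be sketched at the R level. The version below is an assumption about how skipLines() might look, with a made-up chunk argument. It still goes through readLines(), so it saves memory rather than per-character reading time.

```r
# Hypothetical skipLines(): read and discard `n` lines from an already-open
# connection, `chunk` lines at a time, and return how many were skipped.
skipLines <- function(con, n, chunk = 10000L) {
  stopifnot(isOpen(con))
  skipped <- 0L
  while (skipped < n) {
    got <- length(readLines(con, n = min(chunk, n - skipped)))
    if (got == 0L) break          # fewer than n lines in the input
    skipped <- skipped + got
  }
  invisible(skipped)
}

# Usage on a small temporary file standing in for BIG.
tmp <- tempfile()
writeLines(sprintf("record %d", 1:2000), tmp)
con <- file(tmp, open = "r")
skipLines(con, 1990)
wanted <- readLines(con, n = 5)   # lines 1991..1995
close(con)
wanted
```

Because the connection is opened explicitly, the readLines() call after skipLines() continues from the current position instead of starting again at the top of the file.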

	 result <- scan(textConnection(lines), list( .... ))

> system.time(m <- scan(textConnection(v), integer(41)))
Read 41000 items
[1] 0.99 0.00 1.01 0.00 0.00

One whole second to read 41,000 numbers on a 500 MHz machine?

> vv <- rep(v, 240)

Is there any possibility of storing the data in (platform) binary form?
Binary connections (R-data.pdf, section 6.5 "Binary connections") can be
used to read binary-encoded data.

I wrote a little C program to save out the 230175 records of 41 integers
each in native binary form.  Then in R I did

> system.time(m <- readBin("BIN", integer(), n=230175*41, size=4))
[1] 0.57 0.52 1.11 0.00 0.00
> system.time(m <- matrix(data=m, ncol=41, byrow=TRUE))
[1] 2.55 0.34 2.95 0.00 0.00
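
The C program is not shown here, so as a stand-in the round trip can be sketched entirely in R with writeBin() and readBin(). The sketch assumes the file holds the records one after another (row-major), 4 bytes per integer in native byte order; the small 5 x 41 matrix and the variable names are invented for the illustration.

```r
# Small stand-in for the 230175 x 41 data set: 5 records of 41 integers.
n_rec <- 5L
n_col <- 41L
x <- matrix(seq_len(n_rec * n_col), nrow = n_rec, ncol = n_col, byrow = TRUE)

# Write record by record (row-major).  t(x) puts each record in a column,
# and R stores columns contiguously, so as.vector(t(x)) is the row-major
# sequence of values.
bin <- tempfile()
writeBin(as.vector(t(x)), bin, size = 4L)

# Read everything back in one call and reshape; byrow = TRUE restores
# one record per row.
m <- readBin(bin, integer(), n = n_rec * n_col, size = 4L)
m <- matrix(m, ncol = n_col, byrow = TRUE)

identical(m, x)    # the round trip preserves the data
file.size(bin)     # n_rec * n_col * 4 bytes
```

The byrow = TRUE reshape is needed only because the records are stored row by row; if the file were written column by column, a plain matrix(m, ncol = n_col) would do.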

Remember, this doesn't read a *sample* of the data; it reads *all*
the data.  It is so much faster than the alternatives in R that it
just isn't funny.  Trying scan() on the file took nearly 10 minutes
before I killed it the other day; using readBin() is a thousand times
faster than a simple scan() call on this particular data set.

There has *got* to be a way of either generating or saving the data
in binary form, using only "approved" Windows tools.  Heck, it can
probably be done using VBA.

By the way, I've read most of the .pdf files I could find on the CRAN site,
but haven't noticed any description of the R save-file format.  Where should
I have looked?  (Yes, I know about src/main/saveload.c; I was hoping for
some documentation, with maybe some diagrams.)