[R] R tools for large files

Tue Aug 26 01:46:04 CEST 2003

Murray Jorgensen <maj at stats.waikato.ac.nz> wrote:
	"Large" for my purposes means "more than I really want to read
	into memory" which in turn means "takes more than 30s".  I'm at
	home now and the file isn't so I'm not sure if the file is large
	or not.

I repeat my earlier observation.  The AMOUNT OF DATA is easily handled
by a typical desktop machine these days.  The problem is not the amount of
data.  The problem is HOW LONG IT TAKES TO READ.  I made several attempts
to read the test file I created yesterday, and each time gave up
impatiently after 5+ minutes elapsed time.  I tried again today (see below)
and went away to have a cup of tea &c; it took nearly 10 minutes that time
and still hadn't finished.  'mawk' read _and processed_ the same file
happily in under 30 seconds.

One quite serious alternative would be to write a little C function
to read the file into an array, and call that from R.

> system.time(m <- matrix(1:(41*250000), nrow=250000, ncol=41))
[1] 3.28 0.79 4.28 0.00 0.00
> system.time(save(m, file="m.bin"))
[1] 8.44 0.54 9.08 0.00 0.00
> m <- NULL
> system.time(load("m.bin"))
[1] 11.25  0.19 11.51  0.00  0.00
> length(m)
[1] 10250000

The binary file m.bin is 41 million bytes.

This little transcript shows that a data set of this size can be
comfortably read from disc in under 12 seconds, on the same machine
where scan() took about 50 times as long before I killed it.

So yet another alternative is to write a little program that converts
the data file to R binary format, and then just read the whole thing in.
I think readers will agree that 12 seconds on a 500MHz machine counts
as "takes less than 30s".

	It's just that R is so good in reading in initial segments of a file that I
	can't believe that it can't be effective in reading more general
	(pre-specified) subsets.

R is *good* at it, it's just not *quick*.  Trying to select a subset
in scan() or read.table() wouldn't help all that much, because it would
still have to *scan* the data to determine what to skip.
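
To make the point concrete, here is a minimal sketch of reading a
pre-specified set of rows through a connection.  The helper name
read_rows() and the chunk size are assumptions made up for illustration,
not anything that exists in R itself.  Every line up to the last wanted
row still has to be read; only the numeric conversion is skipped for the
unwanted ones.

```r
# Hypothetical helper: return the requested lines of a whitespace-separated
# numeric file as a matrix, reading the file in chunks through a connection.
read_rows <- function(path, rows, chunk = 10000L) {
  rows <- sort(unique(rows))
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- character(0)
  seen <- 0L                        # number of lines consumed so far
  while (seen < max(rows)) {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break  # end of file reached early
    # positions, within this chunk, of the rows we were asked for
    idx <- rows[rows > seen & rows <= seen + length(lines)] - seen
    kept <- c(kept, lines[idx])
    seen <- seen + length(lines)
  }
  # Only the kept lines are converted to numbers.
  vals <- scan(text = kept, quiet = TRUE)
  matrix(vals, nrow = length(kept), byrow = TRUE)
}

# Small stand-in file: 100 lines of 41 numbers each.
path <- tempfile()
write(0:(41 * 100 - 1), file = path, ncolumns = 41)
sub <- read_rows(path, c(3, 50, 97), chunk = 16L)
```

Whether this beats a plain scan() on a given file would have to be timed;
it is offered only as an illustration of why skipping is not free.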

Two more timings:
An unoptimised C program writing 0:(41*250000-1) as a file of
41-number lines:
f% time a.out >m.txt
13.0u 1.0s 0:14 94% 0+0k 0+0io 0pf+0w
> system.time(m <- read.table("m.txt", header=FALSE))
^C
Timing stopped at: 552.01 15.48 584.51 0 0 

To my eyes, src/main/scan.c shows no signs of having been tuned for speed.
The goals appear to have been power (the R scan() function has LOTS of
options) and correctness, which are perfectly good goals, and the speed
of scan() and read.table() with modest data sizes is quite good enough.

The huge ratio (>552)/(<30) for R/mawk does suggest that there may be
room for some serious improvement in scan(), possibly by means of some
extra hints about total size, possibly by creating a fast path through
the code.
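
For what it's worth, scan() and read.table() do already accept some hints
of this kind: 'what' and 'n' in scan(), and 'colClasses', 'nrows' and
'comment.char' in read.table().  The sketch below shows how they are
passed, on a small stand-in file (the 1000-row size is an assumption
chosen to keep the example quick).  I have not timed them on the
250000-line file, so how much of the gap they close is an open question.

```r
# Small stand-in for the 250000 x 41 test file.
nr <- 1000L
nc <- 41L
path <- tempfile(fileext = ".txt")
vals <- 0:(nr * nc - 1L)
# write() puts 'ncolumns' consecutive values on each output line.
write(vals, file = path, ncolumns = nc)

# scan(): 'what' fixes the type, 'n' caps the number of values read.
x <- scan(path, what = integer(), n = nr * nc, quiet = TRUE)
m <- matrix(x, nrow = nr, ncol = nc, byrow = TRUE)

# read.table(): column types and row count declared up front,
# comment-character processing switched off.
d <- read.table(path, header = FALSE, colClasses = "integer",
                nrows = nr, comment.char = "")
```

Wrapping either call in system.time() on the real file would show whether
the hints matter on a given machine.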

Of course the big point is that however long scan() takes to read the
data set, it only has to be done once.  Leave R running overnight and
in the morning save the dataset out as an R binary file using save().
Then you'll be able to load it again quickly.
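
A minimal sketch of that read-once, load-thereafter pattern follows.  The
helper name load_cached() and the file names are assumptions for
illustration; the object is saved under the name 'm' as in the transcript
above.

```r
# Hypothetical helper: parse the text file only if no binary copy exists yet.
load_cached <- function(txt, bin, ncol) {
  if (file.exists(bin)) {
    load(bin)                       # restores the object saved as 'm'
  } else {
    m <- matrix(scan(txt, quiet = TRUE), ncol = ncol, byrow = TRUE)
    save(m, file = bin)             # the slow path is taken only once
  }
  m
}

# Small stand-in file: 100 lines of 41 numbers each.
txt <- tempfile(fileext = ".txt")
bin <- tempfile(fileext = ".bin")
write(0:(41 * 100 - 1), file = txt, ncolumns = 41)
first  <- load_cached(txt, bin, 41)  # parses the text, writes the binary
second <- load_cached(txt, bin, 41)  # loads the binary copy
```

The helper does not notice if the text file changes after the binary copy
has been written; delete the binary file to force a fresh read.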